Harvesting Web Semantics Augmented with Social Annotations · 2008-02-04

Harvesting Web Semantics Augmented with Social Annotations

Adam Gzella, Sebastian Ryszard Kruk, Tadhg Nagle, Edward Curry, Stefan Decker, and Jarosław Dobrzanski

Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland*

<firstname.lastname>@deri.org

Abstract. The amount of information on the web is constantly growing, causing an erosion of its usefulness through information overload. The Semantic Web copes with this erosion by creating a web in which information is interlinked and can be queried effectively. But how can web information be transformed into semantically interlinked metadata? Currently there is no convincing solution to this problem; a number of methods for providing or acquiring semantics have been created, but none of them is widely adopted. Additionally, a lot of semantics is delivered by online communities in the form of tags, and there is no standard way to utilise this information either. In this paper we introduce a standard process for harvesting web semantics. The process employs existing methods to collect or acquire data from web resources; it also enriches the gathered information with semantics from social annotations. IKHarvester is a prototype that implements this process to gather as much semantics as possible from a given source. As a result, IKHarvester creates and delivers interlinked, semantic data enriched with community annotations and accessible to other applications.

Keywords: semantics, harvesting, information extraction, social annotations

1 Introduction

With the web rapidly evolving, semantics is set to become the next big technological revolution on the horizon [5]. Demand for Semantic Web technologies is growing, but they cannot be introduced without a sufficient amount of semantically interlinked metadata. There are many possible sources of semantics on the Web; it is still, however, impossible to fully exploit the potential behind those data. Moreover, there is a need to understand the broad range of semantic sources that are currently available throughout the web, and a need to describe a process that captures the semantics provided by this broad range. For instance, the lack of functionality for harvesting tags and annotations from user communities (see Def. 1) and integrating them with formal semantic descriptions (see Def. 2) is but one example of the fragmentation of semantic sources that currently exists. IKHarvester, the prototype described in this paper, was created to solve such problems and those connected with the lack of semantically interlinked metadata. It does so by harvesting web semantics from a wide range of sources (including those that do not expose any explicit metadata) and combining them with contributions from users and communities.

* This material is based upon works supported by Enterprise Ireland under Grant No. ILP/05/203 and by Science Foundation Ireland Grant No. SFI/02/CE1/I131. Jarosław Dobrzanski was a student of Gdansk University of Technology at the time of writing the original contribution to this article. The authors thank all members of the DERI community for fruitful discussions on this project.

Definition 1. Soft semantics are non-authoritative metadata, usually contributed by online communities, very often in the form of tags.

Definition 2. Hard semantics are source-driven, authoritative, highly interlinked metadata; they are usually represented using RDF or Topic Maps.

2 Motivations

The motivations of web users come in many forms, from reconnecting with old friends over social networks to the serendipitous discovery of new knowledge during an afternoon surfing session. One common motivation is the need to research a particular topic for either work or pleasure. The traditional approach to this task followed the familiar process of hunting for web resources or pages via search engines and web directories. Once users identified a valuable or relevant resource, they chose either to bookmark it for future reference or to cut and paste (i.e. harvest) the relevant information into a saveable medium such as a text editor.

With Web 2.0 the process of resource harvesting was enhanced by bookmark services, such as del.icio.us, that simplify the process of storing and tagging discovered resources. Within a del.icio.us-type service, information is categorised using a non-hierarchical folksonomy where users tag their bookmarks with freely chosen keywords. Relationships between resources are determined based on the similarity of these arbitrarily chosen keywords. However, the keywords used to describe a resource are highly dependent on the frame of mind of the user. While this approach is very effective when information is harvested in the same context, building relationships between relevant content described using different terms and in different contexts is difficult. As an illustration of this limitation we present the challenges faced by knowledge workers, such as a newspaper reporter, tasked with researching very different but often related topics over an extended time period.

In a typical week Sean, a reporter for an international current affairs magazine, needs to research and write an average of two articles. On a daily basis Sean discovers and bookmarks multiple resources for the articles he is currently writing. Sean would like to be able to retrieve the resources he has bookmarked for his current article and to see any potential relationships to resources he has previously harvested. For example, Sean must compile a report on the role of computer visionaries in shaping the world we live in. While researching this report he spent many hours carrying out detailed searches to find relevant web content from a number of resources including web pages, Wikipedia articles, blog posts, and resources from digital libraries. Sean also has resources from a past article on famous Fortune 100 CEOs. Sean bookmarked the results of both articles using a tagged bookmark service such as del.icio.us (see Fig. 1). He has enriched the resource descriptions by tagging them; in other words, he has added some soft semantics. What is significant here is that the documents have been bookmarked in different contexts, so resources about the same person may not share even a single tag and cannot be associated. Sean is not perfectly happy: he knows that the articles contained much more information about each person. A tool able to harvest hard semantics from the documents would provide Sean with more complete information. It would be full of valuable interconnections that can be used to create relationships between resources (see Fig. 2). Using such information Sean can, for example, reason that the inventor of the iPod and the CEO of Apple are the same Steve Jobs. And this is just the beginning: semantically interlinked data provides many more possibilities than simple keywords [11]. In the future Sean will find it easier to reuse information and to discover new facts.

Fig. 1. Resources bookmarked with a social bookmarking tool in two different contexts

Fig. 2. Semantic interconnections that might be harvested from bookmarked resources

Soft semantics are important, as they allow a community to easily describe resources in various contexts. But complementing them with hard, Semantic Web based semantics, which add vital relationships between resources (see Fig. 3), creates many more possibilities, ranging from simple reasoning to complex queries.

To summarise, our motivation was to ease the process of acquiring information about a given topic. This can be done using a Semantic Web approach, but only with well annotated and interconnected resource data. As we showed above, the way to obtain such information is to capture both soft and hard semantics from resources. In order to do so we must first investigate ways of finding and harvesting semantic information from the web. To this end we present a blended approach to semantics harvesting, implemented through the IKHarvester prototype.

Fig. 3. Interconnected, well annotated resources - effect of combining soft and hard semantics

3 Semantics on the Web

Early research and development on the Semantic Web concentrated on delivering specifications of new languages capable of capturing the meaning of information. Further development, continuing until today, brought more user-oriented applications. Not much thought, however, has been put into the question "where will the semantic data come from?"

Creating semantics is more complex and requires more effort than creating a simple HTML page. To bootstrap the adoption of the Semantic Web in mass web development a number of solutions have been proposed. These solutions range from linking RDF documents from HTML pages, to embedding concepts into HTML pages, to transforming HTML pages to RDF, to performing natural language processing on a non-semantic source.

RDF, apart maybe from Topic Maps, is the key standard for representing the most simple semantics. RDF identifies the web resource being described through the notion of URIs. There is, however, no single way of expressing that a given web resource is annotated with a given part of the RDF graph.

We can classify the solutions for acquiring semantics related to web resources into three high-level categories (see Fig. 4). Each category contains one or more types of solutions:

– Providing semantics together with web resources; it is the creator's choice to deliver semantics together with the web resource.

– Acquiring semantics indirectly from the web resources; semantics which are part of the resource but are not in a formal representation can be extracted. Alternatively, an external web service can be called to acquire semantics about the given resource.

– Social (soft) semantics are those delivered by Web 2.0 applications; their meaning is not always strictly defined, but they form an important part of the web resource annotations.

Fig. 4. Comprehensive classification of sources of semantics

In this section we briefly describe each of the types of solutions for acquiring semantics for web resources.

Linking Semantics from Web Resources Probably the easiest way to attach semantics to a web page is by using a standard <link>1 element. Users can prepare a document with an RDF description of the given web resource. Although the binding procedure itself is very simple, there are two problems with this solution: (1) The creator of the web resource has to know RDF to create the description. (2) The description is related to the whole document only.

This solution has, however, been successfully adopted by the SIOC [4] exporters for popular blogging and forum platforms and by SemMediaWiki [15], the popular Semantic Web extension to the MediaWiki engine.
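The lookup of linked RDF documents can be illustrated with a short sketch. It is only an illustration, not the IKHarvester implementation: the class and function names are invented here, and it assumes that RDF documents are advertised with the MIME type application/rdf+xml, as in the footnote example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RdfLinkFinder(HTMLParser):
    """Collects the targets of <link> elements that advertise RDF/XML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # Assumption of this sketch: RDF is advertised via this MIME type.
        mime_type = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if mime_type == "application/rdf+xml" and href:
            # Resolve relative references against the page URL.
            self.rdf_links.append(urljoin(self.base_url, href))


def find_rdf_links(html, base_url):
    """Return absolute URLs of RDF documents linked from the given HTML."""
    finder = RdfLinkFinder(base_url)
    finder.feed(html)
    return finder.rdf_links


page = (
    '<html><head><link rel="meta" title="RDF" '
    'type="application/rdf+xml" href="/myrdf.rdf"/></head></html>'
)
links = find_rdf_links(page, "http://myblog.net/post/1")
print(links)
```

The sketch only discovers the RDF document; fetching and parsing it would be a separate step.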

Semantics Embedded in Web Resources Embedding a semantic description in a web resource can have two advantages compared to using the LINK element: (1) It is possible to describe a specific part of the resource. (2) In some cases the creator of the web resource does not have to understand full-fledged semantic technologies, such as RDF.

1 for example: <link rel="meta" title="RDF" type="application/rdf+xml" href="http://myblog.net/myrdf.rdf"/>

There have been a few ways of embedding semantics. One of them is Microformats [14], which approach semantics from a very focused perspective. This solution expresses certain types of information, e.g., calendar data or social networks, using standard HTML elements. A similar but more flexible solution is Embedded RDF [7]. The creator of a web resource can reference the metadata definitions used in the document with the LINK element. Each property used to annotate a portion of a web resource can be introduced using the rel, rev, and class attributes, and the META element. A slightly different approach is presented by the RDF/A specification [2]. Instead of limiting the expressiveness of RDF when embedding it with the standard HTML specification, RDF/A introduces new attributes that help to express the full meaning inside HTML documents.

The variety of solutions for embedding semantics into HTML and XML can grow; therefore, the W3C introduced a standard mechanism for declaring known solutions. GRDDL [6] is a mechanism for Gleaning Resource Descriptions from Dialects of Languages. GRDDL defines a markup, based on existing standards, for declaring that a given XML or HTML resource includes RDF information. GRDDL also links to algorithms, e.g., XSLT documents, that can extract RDF from the referencing XML document. In particular, the aforementioned solutions (Microformats, eRDF, and RDF/A) have their standard GRDDL-compatible solutions for extracting semantics.
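To make the idea of embedded semantics concrete, the following sketch turns rel attributes into triples. It is a heavily simplified illustration loosely modelled on the Embedded RDF style of declaring vocabularies with LINK elements; the "schema.prefix" declaration and the "prefix-name" property form are assumptions of this sketch, and the class name is invented. It is not a conforming eRDF, Microformats, or RDF/A parser.

```python
from html.parser import HTMLParser


class RelTripleExtractor(HTMLParser):
    """Builds (subject, predicate, object) triples from rel attributes."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri
        self.namespaces = {}  # prefix -> namespace URI
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = attributes.get("rel") or ""
        href = attributes.get("href")
        if not href:
            return
        if tag == "link" and rel.startswith("schema."):
            # Assumed convention: <link rel="schema.foaf" href="..."> declares a prefix.
            self.namespaces[rel[len("schema."):]] = href
        elif tag == "a":
            for token in rel.split():
                prefix, _, name = token.partition("-")
                # Only tokens with a declared prefix become triples.
                if name and prefix in self.namespaces:
                    predicate = self.namespaces[prefix] + name
                    self.triples.append((self.page_uri, predicate, href))


html = (
    '<head><link rel="schema.foaf" href="http://xmlns.com/foaf/0.1/"></head>'
    '<body><a rel="foaf-homepage" href="http://example.org/">home</a>'
    '<a rel="nofollow" href="http://example.org/x">x</a></body>'
)
extractor = RelTripleExtractor("http://example.org/page")
extractor.feed(html)
print(extractor.triples)
```

Links whose rel value does not use a declared prefix (here "nofollow") are ignored, which mirrors the point that only explicitly annotated portions of a page carry semantics.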

Semantics Controlled by CMS Many current web sites are not created manually but rendered by content management systems (CMS). The plethora of CMS solutions makes it almost impossible to write exporter plugins for all of them, as the SIOC group did for the most popular social publishing engines. There are, however, solutions which are being adopted by various CMSes:

– SPARQL endpoint Exposing the semantics through a W3C standard SPARQL endpoint.

– REST API An approach popularized by Web 2.0 solutions. It can also be used to expose access to semantic-enabled services [12]. It usually allows information to be modified, which is not possible with SPARQL.

– HTTP Negotiation HTTP content negotiation [3] provides a solution, based on standard HTTP specifications, to publish RDF and a visual representation under one URL. Based on the Accept header in the HTTP request the server can decide where it should redirect the client: to an RDF document or to an HTML web page.
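The server-side decision made during content negotiation can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the two target URLs are invented, only the q parameter of the Accept header is interpreted, and media-type wildcards other than */* are not handled.

```python
def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    entries = []
    for part in header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        quality = 1.0  # default quality when no q parameter is given
        for param in pieces[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        entries.append((media_type, quality))
    # sorted() is stable, so equal q values keep the order of the header.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_redirect(accept_header, rdf_url, html_url):
    """Pick the representation to redirect to; HTML is the fallback."""
    for media_type, quality in parse_accept(accept_header):
        if quality <= 0:
            continue
        if media_type == "application/rdf+xml":
            return rdf_url
        if media_type in ("text/html", "application/xhtml+xml", "*/*"):
            return html_url
    return html_url


RDF_URL = "http://example.org/data/resource.rdf"    # hypothetical
HTML_URL = "http://example.org/page/resource.html"  # hypothetical

print(choose_redirect("application/rdf+xml", RDF_URL, HTML_URL))
print(choose_redirect("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8", RDF_URL, HTML_URL))
```

A semantic client that sends "Accept: application/rdf+xml" is sent to the RDF document, while a typical browser header leads to the HTML page.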

Transformation of the Content of the Web Resource In the solutions mentioned so far it was the choice of the creator of either the page or the platform to provide the metadata in the web documents. Very often creators of pages or CMSes do not see a need to provide access to the semantics; sometimes they are unwilling to do so (e.g. the Blogger platform).

Popular sources of information usually have a similar structure (for example, Wikipedia articles are similar in terms of structure). So it is possible to create tools that are able to translate the content of a web resource into semantics. A few such solutions have been delivered. Probably the most interesting ones are those supported by the SIMILE2 project. The RDFizers3 package provides a number of tools that convert popular information sources, e.g., email, BibTeX, Java source, to RDF. With Solvent4 one can easily deliver JavaScript code with XPath definitions that can be used by PiggyBank5 to turn a given type of resource into RDF. Similar solutions are offered by MarcOnt Mediation Services [16], where translation rules to and from legacy formats, e.g., BibTeX or MARC21, are defined in the process of creating the ontology.

Semantic Services One of the interesting features of RDF is that it assumes open world reasoning, so the information about a resource can be distributed across the Web. This poses new problems. Not only is getting the semantics related to a resource cumbersome, but trying to get all or most of them becomes almost impossible. That is where interaction with other semantic services comes in handy. There are not too many of them at this moment, though. DBPedia6 provides a SPARQL endpoint to information extracted from Wikipedia articles. Swoogle7 indexes RDF and OWL documents accessible on the Internet. It can be very useful for understanding unknown ontology concepts used in the semantics acquired for the given web resource. Sindice [20] is a service that finds other sources delivering semantics about a given URI. Finally, PingTheSemanticWeb8 is a service that gives access to semantics about a given resource, submitted by their creators or by exporting plugins.

Social Services Social services, such as del.icio.us or Flickr, became popular together with Web 2.0. They allow users to easily annotate various types of web resources with simple keywords, or tags. The tagging annotations, so-called soft semantics, are in our opinion a very important source of semantics for a resource; in some cases, e.g., photos, they might be the only one. Most of these services provide a REST API which gives access to the soft semantics.

4 A Process for Acquiring Semantics on the Web

In the previous section we introduced a comprehensive classification of the ways in which semantic information about certain web resources can be acquired. In this section we present how each of these solutions can be utilized to amass as much semantics, both hard and soft, about resources as possible.

We have divided the process of collecting semantics into three phases (see Fig. 5): (1) Extracting semantics from web resources. (2) Querying external services for semantics. (3) Augmenting with social annotations.

2 Simile project: http://simile.mit.edu
3 RDFizers: http://simile.mit.edu/wiki/RDFizers
4 Solvent: http://simile.mit.edu/wiki/Solvent
5 PiggyBank: http://simile.mit.edu/wiki/Piggy_Bank
6 DBPedia: http://dbpedia.org/
7 Swoogle: http://swoogle.umbc.edu/
8 Ping the Semantic Web: http://pingthesemanticweb.com/

Fig. 5. Process of collecting semantics

4.1 Extracting Semantics from Web Resources

This phase attempts to acquire semantics directly from the given resource. The first step is to check for existing, ready-to-use RDF documents attached with the <link> element. The next step is to recognize any type of semantics embedded in the web resource, such as microformats, eRDF, RDF/A, etc. Then the GRDDL service is used to check whether the site is GRDDL-enabled.

The next step is to recognize REST API enabled services, such as FOAFRealm. Finally, there is an attempt to recognize whether a special extraction process has been defined for the given resource. This may be the case for the most popular sources of information that do not provide semantics in any of the aforementioned ways (e.g. Wikipedia, Blogger).

4.2 Query for Semantics

Because of one of the key features of RDF, the open world assumption, fragments of the RDF graph can be published on other services on the Web (different from the one that has been given). In order to aggregate the semantics about the given resource from as many sources as possible we have to query other services as well. The process contacts PingTheSemanticWeb.com to retrieve semantics submitted by other sources. Next, the Swoogle service is contacted to retrieve information about unknown ontology concepts used in the current description of the resource. Finally, Sindice is called to check for other potential sources of semantics about the given resource. These sources are later queried for RDF using HTTP GET.

4.3 Augmenting Semantics with Social Annotations

Earlier (see Sec. 3) we stressed the importance of the soft semantics delivered by the community of users. People are able to better understand information they provided themselves, or that has been provided by people they know (and who most likely think alike). This observation is one of the key reasons for the success of Web 2.0 services. Tagging information (soft semantics) is, however, not interconnected enough to unleash the full potential of Semantic Web technologies. Nevertheless, in our opinion, it should not be neglected in the process of acquiring semantics about given web resources. So at the end of the harvesting process all the soft semantics provided by users are attached to the graph gathered in the previous stages.
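The three phases described in this section can be sketched as a small pipeline. This is an illustrative sketch only: triples are modelled as plain tuples, the extractor and service functions are stand-ins that return fixed data instead of contacting the real services, and the TAGGED_WITH predicate is a hypothetical placeholder rather than a term of any ontology used by IKHarvester.

```python
# Hypothetical predicate used only in this sketch to attach a plain tag.
TAGGED_WITH = "http://example.org/vocab#taggedWith"


def harvest(uri, extractors, query_services, user_tags):
    """Run the three phases for one URI and return a set of triples."""
    graph = set()

    # Phase 1: extract semantics directly from the resource.
    for extract in extractors:
        graph.update(extract(uri))

    # Phase 2: ask external services what else they know about the URI.
    for query in query_services:
        graph.update(query(uri))

    # Phase 3: attach the soft semantics (tags) supplied by users.
    for tag in user_tags:
        graph.add((uri, TAGGED_WITH, tag))

    return graph


# Stand-ins for real extractors and services; they return fixed triples.
def linked_rdf_extractor(uri):
    return [(uri, "http://purl.org/dc/elements/1.1/title", "Steve Jobs")]


def semantic_service_stub(uri):
    return [(uri, "http://www.w3.org/2002/07/owl#sameAs",
             "http://dbpedia.org/resource/Steve_Jobs")]


graph = harvest(
    "http://en.wikipedia.org/wiki/Steve_Jobs",
    extractors=[linked_rdf_extractor],
    query_services=[semantic_service_stub],
    user_tags=["apple", "ceo"],
)
print(len(graph))
```

Using a set for the graph means that a triple reported by several sources is stored only once, which is convenient when the same statement is found both in the resource and at an external service.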

5 The architecture of the IKHarvester

The IKHarvester service delivers a prototype implementation of the architecture for acquiring hard and soft semantics for web resources described in the previous section (see Sec. 4). It delivers three core features: harvesting semantics for web resources, augmenting the semantics of web resources with social annotations, and providing access to the aggregated semantics to other services, including eLearning frameworks. It benefits from the Semantic Web principle that demands rich descriptions of resources to be available online. IKHarvester was built so that the infrastructure can easily be extended with support for new types of semantics-acquiring solutions. The current implementation (see Fig. 6) consists of five main components:

1. Two racks for harvesting plugins: (a) one for extracting semantics, (b) a second for querying for semantics.

2. A component for aggregating social annotations, which consists of a bookmarking part and a tagging part; the latter is the contribution of IKHarvester to the Social Semantic Collaborative Filtering (SSCF) infrastructure.

3. An RDF storage (currently Sesame), which stores the semantics, both soft and hard, acquired by both aforementioned components.

4. A REST API, which provides access to the services delivered and to the semantics amassed by IKHarvester. It is built to post and query semantics and also to access them in different formats9.

5. A set of extensions for various web browsers, ranging from a plugin for Firefox, to a package for Internet Explorer, to generic browser buttons.

5.1 Rack for Harvesting "Blades"

Fig. 6. Architecture of IKHarvester

The current version of IKHarvester delivers support for SIOC exporters, Microformats, GRDDL, MediaWiki, and Blogger. IKHarvester also provides implementations of the plugins for querying semantic services, such as DBPedia and PingTheSemanticWeb.com. Each of the harvesting plugins has to implement a simple Java interface (DataHarvester), which defines a harvestMetadata() method. The plugins are called one by one, and each delivers as much semantics as it can recognize.
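The plugin rack can be illustrated with a short sketch. The original interface is written in Java and its exact signature is not given here, so the Python rendering below is an assumption: the parameters of harvest_metadata, the HarvesterRack class, the two example blades, and the example.org vocabulary term are all invented for illustration.

```python
from abc import ABC, abstractmethod

DC_TITLE = "http://purl.org/dc/elements/1.1/title"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical class URI, used only in this sketch.
WIKI_ARTICLE = "http://example.org/vocab#WikipediaArticle"


class DataHarvester(ABC):
    """Python stand-in for the Java DataHarvester interface."""

    @abstractmethod
    def harvest_metadata(self, uri, content):
        """Return a list of (subject, predicate, object) triples."""


class TitleHarvester(DataHarvester):
    """Reads the <title> element of an HTML page."""

    def harvest_metadata(self, uri, content):
        start = content.find("<title>")
        end = content.find("</title>")
        if start == -1 or end == -1 or end < start:
            return []
        title = content[start + len("<title>"):end].strip()
        return [(uri, DC_TITLE, title)]


class WikipediaHarvester(DataHarvester):
    """Recognizes Wikipedia article URLs; ignores everything else."""

    def harvest_metadata(self, uri, content):
        if "wikipedia.org/wiki/" not in uri:
            return []
        return [(uri, RDF_TYPE, WIKI_ARTICLE)]


class HarvesterRack:
    """Calls every registered blade in turn and merges the results."""

    def __init__(self):
        self._blades = []

    def register(self, blade):
        self._blades.append(blade)

    def harvest(self, uri, content):
        triples = []
        for blade in self._blades:
            for triple in blade.harvest_metadata(uri, content):
                if triple not in triples:  # skip duplicates across blades
                    triples.append(triple)
        return triples


rack = HarvesterRack()
rack.register(TitleHarvester())
rack.register(WikipediaHarvester())
result = rack.harvest(
    "http://en.wikipedia.org/wiki/Semantic_Web",
    "<html><head><title>Semantic Web</title></head></html>",
)
print(result)
```

A blade that does not recognize the resource simply returns an empty list, so adding support for a new source only requires registering one more class.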

9 e.g., the LOM standard recognized by most Learning Management Systems (LMS)

5.2 Plugin for Web Browsers

IKHarvester delivers three types of web browser plugins. The primary goal of these plugins is to ease the process of submitting a page the user has stumbled upon to the IKHarvester repository (for future retrieval).

For Firefox, a plugin detects supported pages; in the case of Wikipedia, it even integrates with the content of the navigation links. For other browsers IKHarvester provides a set of buttons which may be placed by the user in a browser toolbar. In the case of Internet Explorer those buttons can be added with a special installer.

5.3 Annotation Services

To complete the harvesting process, according to Sec. 4.3, we need user input. IKHarvester allows any user to submit a URI for harvesting. Users are, however, encouraged to register with the service that hosts IKHarvester and provide social annotations along with the harvested hard semantics. Users can choose between two ways of annotating. The first is to provide simple keywords, with a user interface similar to the one created by del.icio.us. IKHarvester uses a tagging ontology [10] to bind tags to users and to the web resources being annotated.
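How a tagging act can bind a tag to both a user and a resource is sketched below. The vocabulary is hypothetical: the example.org namespace and the property names (Tagging, taggedBy, taggedResource, associatedTag) are placeholders chosen for this illustration and are not taken from the tagging ontology cited above.

```python
import hashlib

# Hypothetical namespace and terms, used only in this sketch.
TAGGING_NS = "http://example.org/tagging#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def tagging_triples(user_uri, resource_uri, tags):
    """Describe each (user, resource, tag) combination as one tagging node."""
    triples = []
    # Normalise tags so that "Apple" and "apple " count as the same keyword.
    unique_tags = sorted({tag.strip().lower() for tag in tags if tag.strip()})
    for tag in unique_tags:
        key = f"{user_uri}|{resource_uri}|{tag}".encode("utf-8")
        node = TAGGING_NS + "tagging-" + hashlib.sha1(key).hexdigest()[:12]
        triples.extend([
            (node, RDF_TYPE, TAGGING_NS + "Tagging"),
            (node, TAGGING_NS + "taggedBy", user_uri),
            (node, TAGGING_NS + "taggedResource", resource_uri),
            (node, TAGGING_NS + "associatedTag", tag),
        ])
    return triples


triples = tagging_triples(
    "http://example.org/users/sean",
    "http://en.wikipedia.org/wiki/Steve_Jobs",
    ["Apple", "apple ", "CEO"],
)
print(len(triples))
```

Representing the tagging as its own node, rather than as a direct link from resource to keyword, keeps track of who assigned which tag, so the same keyword used by two users stays distinguishable.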

Another option is to bookmark a given resource. To do so we have integrated IKHarvester with the Social Semantic Collaborative Filtering service [17]. With SSCF users can create folders with bookmarks, annotate them using controlled vocabularies (WordNet, ACM, DDC, etc.), and share them with their friends. Along the way, a community-based taxonomy is established, with a customized view, different for each user. A resource can thus be placed in such a hierarchy, and IKHarvester will proceed with harvesting its semantics.

5.4 REST API

IKHarvester follows the best practices of current Web 2.0 applications, and defines a REST-based API for other services to interact with it. This API defines [8] methods for submitting new resources for harvesting, removing harvested resources from the database, and listing the content of the database. The access methods also provide means for interaction with eLearning systems, such as Didaskon [21]. IKHarvester exposes semantics in the LOM standard (the manifest of the learning objects) as well as the content of the web resource (usually an abstract). Interaction with LMSes provides a proof of concept of the extensibility of the IKHarvester API.

6 IKHarvester in Use

IKHarvester, whose architecture we presented in the previous section, is currently being used in real-world solutions. In this section we briefly describe those systems and how they benefit from IKHarvester.

6.1 Successful deployments

notitio.us10 is a service for aggregating metadata-rich information from various types of social semantic information sources. Notitio.us allows users to easily discover and share their knowledge. It also provides an interesting solution for further information browsing, using either faceted navigation or tag-based filtering. IKHarvester is an important component of this system as it provides other modules with semantic metadata.

SSCF is also one of the notitio.us components, so IKHarvester can be used in two possible ways (see Sec. 5.3): by invoking the native user interface (similar to del.icio.us) or through the SSCF bookmarking interface (annotating by placing in a hierarchy).

Information gathered by IKHarvester can then be searched and browsed using another notitio.us component, TagsTreeMaps11, a tag browser based on a treemap rendering algorithm. This information also improves the user's browsing experience in another notitio.us component, MultiBeeBrowse [18], an unstructured metadata browser.

Didaskon Didaskon is a framework for automated composition of a learning path for a student. The selection and workflow scheduling of learning objects is based on their description, a semantically annotated specification of user profiles, the anticipated knowledge after course completion, and technical details of the client's platform. Through its SOA, IKHarvester is able to provide the informal knowledge that is used to construct a specific course. Our system keeps the semantic data in RDF format, so it is very easy to convert it (by providing suitable mappings) to any other format. This was done in this case: Didaskon can access well annotated, semantically rich resources in LOM12, so the resources are almost ready to be used by a learning management system.

7 Evaluation

We decided that the best way to evaluate our system is to compare it to existing solutions. Thus we start with a short presentation of relevant applications. We also present how those systems relate to IKHarvester.

7.1 State of the art

del.icio.us13 del.icio.us is a social bookmarking system. The system can substitute for browser bookmarks and it has a lot of benefits. Saved resources are accessible from any computer, and tags help users to find and organise the bookmarks. Bookmarks also become public, so it is possible to find similar and popular resources.

From this paper's perspective the most important things in del.icio.us are tags (the soft semantics) and the collaborative tagging process. The strategy of tagging without regard to categorical constraints seems like a recipe for disaster, but as the Web has shown us, you can extract a surprising amount of value from large unstructured data sets. Moreover, this kind of collaborative tagging offers an interesting alternative to current research efforts with Semantic Web ontologies [19].

10 notitio.us: http://notitio.us/
11 TagsTreeMaps: http://sf.net/projects/tagstreemaps/
12 Learning Object Metadata
13 Del.icio.us: http://del.icio.us/

PingTheSemanticWeb.com14 is a service for sharing RDF documents. Its engine looks for RDF data either in the content of the resource with the specified URL or in documents this resource links to. If such data is found, it is saved to the shared repository. PingTheSemanticWeb.com supports the FOAF, SIOC, and DOAP ontologies, and other RDF documents. The system is designed to be used by software agents. They can request from the service a list of stored RDF documents and use that information for crawling the Semantic Web.

By providing Web Services, PingTheSemanticWeb.com allows the gathering of semantic annotations for online resources in a shared space. This information can be used, for instance, by crawlers searching for a specific piece of data. However, PingTheSemanticWeb.com does not offer the possibility to browse stored data, besides viewing raw RDF documents. It does not work with non-semantic sources, like Wikipedia.

Zotero15 is an add-on for the Firefox web browser. It helps with collecting, managing, and citing research material, mainly bibliographic resources. Zotero extracts RDF embedded in XHTML documents; it works with a few standards and microformats [14]: embedded RDF, COinS, Dublin Core [9], and MARC [1].

Zotero is a powerful tool for researchers and students because it facilitates the management of bibliographic resources. However, it only reads embedded RDF; there is no support for pure RDF data, which conveys more knowledge.

SIMILE Project16 Semantic Interoperability of Metadata and Information in unLike Environments provides tools for metadata managers and common end-users. Piggy Bank [13], an add-on for Firefox, turns the browser into a mash-up platform by allowing users to capture metadata for online resources and mix them together. Collected data can be stored locally, tagged, searched, browsed, and sent to a public repository.

Piggy Bank is capable of reading whole RDF documents that a web page links to. Moreover, although it does not support non-semantic web pages itself, it is possible to write "screen scrapers" that can do that. There are also some limitations in browsing gathered data, as it can be done only in the built-in RDF browser (Longwell17), and access to raw RDF data is hindered.

7.2 Evaluation

We showed in previous sections that IKHarvester is mainly a background application which serves other systems with semantically interlinked data. Hence the evaluation of IKHarvester differs slightly from a standard one. We decided to evaluate it by comparing it to Piggy Bank (see Sec. 7.1), the application most similar to IKHarvester. Zotero is focused on grabbing information about publications, so it covers only a small, specific part of the Web, while the focus of PingTheSemanticWeb.com is to cooperate with crawlers and software agents.

14 PingTheSemanticWeb.com: http://pingthesemanticweb.com/
15 Zotero: http://www.zotero.org/
16 SIMILE Project: http://simile.mit.edu/
17 Longwell: http://simile.mit.edu/wiki/Longwell

Evaluation metrics We decided to compare the performance and the quality of the data harvested with the two aforementioned solutions. As the metric for performance we selected the average time for gathering and tagging the data from a web page. The metrics for quality are more complex. We decided to count the number of triples harvested by each solution and to measure how the gathered data are interlinked. While the first metric is quite straightforward, the latter needs some explanation. As stated in the motivation (see Sec. 2), the Semantic Web can reach its full potential only when information is connected. This is possible with interconnections in an RDF graph. That is why we finally decided to check the average number of links to other information in the harvested data.
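The two quality metrics can be made concrete with a small sketch. The exact counting rules used in the evaluation are not spelled out, so the definition below is one plausible reading and an assumption of this sketch: a triple counts as an interconnection when its object is a URI different from its subject. Triples are modelled as plain tuples, and any object string starting with http:// or https:// is treated as a URI, which is a simplification.

```python
def is_uri(value):
    """Simplified check: treat http(s) strings as URI references."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))


def quality_metrics(graphs):
    """Average triple and interconnection counts over per-page graphs."""
    if not graphs:
        return {"avg_triples": 0.0, "avg_interconnections": 0.0}
    total_triples = 0
    total_links = 0
    for graph in graphs:
        unique = set(graph)  # ignore duplicate statements within a page
        total_triples += len(unique)
        total_links += sum(
            1 for (subject, _predicate, obj) in unique
            if is_uri(obj) and obj != subject
        )
    return {
        "avg_triples": total_triples / len(graphs),
        "avg_interconnections": total_links / len(graphs),
    }


A = "http://example.org/a"
B = "http://example.org/b"
C = "http://example.org/c"
page_one = [(A, "http://example.org/title", "x"), (A, "http://example.org/knows", B)]
page_two = [
    (B, "http://example.org/type", C),
    (B, "http://example.org/sameAs", B),   # self-link, not counted
    (B, "http://example.org/name", "y"),
    (B, "http://example.org/name", "y"),   # duplicate, counted once
]
metrics = quality_metrics([page_one, page_two])
print(metrics)
```

Under this reading a graph made only of literal-valued properties such as title or creator would score zero interconnections, however many triples it contains.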

Evaluation process With respect to time, we selected 30 web pages18 and performed the harvesting process on each of them using the two aforementioned tools. To each page we assigned 3 tags. With Piggy Bank, to make the data public as they are in IKHarvester, each page needed to be published in a public bank19. With respect to the quality of the harvested data, we selected 10 web pages: 4 blog posts (one from the Blogger platform and 3 with SIOC support), 5 Wikipedia articles, and one digital library (JeromeDL) resource. We assumed that this set would be quite common in the process of finding information about some topic. We realise that such an assumption may be debatable, especially with respect to blogs and SIOC support, but in this case it is not relevant for the comparison, as both engines can utilise such information.

Table 1. Evaluation results

Metric            IKHarvester   Piggy Bank
Time (sec)        16.4          15.3
No. of triples    530           260
Interconnections  263           81

Evaluation results The evaluation results (see Table 1) show that IKHarvester performed better on the given set of pages. While the difference in performance is very small (IKHarvester was about one second slower per page), there is a big gap in the quality measurements. Our solution was able to scrape about twice as many triples. More importantly, the data delivered by IKHarvester were richer in interconnections. Piggy Bank data were generally based only on properties scraped from HTML (like title and creator).
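The relative differences implied by Table 1 can be worked out directly from the reported figures. The short calculation below uses only the six values from the table; the variable names are chosen for this sketch, and the ratio of interconnections to triples is an extra derived figure that assumes both rows were measured on the same basis.

```python
results = {
    "IKHarvester": {"time_sec": 16.4, "triples": 530, "interconnections": 263},
    "Piggy Bank": {"time_sec": 15.3, "triples": 260, "interconnections": 81},
}
ik = results["IKHarvester"]
pb = results["Piggy Bank"]

# Relative time overhead of IKHarvester compared with Piggy Bank.
time_overhead = (ik["time_sec"] - pb["time_sec"]) / pb["time_sec"]

# How many times more triples and interconnections IKHarvester delivered.
triple_ratio = ik["triples"] / pb["triples"]
link_ratio = ik["interconnections"] / pb["interconnections"]

# Share of triples that are interconnections, per tool (derived figure).
ik_link_share = ik["interconnections"] / ik["triples"]
pb_link_share = pb["interconnections"] / pb["triples"]

print(f"time overhead: {time_overhead:.1%}")
print(f"triples: x{triple_ratio:.2f}, interconnections: x{link_ratio:.2f}")
print(f"link share: {ik_link_share:.0%} vs {pb_link_share:.0%}")
```

In other words, for roughly 7% more time IKHarvester delivered about twice as many triples and a little over three times as many interconnections on this set of pages.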

18 We selected pages supported by the current implementation of IKHarvester; this should not affect the results, as the time needed to gather information does not depend on the type of resource in either application.

19 We used the standard one: http://simile.mit.edu/bank/

Fig. 7. Comparison of harvesting solutions features.

Non-measurable evaluation results We have examined the differences between harvesting solutions (see Fig. 7) in many respects. One of the most important differences between IKHarvester and Piggy Bank (the second system that we tested in depth) is that the latter, although it provides very rich built-in faceted browsing, works only with the Firefox browser. An advantage of IKHarvester is that it is generally browser independent.

The second issue is that Piggy Bank does not offer an easy way of accessing the raw RDF information that has been gathered in either a local or a public database. Hence, reusing the data outside the Longwell browser may be problematic. This may follow from the objectives behind the tool; still, we should point out that IKHarvester provides an easy, REST-based interface for accessing raw RDF data and utilising it in any way.

8 Conclusions and Future Work

In this paper we have shown that in order to benefit from the Semantic Web it is important to have heavily interlinked metadata. Additionally, these metadata should come both from social annotations, which we call soft semantics, and from the web resource itself (hard semantics). We have also presented a comprehensive classification of sources of both types of web semantics. Based on this classification, we have proposed a harvesting process that tries to utilise each of the available sources. We have then introduced IKHarvester, a prototype that implements this process, and described its architecture and successful deployments. Finally, we have presented the results of the evaluation of the harvesting process, which showed that IKHarvester can be a good source of web semantics.

Our future work will concentrate on delivering support for new potential sources of web semantics. We will also focus on the sources that do not expose semantics explicitly. One of the main design goals of IKHarvester was to make it extensible; we hope to test this goal by delivering support for many more types of sources of web semantics in the future.

References

1. MARC (MAchine Readable Cataloging) and SGML/XML. http://xml.coverpages.org/marc.html, July 2002.
2. B. Adida and M. Birbeck. RDFa primer 1.0. World Wide Web Consortium, 2006.
3. C. Bizer, R. Cyganiak, and T. Heath. How to publish linked data on the web.
4. J. Breslin, A. Harth, U. Bojars, and S. Decker. Towards semantically-interlinked online communities. In Proceedings of the 2nd European Semantic Web Conference (ESWC 05), Heraklion, Greece, volume 3532, pages 500–514, June 2005.
5. E. Brockley, A. Cai, R. Henderson, E. Picciola, and J. Zhang. The next technological revolution: Predicting the technical future and its impact on firms, organizations and ourselves.
6. D. Connolly. Gleaning resource descriptions from dialects of languages (GRDDL). World Wide Web Consortium, Working Draft WD-grddl-20070302, Mar. 2007.
7. I. Davis. Introducing embedded rdf.
8. J. Dobrzanski. Social semantic information sources for elearning. Master’s thesis, WETI, Gdansk University of Technology, Poland, 2007.
9. DublinCore Initiative, http://dublincore.org/documents/dces/. Dublin Core Metadata Element Set, Version 1.1: Reference Description.
10. T. Gruber. Ontology of folksonomy: A mash-up of apples and oranges. In First on-Line conference on Metadata and Semantics Research (MTSR’05), November 2005.
11. R. V. Guha, R. McCool, and E. Miller. Semantic search. In WWW, pages 700–709, 2003.
12. A. Gzella. Service oriented architecture for distributed user management system. Master’s thesis, WETI, Gdansk University of Technology, Poland, 2006.
13. D. Huynh, S. Mazzocchi, and D. Karger. Piggy bank: Experience the semantic web inside your web browser. Web Semantics, 5(1):16–27, 2007.
14. R. Khare. Microformats: The next (small) thing on the semantic web? IEEE Internet Computing, 10(1):68–75, 2006.
15. M. Kroetzsch, D. Vrandecic, and M. Voelkel. Semantic mediawiki. In Proceedings of ISWC 2006, pages 935–942, 2006.
16. S. Kruk, M. Synak, and K. Zimmermann. Marcont initiative - mediation services for digital libraries. In ECDL, 2005.
17. S. R. Kruk and S. Decker. Social Semantic Collaborative Filtering with FOAFRealm. In Semantic Desktop Workshop, ISWC 2005, 2005.
18. S. R. Kruk, A. Gzella, F. Czaja, W. Bultrowicz, and E. Kruk. Multibeebrowse - accessible browsing on unstructured metadata. In The 6th International Conference on Ontologies, DataBases, and Applications of Semantics, November 2007.
19. C. Shirky. Ontology is overrated: Categories, links and tags, 2005.
20. G. Tummarello, R. Delbru, and E. Oren. Sindice.com: Weaving the open linked data. In ISWC/ASWC, pages 552–565, 2007.
21. A. Westerski, S. R. Kruk, K. Samp, T. Woroniecki, F. Czaja, and C. O’Nuallain. E-learning based on the social semantic information sources. In Proceedings to LACLO’2006, 2006.
