Implementation of Topic Centered Portals



David Norheim, Computas AS, Norway

Robert Engels, ESIS AS, Norway

Reproduction prohibited without authorization by Computas AS ©

Motivation

The system

Challenges and lessons learned

Future work

Computas

23 years' experience in knowledge management, expert systems, and process modeling

Special focus on government and the oil and gas sector

The major semantic web company in Norway

Computas’ semantic Web activities

Sectors

• Oil and gas industry

• Government

Type of applications

• Knowledge management

• Semantic search support

• Research and commercial projects

Background

A clear shift towards open source and open standards

• Linux for Schools, Open Document formats in the public sector

• National semantic registry, Large governmental information portals based on semantic standards

• The government, through the Norwegian Archive, Library and Museum Authority (ABM-utvikling): development of open-source software, based on open standards, for creating and maintaining topic-driven portals.

”there is a need for a targeted effort to create a framework based on Semantic Web to enable professional users to organize information and to make libraries build and maintain metadata-driven search solutions.”

A digital culture and knowledge policy?

EFN.no

A topic driven portal

For a library it is as natural to evaluate, describe and enable retrieval of any resource on the web as of printed material

Quality evaluated collection of information resources organized according to some topic structure and published online.

Retrieval through search and navigation in topics

Source: Ellen Aarbakken, Oslo Public Library (Deichmanske Bibliotek)

Why a topic centric portal tool and not search?

Yahoo! provided the first subject-driven portal, but focused on the most popular aspects -> replaced by search (e.g. Google)

However, the words in the long tail are context dependent, and generic web search will frequently pollute results due to ambiguity

Example of long tail portals

• Medical information for laymen

• Primary school educational resources

• Public information for immigrants

• Legal information for laymen

• Norwegian architecture portal

Why not Web 2.0?

Folksonomies

• Collaborative “categorization”

• Freely chosen keywords

• Manual “tagging”, practically no existing metadata

• Mostly acting as a popularity measure

Topic tools

• Conceptual level with navigation

• Quality evaluated with metadata

• Manual “tagging”, but support for more automation

Sublima: SUBject oriented tool for LIbraries, Museums and Archives

Several roads to the same destination

Key requirements in developing the tool

• Handle metadata of various sources and vocabularies (e.g. Dublin Core)

• Interoperability - among portals based on the same tool and same protocols (SPARQL, SRU)

• Open source and open (semantic web) standards

• Combining free text search and navigation through models

• Handling both informal and formal models (e.g. SKOS and OWL DL) - future

Two initial portals

Scandinavian Medical Information for Laymen (SMIL) is an international Scandinavian cooperation offering quality-controlled metadata with references to pages about health, illnesses and treatments. Contributing partners to the portal are librarians and nurses from the Nordic countries. The current SMIL base consists of 8,500 records, yielding around 250,000 triples.

Detektor targets public schools. Resources are annotated by public libraries. The portal consists of about 1,850 topics and 4,600 resources, which results in about 100,000 triples.

Portal technical characteristics (grounding technologies)

Technology | Name, release | Comment
Operating system | Linux Ubuntu | Also tested on Redhat, Windows and OS X
Database | PostgreSQL (under Jena), indexing with Lucene | Should work with any storage that supports SPARQL and SPARUL
Document repository | The Web, any URLs |
Webserver | Apache Tomcat v.5.5 and 6.0 | Also tested on Resin
Applied ontology | Domain ontology and portal ontology (object types) |
Ontology language | SKOS, RDF/S | Currently implementing OWL support
Export/Import | RDF/XML, Turtle |
Reuse and interoperability | Vocabularies: DC, FOAF, SIOC, Powder, Lingvoj. Query languages: SPARQL, CQL | Also using SPARUL
Inference engine | None | Will implement an inference engine with OWL DL support in Q4 2008
Ontology editor | Internal web-based, Protégé (external) | Export the ontology and continue to work in any RDF/OWL compliant ontology editor
User interface | HTML, Apache Cocoon |
License | Open source, CDDL license |

Evaluation criteria inspired by the Esperonto project

Architecture

[Architecture diagram. Components shown: Web client; Search and navigation; SPARQL dispatcher; SPARQL queries; Local endpoint; Indexing; Topic ontology; Metadata store; Ontology maintenance; External clients; External servers; Crawler; Portal configuration; SPARQL update; Web resources]

The client consists of a search interface allowing users to search using free text and metadata search. The search string is translated into a structured SPARQL query.

Interoperability at the query layer

The system accepts queries in both SPARQL and SRU/CQL.

The backend consists of an RDF store with SPARQL interfaces. Free-text indexing uses Lucene/LARQ.

The system can query external SPARQL and SRU/CQL services.

Sublima

Ontologies generally provide the structure for navigating the results, and support browsing and classification.

Ontologies allow for term disambiguation, query rewriting and semantic distance measures.

In Sublima we use informal SKOS for

• Navigating through subjects, showing the subject relations (“fish eye”)

• Search expansion: synonyms, common misspellings

• Faceted filtering: topics as well as other metadata

A future version will also support OWL DL
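Search expansion over SKOS labels might look like the sketch below. It assumes that synonyms are stored as alternative labels (skos:altLabel) and common misspellings as hidden labels (skos:hiddenLabel), which is the usual SKOS convention but is not confirmed by the slides. The concept data and function name are made up for illustration.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for SKOS concepts. Each concept has a
# preferred label, synonyms (altLabel) and common misspellings (hiddenLabel).
CONCEPTS: List[Dict[str, List[str]]] = [
    {"prefLabel": ["influenza"], "altLabel": ["flu"], "hiddenLabel": ["influensa"]},
    {"prefLabel": ["diabetes"], "altLabel": [], "hiddenLabel": ["diabetis"]},
]

LABEL_KEYS = ("prefLabel", "altLabel", "hiddenLabel")


def expand_term(term: str, concepts: List[Dict[str, List[str]]] = CONCEPTS) -> List[str]:
    """Return the term followed by every label of each concept that carries it."""
    needle = term.strip().lower()
    expanded = [needle]
    for concept in concepts:
        labels = [label.lower() for key in LABEL_KEYS for label in concept[key]]
        if needle in labels:
            for label in labels:
                if label not in expanded:
                    expanded.append(label)
    return expanded


if __name__ == "__main__":
    print(expand_term("flu"))
```

The expanded terms can then be combined with OR in the free-text part of the query, so a user who types a misspelling still reaches resources indexed under the preferred label.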

Good and bad choices, lessons learned the hard way

• Keeping the semantics

• Living with free-text indexing and structured queries

• Tool maturity

• Scalability

Keep in mind this is NOT a research project, but one with real and demanding customers expecting everything to work

Preserving the semantics

We needed flexibility for users to add any metadata without touching code

SPARQL SELECT loses the meaning, returning only variable bindings, so clients become static: they must know in advance which properties to ask for. We therefore used SPARQL DESCRIBE extensively

DESCRIBE ?x
WHERE {
  ?lit pf:textMatch "cancer*"@en .
  ?x dc:title ?lit .
}
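Why DESCRIBE keeps clients flexible can be illustrated with a small sketch. A DESCRIBE result is a set of triples, so a client can render whatever properties come back instead of a fixed list of columns. The triple representation (plain string tuples), the function names and the example property ex:audience are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def group_by_subject(triples: Iterable[Triple]) -> Dict[str, Dict[str, List[str]]]:
    """Group triples into subject -> predicate -> list of objects."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)
    return {subject: dict(props) for subject, props in grouped.items()}


def render(resource: str, properties: Dict[str, List[str]]) -> str:
    """Render every property found, with no hard-coded list of known fields."""
    lines = [resource]
    for predicate in sorted(properties):
        for value in properties[predicate]:
            lines.append(f"  {predicate}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    described = [
        ("http://example.org/r1", "dc:title", "Cancer care"),
        # A property added later by a user shows up without any code change.
        ("http://example.org/r1", "ex:audience", "laymen"),
    ]
    for uri, props in group_by_subject(described).items():
        print(render(uri, props))
```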

Living with free-text indexing and structured queries

Indexing with respect to structure

Our “breastfeeding twins” problem

• It is not sufficient to index each literal on its own, as users expect hits on the combination of dc:title and dc:description

• And even worse: the combination of dc:title and dc:subject/skos:prefLabel
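The problem in the bullets above can be reproduced in a few lines. The sketch compares matching against each literal separately with matching against all literals of a resource combined. It uses a toy tokenizer in place of Lucene/LARQ, and the sample resources and function names are invented for illustration.

```python
import re
from typing import Dict, List, Set

Resources = Dict[str, Dict[str, str]]  # resource URI -> property -> literal text


def tokens(text: str) -> Set[str]:
    """Lower-cased word tokens; a toy stand-in for a real analyzer."""
    return {t.lower() for t in re.findall(r"\w+", text)}


def search_per_literal(resources: Resources, query: str) -> List[str]:
    """Hit only if a SINGLE literal contains all query terms."""
    wanted = tokens(query)
    if not wanted:
        return []
    return sorted(
        uri
        for uri, fields in resources.items()
        if any(wanted <= tokens(value) for value in fields.values())
    )


def search_combined(resources: Resources, query: str) -> List[str]:
    """Hit if the literals of a resource TOGETHER contain all query terms."""
    wanted = tokens(query)
    if not wanted:
        return []
    hits = []
    for uri, fields in resources.items():
        combined: Set[str] = set()
        for value in fields.values():
            combined |= tokens(value)
        if wanted <= combined:
            hits.append(uri)
    return sorted(hits)


RESOURCES: Resources = {
    "http://example.org/r1": {
        "dc:title": "Breastfeeding",
        "dc:description": "Practical advice for parents of twins",
    },
    "http://example.org/r2": {
        "dc:title": "Twins and sleep",
        "dc:description": "Sleep routines",
    },
}

if __name__ == "__main__":
    print(search_per_literal(RESOURCES, "breastfeeding twins"))
    print(search_combined(RESOURCES, "breastfeeding twins"))
```

With per-literal matching the query finds nothing, because no single title or description holds both words. The combined variant finds the first resource.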

Scoring/ranking

• Easy with SELECT, but not with DESCRIBE

• How do you rank results from a structured query?

No universal way to handle structured and unstructured information

Consistent tool maturity and missing links
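One possible heuristic for ranking resources returned by a structured query is to score them on the client side with per-field weights. This is only an illustration of the question raised above, not the approach used in the project. The weights, field names and function names are assumptions.

```python
from typing import Dict, List

Resources = Dict[str, Dict[str, str]]  # resource URI -> property -> literal text

# Assumed weights: a hit in the title counts more than one in the description.
FIELD_WEIGHTS = {"dc:title": 3.0, "dc:subject": 2.0, "dc:description": 1.0}
DEFAULT_WEIGHT = 0.5  # any other property


def score(fields: Dict[str, str], query: str) -> float:
    """Weighted count of query-term occurrences across all fields."""
    terms = query.lower().split()
    total = 0.0
    for field, text in fields.items():
        words = text.lower().split()
        weight = FIELD_WEIGHTS.get(field, DEFAULT_WEIGHT)
        total += weight * sum(words.count(term) for term in terms)
    return total


def rank(resources: Resources, query: str) -> List[str]:
    """Return matching resource URIs, best score first, ties broken by URI."""
    scored = [(score(fields, query), uri) for uri, fields in resources.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [uri for value, uri in ordered if value > 0]


if __name__ == "__main__":
    sample = {
        "a": {"dc:title": "cancer treatment", "dc:description": "about cancer"},
        "b": {"dc:description": "cancer cancer"},
        "c": {"dc:title": "diabetes"},
    }
    print(rank(sample, "cancer"))
```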

Some ”small” issues

• Lacking support for Turtle in Protégé -> needed to convert to RDF/XML

• Resources identified with URLs in Protégé

• Tools mostly geared towards one dialect of RDF/OWL

• Non-deterministic RDF/XML serialization for XSLT processing

• Lacking a binding from OWL classes to OO languages

The simple things sometimes turn out to be the hardest…

Scalability

Response time varies with store size and query complexity

• Too much complexity in queries

Moving from 500k triples to tens of millions

• Need to refactor into smaller, faster queries

• Federation of queries
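Federation of queries can be sketched as sending the same query to several endpoints in parallel and merging the answers. In this sketch the endpoints are plain Python callables standing in for HTTP calls to SPARQL services. The function names, the 10-second timeout and the merge policy (drop duplicates, skip failing endpoints) are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# An endpoint takes a query string and returns matching resource URIs.
Endpoint = Callable[[str], List[str]]


def federated_query(query: str, endpoints: Dict[str, Endpoint]) -> List[str]:
    """Run the query against every endpoint in parallel and merge the results."""
    merged: List[str] = []
    seen = set()
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in endpoints.items()}
        # Collect in name order so the merged result is reproducible.
        for name in sorted(futures):
            try:
                results = futures[name].result(timeout=10)
            except Exception:
                continue  # skip an endpoint that fails or times out
            for uri in results:
                if uri not in seen:
                    seen.add(uri)
                    merged.append(uri)
    return merged


if __name__ == "__main__":
    def local(query: str) -> List[str]:
        return ["http://example.org/a", "http://example.org/b"]

    def remote(query: str) -> List[str]:
        return ["http://example.org/b", "http://example.org/c"]

    print(federated_query("DESCRIBE ?x WHERE { ... }", {"local": local, "remote": remote}))
```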

Some good lessons

New standards (e.g. SPARQL), proposals for standardization (e.g. SPARUL), new tools (e.g. Jena), open source (e.g. Tomcat, Apache) and a lack of good documentation all say high risk!

However, the support and maintenance from the W3C community and open source developers (e.g. the Jena team) have been impressive. The support through IRC channels, mailing lists etc. has been invaluable for the project.

Some good lessons

Good experiences with reusing metadata schemas

• FOAF, Dublin Core, Powder, SKOS, SIOC, Lingvoj

Extensive dereferencing of URIs: any topic or resource URI pasted into the browser results in a DESCRIBE query for that URI.
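The mapping from a dereferenced URI to a DESCRIBE query could be as small as the sketch below. The function name and the validation rules are assumptions. The checks reject characters that are not allowed inside a SPARQL IRI reference, so a pasted URI cannot break out of the angle brackets.

```python
from urllib.parse import urlsplit

# Characters that may not appear inside <...> in a SPARQL IRI reference.
FORBIDDEN = set('<>"{}|^`\\')


def describe_query_for(uri: str) -> str:
    """Return a DESCRIBE query for an http(s) URI, or raise ValueError."""
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URI: {uri!r}")
    if any(ch in FORBIDDEN or ord(ch) <= 0x20 for ch in uri):
        raise ValueError(f"URI contains characters not allowed in an IRI: {uri!r}")
    return f"DESCRIBE <{uri}>"


if __name__ == "__main__":
    print(describe_query_for("http://example.org/topic/cancer"))
```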

Living with informal and formal ontologies

Current ontologies are modeled informally with the W3C Simple Knowledge Organization System (SKOS)

• No distinction between part-of, contains, ...

• No reasoning support

• Possible with small datasets

Sublima will also support models using formal ontologies

• Formal IS-A

• DL reasoning

• Required for large datasets

[Diagram labels: Expressivity; Reasoning; Large data sets; Smaller data sets]

Future work

• Integration with other SPARQL-based portals.

• Interoperability with ISO Topic Maps models

• Graphical visualization with touch screen, clever UIs

• High-quality multimedia resources

The code base is now in use in more projects

Conclusion

We clearly found that the currently available technology is starting to reach a certain state of maturity when it comes to functionality. BUT THERE ARE STILL RISKS!

Careful evaluation of tools and scalability is needed as content increases.

Query interoperability

Do not eat the whole menu at once!

Recording companies

Broadcasters

High quality metadata

Open metadata, e.g. Wikipedia

Thank you for your attention

david.norheim@computas.com

We welcome the chance to share experiences with you! Welcome to upcoming conferences in Norway next year

• Mid February in Oslo - hands-on tutorials

• May in Stavanger - Semantic Days focusing on the oil and gas industry

• September 2008 - initiating the Scandinavian Semantic Web Conference
