Post on 03-Feb-2016
Slide 1 Reproduction prohibited without authorization by Computas AS ©
Implementation of Topic Centered Portals
David Norheim, Computas AS, Norway
Robert Engels, ESIS AS, Norway
Slide 2
Motivation
The system
Challenges and lessons learned
Future work
Slide 3
Computas
23 years of experience in knowledge management, expert systems, and process modeling
Special focus on government and the oil and gas sector
The major semantic web company in Norway
Slide 4
Computas’ semantic Web activities
Sectors
• Oil and gas industry
• Government
Type of applications
• Knowledge management
• Semantic search support
• Research and commercial projects
Slide 5
Background
A clear shift towards open source and open standards
• Linux for Schools, Open Document formats in the public sector
• National semantic registry, Large governmental information portals based on semantic standards
• The government, through the Norwegian Archive, Library and Museum Authority (ABM-utvikling): development of open-standards-based, open-source software for creating and maintaining topic-driven portals.
“there is a need for a targeted effort to create a framework based on Semantic Web to enable professional users to organize information and to make libraries build and maintain metadata-driven search solutions.”
A digital culture and knowledge policy?
EFN.no
Slide 6
A topic driven portal
For a library it is as natural to evaluate, describe and enable retrieval of any resource on the web as it is for printed material
A quality-evaluated collection of information resources, organized according to some topic structure and published online.
Retrieval through search and navigation in topics
Source: Ellen Aarbakken, Oslo Public Library (Deichmanske Bibliotek)
Slide 7
Why a topic centric portal tool and not search?
Yahoo! provided the first subject-driven portal, but focused on the most popular aspects -> replaced by search (e.g. Google)
However, the words in the long tail are context dependent, and generic web search will frequently pollute results due to ambiguity
Examples of long-tail portals
• Medical information for laymen
• Primary school educational resources
• Public information for immigrants
• Legal information for laymen
• Norwegian architecture portal
Slide 8
Why not Web 2.0?
Folksonomies
• Collaborative “categorization”
• Freely chosen keywords
• Manual “tagging”, practically no existing metadata
• Mostly acting as a popularity measure
Topic tools
• Conceptual level with navigation
• Quality evaluated with metadata
• Manual “tagging”, but support for more automation
Slide 9
SUBject oriented tool for LIbraries, Museums and Archives
Several roads to the same destination
Key requirements in developing the tool
• Handle metadata of various sources and vocabularies (e.g. Dublin Core)
• Interoperability - among portals based on the same tool and same protocols (SPARQL, SRU)
• Open source and open (semantic web) standards
• Combining free text search and navigation through models
• Handling both informal and formal models (e.g. SKOS and OWL DL) - future
Slide 10
Two initial portals
Scandinavian Medical Information for Laymen (SMIL) is an international Scandinavian cooperation to offer quality-controlled metadata with references to pages related to health, illnesses and treatments. Contributing partners to the portal are librarians and nurses from the Nordic countries. The current SMIL base consists of 8,500 records, creating around 250,000 triples.
Detektor targets public schools. Resources are annotated by public libraries. The portal consists of about 1,850 topics and 4,600 resources, which results in about 100,000 triples.
Slide 11
Portal technical characteristics (grounding technologies)
Technology | Name, release | Comment
Operating system | Linux Ubuntu | Also tested on Red Hat, Windows and OS X
Database | PostgreSQL (under Jena), indexing with Lucene | Should work with any storage that supports SPARQL and SPARUL
Document repository | The Web, any URLs |
Web server | Apache Tomcat v.5.5 and 6.0 | Also tested on Resin
Applied ontology | Domain ontology and portal ontology (object types) |
Ontology language | SKOS, RDF/S | Currently implementing OWL support
Export/Import | RDF/XML, Turtle |
Reuse and interoperability | Vocabularies: DC, FOAF, SIOC, Powder, Lingvoj. Query languages: SPARQL, CQL | Also using SPARUL
Inference engine | None | Will implement OWL DL supported inference engine Q4 2008
Ontology editor | Internal web-based, Protégé (external) | Export ontology and continue to work in any RDF/OWL-compliant ontology editor
User interface | HTML, Apache Cocoon |
License | Open source, CDDL license |
Evaluation criteria inspired by the Esperonto project
Slide 12
Architecture
[Architecture diagram. Labels: Web client; Search and navigation; SPARQL dispatcher; SPARQL queries; Local endpoint; Indexing; Topic ontology; Metadata store; Ontology maintenance; External clients; SRU client; External servers; SRU server; SPARQL endpoint; Crawler; Portal configuration; SPARQL update; Web resources; OpenSearch; SPARQL client]
The client consists of a search interface that lets users search using free text and metadata. The search string is translated into a structured SPARQL query.
Interoperability at the query layer
The system accepts queries in both SPARQL and SRU/CQL
The backend consists of an RDF store with SPARQL interfaces. Free-text indexing uses Lucene/LARQ
The system can query external SPARQL and SRU/CQL services
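As a rough illustration of how a free-text search string might be turned into a structured SPARQL query, here is a minimal Python sketch. The function names are hypothetical, and matching only against dc:title with a prefix wildcard on every term is a simplifying assumption, not the portal's actual query template. The pf: namespace is the ARQ property-function namespace used by LARQ.

```python
def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_describe_query(search_string: str, prefix_match: bool = True) -> str:
    """Turn a free-text search string into a SPARQL DESCRIBE query.

    Assumption: all terms must match (Lucene AND) and only dc:title is searched.
    Lucene special characters in the terms are not handled in this sketch.
    """
    terms = search_string.split()
    if not terms:
        raise ValueError("empty search string")
    if prefix_match:
        terms = [t + "*" for t in terms]
    lucene_query = escape_literal(" AND ".join(terms))
    return (
        "PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>\n"
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
        "DESCRIBE ?x\n"
        "WHERE {\n"
        f'  ?lit pf:textMatch "{lucene_query}" .\n'
        "  ?x dc:title ?lit .\n"
        "}"
    )


if __name__ == "__main__":
    print(build_describe_query("breast cancer"))
```

Keeping the translation in one small function makes it easy to point the same search box at a different query template later.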
Slide 13
Sublima
Ontologies generally provide the structure for the navigation of the results, and support browsing and classification.
Ontologies allow for term disambiguation, query rewriting and semantic distance measures
In Sublima we use informal SKOS for
• Navigating through subjects, showing the subject relations (“fish eye”)
• Search expansion; synonyms, common misspellings
• Faceted filtering; topics as well as other metadata
A future version will also support OWL DL
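The search expansion idea can be sketched in a few lines of Python. This is an illustration only: the Concept class and function names are hypothetical, the sample concept is made up, and storing synonyms as skos:altLabel and misspellings as skos:hiddenLabel is an assumption about how such a topic model could be laid out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Concept:
    pref_label: str                                          # skos:prefLabel
    alt_labels: List[str] = field(default_factory=list)      # synonyms (assumed skos:altLabel)
    hidden_labels: List[str] = field(default_factory=list)   # misspellings (assumed skos:hiddenLabel)


def expand_term(term: str, concepts: List[Concept]) -> List[str]:
    """Return all labels of the first concept that carries the term, else the term itself."""
    needle = term.lower()
    for concept in concepts:
        labels = [concept.pref_label, *concept.alt_labels, *concept.hidden_labels]
        if needle in (label.lower() for label in labels):
            unique: List[str] = []
            for label in labels:
                # Keep the original order, drop case-insensitive duplicates.
                if label.lower() not in (u.lower() for u in unique):
                    unique.append(label)
            return unique
    return [term]


def expanded_query(term: str, concepts: List[Concept]) -> str:
    """Build a Lucene-style OR query over the expanded labels."""
    return " OR ".join(f'"{label}"' for label in expand_term(term, concepts))


# Hypothetical sample data.
SAMPLE_CONCEPTS = [Concept("influenza", ["flu"], ["influensa"])]

if __name__ == "__main__":
    print(expanded_query("flu", SAMPLE_CONCEPTS))
```

A user typing the synonym or the misspelling then searches for every label of the topic, not just the string that was entered.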
Slide 14
Good and bad choices, lessons learned the hard way
• Keeping the semantics
• Living with free-text indexing and structured queries
• Tool maturity
• Scalability
Keep in mind this is NOT a research project, but one with real and demanding customers expecting everything to work
Slide 15
Preserving the semantics
We needed flexibility for users to add any metadata without touching code
SPARQL SELECT loses the meaning, returning only variable bindings, so clients become static. We therefore used SPARQL DESCRIBE extensively
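PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
# Describe every resource whose title matches the free-text pattern
DESCRIBE ?x
WHERE {
  ?lit pf:textMatch "cancer*"@en .
  ?x dc:title ?lit .
}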
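To illustrate why a DESCRIBE result keeps the client flexible, here is a small Python sketch. It assumes the DESCRIBE result has already been parsed into (subject, predicate, object) tuples; the function names, the ex: identifiers and the sample triples are hypothetical. The point is that the renderer loops over whatever predicates come back, so a newly added metadata field shows up without a code change.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def group_by_predicate(triples: List[Triple], subject: str) -> Dict[str, List[str]]:
    """Collect all values per predicate for one subject."""
    props: Dict[str, List[str]] = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            props[p].append(o)
    return dict(props)


def render(triples: List[Triple], subject: str) -> str:
    """Render every property of the subject, whatever the predicates are."""
    lines = [subject]
    for predicate, values in sorted(group_by_predicate(triples, subject).items()):
        lines.append(f"  {predicate}: {', '.join(values)}")
    return "\n".join(lines)


# Hypothetical sample: ex:readingLevel stands for a field added later by a user.
SAMPLE_TRIPLES: List[Triple] = [
    ("ex:r1", "dc:title", "Asthma in children"),
    ("ex:r1", "ex:readingLevel", "beginner"),
    ("ex:r2", "dc:title", "Another resource"),
]

if __name__ == "__main__":
    print(render(SAMPLE_TRIPLES, "ex:r1"))
```

A SELECT-based client would instead need its variable list edited for every new field.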
Slide 16
Living with free-text indexing and structured queries
Indexing with respect to structure
Our “breastfeeding twins” problem
• Not sufficient to index all literals, as users expect hits on the combination of dc:title and dc:description
• And even worse: the combination of dc:title and dc:subject/skos:prefLabel
Scoring/ranking
• Easy with SELECT, but not with DESCRIBE
• How do you rank results from a structured query?
No universal way to handle structured and unstructured information
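One reading of the indexing problem above can be shown with a toy Python example. Everything here is hypothetical (the resource, its field values, the function names), and plain token sets stand in for a real Lucene index. Matching each literal separately misses a two-word query whose words sit in different fields, while matching against all literals of a resource combined finds it.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case whitespace tokenization; a stand-in for a real analyzer."""
    return set(text.lower().split())


def match_per_literal(resources: Dict[str, Dict[str, str]], query: str) -> List[str]:
    """Hit only if a single literal contains every query term."""
    terms = tokenize(query)
    return [
        uri
        for uri, fields in resources.items()
        if any(terms <= tokenize(value) for value in fields.values())
    ]


def match_per_resource(resources: Dict[str, Dict[str, str]], query: str) -> List[str]:
    """Hit if the query terms are covered by all literals of the resource together."""
    terms = tokenize(query)
    hits = []
    for uri, fields in resources.items():
        combined: Set[str] = set()
        for value in fields.values():
            combined |= tokenize(value)
        if terms <= combined:
            hits.append(uri)
    return hits


# Hypothetical sample resource: the two query words live in different fields.
SAMPLE_RESOURCES = {
    "ex:r1": {
        "dc:title": "Breastfeeding advice",
        "dc:description": "Practical tips for parents of twins",
    }
}

if __name__ == "__main__":
    print(match_per_literal(SAMPLE_RESOURCES, "breastfeeding twins"))
    print(match_per_resource(SAMPLE_RESOURCES, "breastfeeding twins"))
```

The combined approach fixes recall but leaves the ranking question from the slide open, since it no longer knows which field a term matched in.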
Slide 17
Consistent tool maturity and missing links
Some “small” issues
• Lack of support for Turtle in Protégé -> needed to convert to RDF/XML
• Resources identified with URLs in Protégé
• Tools mostly geared towards one dialect of RDF/OWL
• Non-deterministic RDF/XML serialization for XSLT processing
• Lacking a binding from OWL classes to OO languages
The simple things sometimes turn out to be the hardest…
Slide 18
Scalability
Response time varies with store size and query complexity
• Too much complexity in queries
Moving from 500k triples to tens of millions
• Need to refactor into smaller, faster queries
• Federation of queries
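Federation of queries could look roughly like the Python sketch below: the same query goes to several endpoints in parallel and the resulting triples are merged. This is an assumption-laden illustration, not the project's dispatcher. The endpoints are plain callables standing in for real SPARQL or SRU clients, and all names and sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
Endpoint = Callable[[str], Iterable[Triple]]


def federate(query: str, endpoints: List[Endpoint], timeout: float = 5.0) -> Set[Triple]:
    """Send one query to every endpoint concurrently and merge the triples."""
    merged: Set[Triple] = set()
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        futures = [pool.submit(endpoint, query) for endpoint in endpoints]
        for future in futures:
            try:
                merged.update(future.result(timeout=timeout))
            except Exception:
                # Skip endpoints that fail or do not answer within the timeout.
                # Note: the pool still waits for their threads when it shuts down.
                continue
    return merged


# Hypothetical stand-ins for remote endpoints.
def endpoint_a(query: str) -> List[Triple]:
    return [("ex:r1", "dc:title", "A")]


def endpoint_b(query: str) -> List[Triple]:
    return [("ex:r1", "dc:title", "A"), ("ex:r2", "dc:title", "B")]


def endpoint_down(query: str) -> List[Triple]:
    raise ConnectionError("endpoint unavailable")


if __name__ == "__main__":
    print(sorted(federate("DESCRIBE ?x WHERE { }", [endpoint_a, endpoint_b, endpoint_down])))
```

Merging into a set removes triples that several portals return, which matters when the portals share resources.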
Slide 19
Some good lessons
New standards (e.g. SPARQL), proposals for standardization (e.g. SPARUL), new tools (e.g. Jena), open source (e.g. Tomcat, Apache), and a lack of good documentation all say high risk!
However, the support and maintenance from the W3C community and open source developers (e.g. the Jena team) has been impressive. The support through IRC channels, mailing lists etc. has been invaluable for the project.
Slide 20
Some good lessons
Good experiences with reusing metadata schemas
• FOAF, Dublin Core, Powder, SKOS, SIOC, Lingvoj
Extensive dereferencing of URIs: any topic or resource URI pasted into the browser results in a DESCRIBE query for that URI.
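The dereferencing step can be sketched as follows in Python. The function names, the endpoint address and the example URI are hypothetical. The sketch builds a DESCRIBE query for a requested URI and encodes it in the `query` parameter, as in a SPARQL protocol GET request.

```python
from urllib.parse import urlencode

# Characters that may not appear inside <...> in a SPARQL IRI reference.
_FORBIDDEN = set('<>"{}|\\^` ')


def describe_query_for(uri: str) -> str:
    """Build a DESCRIBE query for one URI, rejecting characters illegal in an IRI reference."""
    if not uri or any(ch in _FORBIDDEN or ord(ch) < 0x20 for ch in uri):
        raise ValueError(f"not a usable IRI: {uri!r}")
    return f"DESCRIBE <{uri}>"


def endpoint_url(endpoint: str, uri: str) -> str:
    """Build the GET URL that asks the endpoint to describe the given URI."""
    return endpoint + "?" + urlencode({"query": describe_query_for(uri)})


if __name__ == "__main__":
    # Hypothetical endpoint and topic URI.
    print(endpoint_url("http://localhost/sparql", "http://example.org/topic/asthma"))
```

Rejecting characters such as `>` before building the query keeps a pasted URI from changing the query's structure.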
Slide 21
Living with informal and formal ontologies
Current ontologies are modeled informally with W3C Simple Knowledge Organization System (SKOS)
• No distinction between part-of, contains, is-a
• No reasoning support
• Possible with small datasets
Sublima will also support models using formal ontologies
• Formal IS-A
• DL reasoning
• Required for large datasets
[Diagram labels: Expressivity; Reasoning; Large data sets; Smaller data sets]
Slide 22
Future work
• Integration with other SPARQL-based portals.
• Interoperability with ISO Topic Maps models
• Graphical visualization with touch screen, clever UIs
• High-quality multimedia resources
The code base is now in use in more projects
Slide 23
Conclusion
We clearly found that the technology currently available is starting to reach a certain level of maturity when it comes to functionality. BUT THERE ARE STILL RISKS!
Careful evaluation of tools and scalability is needed as content increases.
Query interoperability
Do not eat the whole menu at once!
[Diagram labels: Recording companies; Broadcasters; High-quality metadata; Open metadata, e.g. Wikipedia]
Slide 24
Thank you for your attention
david.norheim@computas.com
We welcome the chance to share experiences with you! Welcome to upcoming conferences in Norway next year
• Mid February in Oslo - hands-on tutorials
• May in Stavanger - Semantic Days focusing on the oil and gas industry
• September 2008 - initiating Scandinavian Semantic Web Conference