Date post: | 31-Mar-2015 |
Category: |
Documents |
Upload: | ethan-leiner |
XX
MM
LL
andand
KK
MM
XML and KM Powering Information and Retrieval
for the Semantic Web
Frank Cervone, Assistant University Librarian for Information Technology,
Northwestern University
Darlene Fichter, Data Library Coordinator,
University of Saskatchewan Library
Introductions
• Who are you?
• Where do you work?
• What is your experience with KM?
• What is your interest in XML?
Outline
• Semantic Web and KM
• What is XML?
• SGML & HTML - where do they fit?
• XML - Structure and Elements
• XML Applications
– Integration of disparate content
• News
– Expertise profiling
– Enterprise solutions
Semantic Web
“The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”
Tim Berners-Lee and others
One Goal
Support elaborate, precise searches by integrating and utilizing all relevant sources of information and relationships.
Illustration from Scientific American May 1, 2001
Is XML a magical fix?
• Not likely.
• It does not magically integrate redundant data versions
• We’re unlikely to replace existing systems with a single, common, shared version of integrated data just for this reason
• But, if used correctly, XML can help
Harness the Power of Semantics
• If we wish to harness this power, then we need to
– Understand and resolve the different words and meanings we use to refer to the same things
– Consider ways and means of defining standard terminology and establishing agreed-upon meaning, usually through standard metadata
– Be able to use XML messaging between applications and transformations
Pieces
XML – Codification of Knowledge
Knowledge Representation
In order for the “idea” to become a reality computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning.
Why talk about the semantic web?
• Many of the “information intensive” processes of KM are facing the same challenge
– Capture – formalize existing knowledge
– Select and assess relevance, value ..
– Store – in repository with schema
– Share – distribute based on interest and work
– Apply – retrieve, use in daily work
– Create new knowledge
Beckman, T. Eight stage process of KM
XML & KM – What’s the connection?
• Many KM activities have nothing to do with technology
• For some KM activities, technology is a key enabler or component
– In these cases XML is often under the hood
– Knowing about XML means we can exploit the opportunities and see the limitations
XML Overview
• Structured data interchange
– A common syntax for expressing structure in data
• Designed to account for “unstructured” data
– documents
• Inherently conveys meaning/structure
• Content and display separate from structure
• Delivered via standard text files
XML in 7 bullets
• New, but not that new
• Structured data in a text file via markup
• Self-describing information
• Looks like HTML but isn't
• Verbose text, isn't meant to be read
• License-free, platform-independent and well-supported
• A family of technologies
(parts adapted from Bert Bos, http://www.si.uniovi.es/mirror/www.w3.org/XML/1999/XML-in-10-points)
Driving Forces for XML Adoption
• Internationalized media-independent electronic publishing
• Definition of platform-independent protocols for the exchange of data
– electronic commerce
– knowledge harvesting
• Information delivery to user agents
– automatic processing after receipt
Benefits of Adoption
• Easier to develop software
– handle specialized information distributed over the Web
• Processing information using lighter-weight software
• Allows greater end-user control of information display
– style sheets
• Metadata for resource discovery
The *ML family
• SGML
• HTML
• XML
From World Wide Web Consortium note W3C Data Formats, by Tim Berners-Lee.
SGML
• Designed for documents
• Very powerful
• Very complicated
• “Well defined” = strict rules
• Rigid - not very extensible
• Inappropriate for wide-spread use
HTML
• Simple, general-purpose document markup language
• Simple hyperlinking
• Designed for collaborative authoring
• Combined authoring and viewing roles
HTML Evolution
• Started with simple document description
– Few tags designed for structuring documents
• Quickly evolved
– forms
– images
– tables
– frames
– fonts
HTML shortcomings
• Not easily extensible
– HTML standards change too slowly
– Browser-specific tags ("extensions")
– Totally geared toward document display
• Limited data formatting
– mathematics
• Can't mark up data in any structurally meaningful way
Why can’t HTML be used for information exchange?
• HTML markup provides no inherent method of knowing what the information is about
• Browser paradigm is too constraining
• Metadata schemes are deficient
– Search engines return far too many hits
• Can't relate information items (pages) to one another
• One-way linking is somewhat limited
How HTML confuses content and presentation
• <h1>…<h6>
• <br>
• <p></p>
• <center>
• <table>
Example - content and presentation mixture in HTML
<HTML>
<BODY BGCOLOR=#FFFFFF>
<H1>005.72 M849et2001</H1>
<I>Enterprise application integration with XML and Java
</I>
<BR>
Upper Saddle River, NJ : Prentice Hall PTR, 2001
</BODY>
</HTML>
But what does it mean?
XML represents structure, not presentation
<marc>
<!-- Attribute names "tag" and "code" are assumed; the slide omitted them. -->
<field tag="245" indicator_1="1" indicator_2="0">
<subfield code="a">Enterprise application integration with XML and Java</subfield>
<subfield code="c">J.P. Morgenthal, with Bill la Forge</subfield>
</field>
<field tag="260">
<subfield code="a">Upper Saddle River, NJ</subfield>
<subfield code="b">Prentice Hall PTR</subfield>
<subfield code="c">2001</subfield>
</field>
</marc>
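A minimal Python sketch of what this structure buys a program: because each piece of the record is labeled, software can ask for the title or the publication year by name instead of guessing from layout. The attribute names `tag` and `code` and the helper `subfield` are assumptions made for illustration, not part of the slide.

```python
import xml.etree.ElementTree as ET

# A well-formed rendering of the record. The attribute names "tag" and
# "code" are assumed for this sketch.
RECORD = """
<marc>
  <field tag="245" indicator_1="1" indicator_2="0">
    <subfield code="a">Enterprise application integration with XML and Java</subfield>
  </field>
  <field tag="260">
    <subfield code="a">Upper Saddle River, NJ</subfield>
    <subfield code="b">Prentice Hall PTR</subfield>
    <subfield code="c">2001</subfield>
  </field>
</marc>
"""

def subfield(root, tag, code):
    """Return the text of one subfield, or None if it is absent."""
    node = root.find(f"./field[@tag='{tag}']/subfield[@code='{code}']")
    return node.text if node is not None else None

root = ET.fromstring(RECORD)
title = subfield(root, "245", "a")
year = subfield(root, "260", "c")
print(title)
print(year)
```

The same lookup against the HTML version would have to rely on the accident that the title happens to be in italics.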
XML is hierarchical
MARC
  245 title
    a Enterprise Application Integration with XML and Java
    c J.P. Morgenthal, with Bill la Forge
  260 publisher
    a Upper Saddle River, NJ
    b Prentice Hall PTR
    c 2001
Nesting
<bigdoll>
<mediumdoll>
<littledoll>
rosette theme <littlestdoll/>
</littledoll>
</mediumdoll>
</bigdoll>
Elements, Attributes, and Content
<!-- Attribute names "tag" and "code" are assumed; the slide omitted them. -->
<field tag="245" indicator_1="1" indicator_2="0">
<subfield code="a">Enterprise application integration with XML and Java</subfield>
<subfield code="c">J.P. Morgenthal, with Bill la Forge</subfield>
</field>
DOM – Document Object Model
• DOM – a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents
• Built into web browsers and servers
– Used by web browsers for dynamic display capabilities
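As a rough illustration of "access and update", the sketch below uses `xml.dom.minidom` from the Python standard library, one of many DOM implementations. The sample document and the added `subject` element are made up for the example.

```python
from xml.dom import minidom

# Sample document invented for this sketch.
doc = minidom.parseString("<book><title>Working Knowledge</title></book>")

# Access: walk from the document to the text node inside <title>.
title = doc.getElementsByTagName("title")[0]
original = title.firstChild.data

# Update: change the text, then add a new element through the same interface.
title.firstChild.data = original.upper()
subject = doc.createElement("subject")
subject.appendChild(doc.createTextNode("Knowledge management"))
doc.documentElement.appendChild(subject)

result = doc.documentElement.toxml()
print(result)
```

A browser script would use the same calls (`getElementsByTagName`, `createElement`, `appendChild`), which is the point of a language-neutral interface.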
Document Type Definition (DTD)
• A set of syntax rules for creating tags
• Defines
– What tags can be used
– The order they should appear in
– Which tags can be nested
– Which tags have attributes
• Can be part of an XML document
– Typically defined externally
DTD and Elements
<!DOCTYPE BOOK [
<!ELEMENT BOOK (AUTHOR?, TITLE, PUBLISHER+, SUBJECT*)>
<!ELEMENT AUTHOR (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT SUBJECT (#PCDATA)>
]>
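The occurrence markers in the BOOK content model read like a regular expression: `?` means optional, `+` one or more, `*` zero or more, and no marker means exactly one. The Python sketch below makes that reading concrete by checking child order against a hand-written regular expression. It is an illustration only: Python's standard XML parsers do not validate against a DTD, and the function name `matches_book_model` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The BOOK content model (AUTHOR?, TITLE, PUBLISHER+, SUBJECT*) written as a
# regular expression over a space-terminated list of child element names.
BOOK_MODEL = re.compile(r"(AUTHOR )?(TITLE )(PUBLISHER )+(SUBJECT )*")

def matches_book_model(xml_text):
    """Return True if the children of <BOOK> follow the content model."""
    root = ET.fromstring(xml_text)
    if root.tag != "BOOK":
        return False
    children = "".join(child.tag + " " for child in root)
    return BOOK_MODEL.fullmatch(children) is not None

good = "<BOOK><TITLE>t</TITLE><PUBLISHER>p</PUBLISHER></BOOK>"
bad = "<BOOK><AUTHOR>a</AUTHOR><PUBLISHER>p</PUBLISHER></BOOK>"  # no TITLE
print(matches_book_model(good), matches_book_model(bad))
```

A validating parser performs this kind of check for every element declared in the DTD, not just the root.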
Attributes
<!ELEMENT PERSON EMPTY>
<!ATTLIST PERSON person_id ID #REQUIRED>
<!ATTLIST PERSON sex (M | F) #IMPLIED>
<!ATTLIST PERSON status (employee | trainee) "employee">
<!ATTLIST PERSON company CDATA #FIXED "XYZ">
Schemas
• Introduces a mechanism for strong typing
– Allows a schema to be directly imported into a database to create a table
• Standardized NULL representation
• Key representation
Well-formed and valid
• Well-formed
– Conforms to the general rules of XML syntax, which are very rigorous
– Example – a tag must always be ended
• <title>Discourse Analysis</title>
• <subtitle/>
• Valid
– Documents that conform to the specific DTD in use
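A short Python sketch of the well-formedness half of this distinction: any XML parser must reject a document whose tags are not properly ended. The helper name `is_well_formed` and the sample strings are assumptions for illustration; validity against a DTD is a separate check that this sketch does not perform.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Every element is ended, either with a close tag or as an empty element.
print(is_well_formed("<book><title>Discourse Analysis</title><subtitle/></book>"))
# <title> is never ended, so the parser refuses the whole document.
print(is_well_formed("<book><title>Discourse Analysis</book>"))
```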
XML-Link and XML Pointer
• Open set of linking elements
• Non-directional
– arbitrary
– non-hierarchical
• XML Pointer
– Enables addressing any part of a text
• A more powerful HTML “anchor” tag
• XML-Link
– Enables attaching a behavior to a link
– Extended links, similar to a web ring
XML-Link Example
<related-URL-group>search
<related-URL HREF="altavista.xml"/>
<related-URL HREF="webbrain.xml"/>
<related-URL HREF="yahoo.xml"/>
</related-URL-group>
<!ELEMENT related-URL-group (#PCDATA | related-URL)*>
<!ATTLIST related-URL-group
XML-Link CDATA #FIXED "EXTENDED"
INLINE CDATA #FIXED "TRUE"
CONTENT-ROLE CDATA #FIXED "RT"
>
Displaying XML information in the browser
• XML parser built in
– Relates data stream to DTD and style sheet
• Style Sheets
– Only method for formatting XML data for display
• Similar to HTML CSS
– More powerful
• XSLT
– Processing language that allows for transformation of data presentation
XHTML
• “Next generation” HTML
• HTML that conforms to XML standards
• Will eventually support integration with other XML applications
• Device independent web-access
XHTML Example
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Bare bones example</title></head>
<body>
<p><a href="http://validator.w3.org/check/referer">validate</a></p>
</body>
</html>
HTML 4 - XHTML Major Differences
• All related to “well formedness”
– Tag/attributes must be in lower-case
– Elements must nest, no overlap
– All non-empty elements must be closed
– All empty elements must be terminated
– Attribute values must be quoted
– Attributes cannot be minimized
– Scripts should be downloaded from server
XML Life Cycle
• Authoring
• Presentation
• Search and Retrieval
• Integration
The Big Picture
Just “Add Water & Stir”
XML (document or database)
XSLT style sheet
XSLT Processor (XML Parser)
Browser (XML Parser)
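The pipeline on this slide (XML source plus style sheet in, browser-ready output out) can be imitated in a few lines. Python's standard library has no XSLT processor, so the sketch below swaps in a hand-written transformation function where the style sheet would go. The function `to_html`, its mapping rules, and the sample document are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Sample source document invented for this sketch.
SOURCE = "<book><title>Working Knowledge</title><subject>Knowledge management</subject></book>"

def to_html(xml_text):
    """Map <title> to <h1> and every <subject> to a <p>, in document order.

    This stands in for an XSLT style sheet; it is not XSLT.
    """
    book = ET.fromstring(xml_text)
    parts = [f"<h1>{escape(book.findtext('title', default=''))}</h1>"]
    parts += [f"<p>{escape(s.text or '')}</p>" for s in book.findall("subject")]
    return "<html><body>" + "".join(parts) + "</body></html>"

page = to_html(SOURCE)
print(page)
```

With a real XSLT processor such as Xalan or MSXML, the mapping rules would live in a separate style sheet, so the presentation could change without touching the program or the data.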
Authoring Tools
• Editors (getting the content in)
– XML and XSLT Editors
• XML Spy
• XML Notepad
• XMetal
• Xeena
– Word processors
• WordPerfect
– Content Management Systems
XML Spy
• Structured/document editor
– XML
– DTD
– schemas (DCD, XDR, BizTalk, XSD)
– XSLT
• Views for:
– Structured editing (grid view, table view)
– Document editing (WYSIWYG)
• Full Unicode support
– MSXML3 is used by default, but can be changed
XML Notepad
• Quick and dirty editor for Windows
• Doesn't use DTD to guide editing
– if a DTD is present, however, the document is validated against it on loading
XMetal
• Professional, full-featured XML/SGML editing tool
– word processor-like view
– source view
– tag view
• SGML or XML DTDs
– context-sensitive lists of allowed elements and attributes
– supports CALS tables, DOM, CSS, and HTML
• Integrated browser preview for XML documents
Xeena
• Loads DTD and provides tree-view syntax directed editing
• Aware of the DTD grammar
– Makes only the icons of authorized elements sensitive
– Ensures that all documents generated are valid according to the given DTD
WordPerfect
• Word processor with advanced support for authoring XML and SGML documents in a WYSIWYG environment
• Includes
– Wizards
– Automatic element insertion
– Automatic generation of documents
• The DTD, layout information, and mapping files are incorporated into a single WordPerfect template.
Content Management Systems
• Many CM system repositories use XML under the hood for tagging and storing information
• Or can “speak” XML – export as XML to allow integration with other applications
• Open any trade magazine and see the standard vendor names proclaim their support for XML
• To the document creator, XML is “invisible”
XML Conversion Tools
Examples:
• Logictran RTF Converter
• HTML Tidy
– Free Windows program
– Converts HTML to XHTML or XML
Logictran RTF Converter
• Converts Word and RTF documents to HTML, XML, SGML
• The converter allows you to create output for any DTD.
• You can generate HTML, XHTML, OEB and DocBook.
XSLT Processors
• Means of converting files between XML dialects and other formats
– MSXML built into Internet Explorer
• http://msdn.microsoft.com/xml
– Xalan
• http://xml.apache.org/xalan-j/index.html
XML Parsers
Examples
• Expat
– Written in C (ported to other languages), used by LIBWWW, Apache, …
• XML4J
– from alphaWorks, in Java, based on Apache Xerces, supports DOM and SAX
• Many other parsers
Servers
• Apache XML
– xml.apache.org
– built-in Xerces XML parser, Xalan XSLT processor
Browsers
• Internet Explorer 6
– XML support is fairly extensive
– Namespaces are supported
– Supports style sheets in CSS as well as XSLT 1.0
– Parser is still an issue
• Netscape 6.1
– supports HTML 4.0, XML, CSS, DOM, namespaces, simple XLink
– Does NOT support XSLT
• Opera
– supports XML
XML Standards & Applications
• Many activities where XML has a role
• OASIS has an extensive list of applications
– RSS (news headlines)
– MathML
– SMIL
– DocBook
XML Standards – Multiplying Like Rabbits
• Software applications (transactions, interchange)
• Publishing
Software Applications
• Office tools and groupware
• Decision support systems
• Functional/transactional systems for HR, CRM ..
• Intelligent systems (ES, IPSS)
• User support
Publishing
• Digital rights (EBX,…)
• DocBook, e-book, TEI
• News (RSS, ICE, NITF, NewsML)
• Special subject area formats (MathML, ChemML, CellML, GeneXML)
Publishing: News
• Web site news
• Syndicated news
• Headlines
• Full text
KM applications
• Integrating internal and external news, auto-categorizing news, adding items to the news based on new additions to the repository, user profiling
ICE
RSS
NewsML
NITF
RSS (Rich Site Summary)
CRM News www.moreover.com
• Web news format
• Simple application
• Take a look at the bits and pieces
RSS – Why?
• The Need
– Quick, easy, and consistent announcements pushed out to other sites
– Incorporate news and other information feeds on a site
How it works
Before RSS
• No standard
• Everyone put up what was new and described it differently
• Each source needed special one-off programs: custom parsers and screen scrapers
The Result
• > 1700 sites sharing news
• Many sites re-posting the headlines
• Examples:
– myuserland.com
– www.moreover.com
– xmlTree - directory of content
RSS Syntax
• An RSS file has two major placeholders for data: channel and items.
Channel Element
• The channel element must contain the following:
– title or name of the channel,
– short description of the channel,
– link to the web site of the channel, and
– the language in which the web site is written.
• Also, numerous optional elements can be included with the channel, such as copyright, webmaster, publication date and so on.
Item Element
• An RSS file can have up to 15 item elements. Item elements are used to store the headlines and are the meat of the document. Each item element has the following child elements:
– title
– link
– description
RSS Code
• First line contains an XML declaration:
<?xml version="1.0"?>
• The next item is the DTD identifier:
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">
RSS Statement
• Next, the rss element
– must specify the version attribute.
– may contain an encoding attribute
• the default is UTF-8
<rss version="0.91" encoding="ISO_8859-1">
Channel Definition
• Contains a single channel element.
– Title, description, link to channel’s web site, language, one or more item elements, lots of optional elements
<channel>
<title>moreover... US politics news</title>
<link>http://www.moreover.com</link>
<description>US politics news - news headlines from around the web, refreshed every 15 minutes</description>
<language>en-us</language>
Item Elements
• Up to 15 item elements
<item>
<title>'Author Unknown' by Don Foster</title>
<link>http://www.salon.com/books/feature/2000/10/30/pbacks/index.html</link>
<description>Salon Nov 2 2000 6:51AM</description>
</item>
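Because every RSS file uses the same channel and item elements, one small program can read headlines from any site, which is what replaced the one-off screen scrapers. A minimal Python sketch follows. The feed text is assembled from the fragments on these slides with the channel description shortened, and the function name `headlines` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Feed assembled from the slide fragments; description shortened.
FEED = """<rss version="0.91">
  <channel>
    <title>moreover... US politics news</title>
    <link>http://www.moreover.com</link>
    <description>US politics news</description>
    <language>en-us</language>
    <item>
      <title>'Author Unknown' by Don Foster</title>
      <link>http://www.salon.com/books/feature/2000/10/30/pbacks/index.html</link>
      <description>Salon Nov 2 2000 6:51AM</description>
    </item>
  </channel>
</rss>"""

def headlines(rss_text):
    """Return (channel title, [(item title, item link), ...])."""
    channel = ET.fromstring(rss_text).find("channel")
    items = [(item.findtext("title"), item.findtext("link"))
             for item in channel.findall("item")]
    return channel.findtext("title"), items

channel_title, items = headlines(FEED)
print(channel_title)
for title, link in items:
    print(f"{title} -> {link}")
```

The same function works unchanged on a feed from another publisher, as long as that feed follows the channel/item layout.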
From Simple Documents to Complex
• Hierarchical
• Many objects and elements
• Many “namespaces”
Namespaces
• A single XML document may contain elements and attributes that are defined for and used by two or more XML-based languages without conflict or ambiguity
Example
<!-- The wrapper element name "record" is assumed; namespace declarations must sit on an element. -->
<record xmlns:book="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>Working Knowledge</dc:title>
<dc:description>Overview and case studies of knowledge management</dc:description>
<book:chapter>5. Knowledge Transfer … </book:chapter>
</record>
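To a program, a prefix such as `dc:` is only shorthand: the parser expands each element name to the full namespace URI plus the local name, which is what keeps a Dublin Core title from colliding with a title defined by another vocabulary. A minimal Python sketch, assuming a hypothetical wrapper element named `record` to carry the namespace declarations:

```python
import xml.etree.ElementTree as ET

# The wrapper element "record" is an assumption made for this sketch.
DOC = """<record xmlns:book="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Working Knowledge</dc:title>
  <book:chapter>5. Knowledge Transfer</book:chapter>
</record>"""

# Prefix-to-URI map used for lookups; the prefixes here need not match the
# ones used inside the document, only the URIs matter.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "book": "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd",
}

root = ET.fromstring(DOC)
dc_title = root.findtext("dc:title", namespaces=NS)
chapter = root.findtext("book:chapter", namespaces=NS)
qualified_tag = root[0].tag  # the expanded name the parser actually stores
print(dc_title, chapter, qualified_tag)
```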
OEB - Open E-Book
• In September 1999, the group published the Open E-Book 1.0 Publication Structure
• The Open E-book standard is essentially XHTML—that is, a clean version of HTML 4.0 along with support for CSS.
• www.openebook.org
RDF - Resource Description Framework
• Framework for metadata
• Interoperability of information exchange between applications
• Applications:
– Resource discovery
– Knowledge sharing and exchange
– Content rating
– Intellectual property rights
RDF Example
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.0/">
<rdf:Description rdf:about="http://your.url" dc:creator="Frank Cervone" dc:title="My RDF document"
dc:description="Exciting RDF Stuff." dc:date="2000-11-10" />
</rdf:RDF>
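For resource discovery, the useful step is turning a description like this into property/value pairs about a resource. The Python sketch below does that with the standard library's generic XML parser. It is not an RDF toolkit and handles only the attribute form used on this slide; the function name `describe` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.0/"

DOC = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="http://your.url"
      dc:creator="Frank Cervone"
      dc:title="My RDF document"
      dc:date="2000-11-10" />
</rdf:RDF>"""

def describe(rdf_text):
    """Return {resource: {dublin_core_property: value}} for each Description."""
    result = {}
    for desc in ET.fromstring(rdf_text).findall(f"{{{RDF_NS}}}Description"):
        # Attribute names arrive as "{namespace-uri}local"; keep the Dublin
        # Core ones and strip the "{uri}" part to get the property name.
        props = {name[len(DC_NS) + 2:]: value
                 for name, value in desc.attrib.items()
                 if name.startswith(f"{{{DC_NS}}}")}
        result[desc.get(f"{{{RDF_NS}}}about")] = props
    return result

metadata = describe(DOC)
print(metadata)
```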
Emerging Standards For KM
• XTM
• OPML
• RFML
• FLBC
• ebXML
XTM: Topic Maps
• Used to organize information into knowledge bases
• Topic maps are a new ISO standard for describing knowledge structures and associating them with information resources
• “GPS” for information
• http://www.topicmaps.org/xtm/index.html
“A book without an index is like a country without a map”
OPML
• Outline Processor Markup Language
– Outline-structured information
• Used for data that is easily browsed and edited
– Specifications
– Legal briefs
– Product plans
– Presentations
– Screenplays
– Directories
RFML
• Relational-functional markup language
• Used to define relationships and functions among data elements
– Tables within relational databases
– Relational views
FLBC
• Formal Language for Business Communication
– Automated communication
– Conversation management
– Dialog management
– Based on speech act theory
• Formally defined message types
• Broad range of message types
• Defined in terms of intentions
• Clear delineation between message type and content
XML in Use
• Portals
• Content management & syndication
• Content management: industry sector
• Integration
• Analytical/decision making
• Search and retrieval
• Visualization
Applications: Portals
• Portals are an obvious place for XML to be used. Most integrate diverse data sources.
• Examples:
– Hummingbird’s Enterprise Portal Suite
• allows XML-based third party application integration for a variety of scripting languages
• Basically “write with your own tools/platform” and exchange data with XML
– DataChannel, Sybase Enterprise Portal, Citrix XPS
Content Production & Syndication
• Interwoven
– Intranet/extranet content management and authoring based on intelligent business rules, profiling etc.
– Newest component of Interwoven’s suite of tools focuses on content distribution and uses XML.
– OpenSyndicate uses an XML repository which allows content to be stored as objects and reused for multiple projects.
Open Syndicate
Content: Industry Specific Solutions
• Ringtail Solutions
– Suite of litigation support and KM modules for legal practitioners
Integration
• InfoShark
– Used to integrate data from a host of services and programs, from 100s to 1000s of transactions each day
– Automates data exchange between Oracle, IBM DB2 and Microsoft SQL for use over the Internet, intranets, and extranets
– Being used by Montgomery County for eGov services of all types
Analytical/Decision Making
• Spotfire
– DecisionSite 6.2 is powered by an XML-based application manager for tools, guides, and resources for genomics, chemistry, and manufacturing
Visualization
• Antarcti.ca
– visual mapping technology provides enterprises with data search and discovery,
Not a Silver Bullet
“XML is not the answer to all the world’s problems—it creates new problems, that are awfully damn interesting to solve.”
Simon St. Laurent,
author of XML: A Primer,
on the xml-dev mailing list
Thank you!
• Frank Cervone, Assistant University Librarian for Information Technology, Northwestern University
• Darlene Fichter, Data Library Coordinator, University of Saskatchewan