XML for Information Management – Day 5, Airi Salminen
XML for Information Management
University of Erlangen-Nuremberg, Computational Linguistics
Instructor: Professor Airi Salminen, http://users.jyu.fi/~airi/
26.4.-30.4.2010
Outline
1. Types of XML use
1.1 Document format
1.2 Special kinds of primary data
1.3 Metadata
1.4 Data interchange
2. Benefits
3. Challenges
1. Types of XML use
• Format for information assets
  - documents and Web pages
  - special kinds of primary data
  - metadata
• Format for data interchange
1. Types of XML use
Examples in public sector
• Document format: Finnish Parliamentary documents
• Metadata format: XML schemas of the British e-GIF service (http://www.govtalk.gov.uk/)
• Data interchange format: Finnish e-Government services Suomi.fi, Lomake.fi
1.1 Document format
Examples
• Operation and maintenance manual of a paper machine in a paper factory
• Parliamentary documents
• Articles of a news archive
• Office documents (e.g. reports, memos, meeting notices and minutes, slides)
R. Jelliffe, Comparing XML office document formats: HTML, ODF, WordML, FO, Word2007, 2006, http://www.oreillynet.com/xml/blog/2006/07/comparing_xml_office_document.html
1.1 Document format
OpenDocument Format (ODF), http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
• Developed at OASIS (Organization for the Advancement of Structured Information Standards)
• Open XML-based file format for office applications, to be used for documents containing text, spreadsheets, charts, and graphical elements
• OpenDocument v1.0 has been approved as an ISO standard
• Specification defined by three RELAX NG schemas
Competing format: the XML-based Office Open XML (OOXML), originating from Microsoft, approved as an ISO standard in 2008.
1.1 Document format
The office standards ODF and OOXML contain a number of built-in schemas for office documents, but in office environments supporting ODF and OOXML you may also define and use user-defined schemas.
1.1 Document format
Why XML format?
• long-term accessibility of information
• independence from particular software providers
• consistency and correctness in documents
• interoperability of systems
• information retrieval, multichannel publishing
1.2 Special kinds of primary data
Examples
• audio, video, graphics
• mathematics
• chemical data
• music
• geospatial data
• health information
• humanistic data
1.2 Special kinds of primary data
Why XML format?
• to support data management in special application domains with specialized data content
• interoperability over various domains, devices, and software systems
1.2 Special kinds of primary data
XML Schema
May be used to define rich constraints for elements and attributes.
Datatypes constrain attribute values and the character data content of elements.
Built-in datatypes and user-derived datatypes.
Built-in datatypes:
string, normalizedString, token, language
decimal, float, double, integer, nonPositiveInteger, …
duration, dateTime, time, date, gYearMonth, gDay, gMonth
NMTOKEN, NMTOKENS, ID, IDREFS, …
anyURI, QName, boolean, …
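The difference between built-in and user-derived datatypes can be sketched with a small schema. The schema below is a hypothetical illustration (the names measurement, taken, value, humidity, sensor, and percentType are assumptions, not from the course material). Python's standard library does not validate against XML Schema, so the code only reads the schema as an XML document and lists which datatype each declaration uses.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: percentType is a user-derived datatype that
# restricts the built-in datatype xs:integer to the range 0..100.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="percentType">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="100"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="measurement">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="taken" type="xs:dateTime"/>
        <xs:element name="value" type="xs:decimal"/>
        <xs:element name="humidity" type="percentType"/>
      </xs:sequence>
      <xs:attribute name="sensor" type="xs:NMTOKEN"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

root = ET.fromstring(SCHEMA)

# Element declarations that name a datatype directly.
declared = {
    el.get("name"): el.get("type")
    for el in root.iter(f"{{{XS}}}element")
    if el.get("type")
}

# User-derived simple types and the datatype each one restricts.
derived = {
    st.get("name"): st.find(f"{{{XS}}}restriction").get("base")
    for st in root.iter(f"{{{XS}}}simpleType")
}

print(declared)
print(derived)
```

Actual validation of documents against such a schema needs a schema-aware processor, which is outside this sketch.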
1.2 Special kinds of primary data
Graphics: SVG
Mozilla Firefox and Opera provide capabilities for viewing separate SVG documents or SVG fragments on HTML/XHTML pages.
Multimedia & other time-dependent data
SMIL (Synchronized Multimedia Integration Language)
VoiceXML (Voice Extensible Markup Language)
SSML (Speech Synthesis Markup Language)
EMMA (Extensible MultiModal Annotation markup language)
TTML (Timed Text Markup Language)
A SMIL example at the w3schools.com site: Internet Explorer shows the animation
1.2 Special kinds of primary data
Geospatial data
OGC (Open Geospatial Consortium) is the most important standardization organization for geographic XML applications
GML (Geography Markup Language)
KML (Keyhole Markup Language, developed and used by Google)
GeoXACML (Geospatial eXtensible Access Control Markup Language)
CityGML (City Geography Markup Language)
1.2 Special kinds of primary data
Humanistic data
most of the content consists of natural language text
content produced by writers, poets, researchers, teachers
published in books, journals, and other publications
analyzed by humanistic researchers (literature, linguistics, history, religion, philosophy, languages, etc.)
1.2 Special kinds of primary data
Electronic books
PDF is obviously the most popular distribution format today
HTML/XHTML is also used
EPUB is an XML-based standard developed by IDPF (International Digital Publishing Forum, www.idpf.org)
EPUB is intended to improve the interoperability of different reading devices and software products
Books in EPUB format are available, for example, at www.gutenberg.org
1.2 Special kinds of primary data
TEI (Text Encoding Initiative, www.tei-c.org)
intended to provide a framework for encoding “any genre of text from any period in any language”, primarily for humanistic research
versions P1-P3 are based on SGML, version P4 included rules for both SGML markup and XML markup, version P5 provides rules for XML markup only
markup decisions are made by a text encoder who has to be an expert in the text domain
there can be very different kinds of markup decisions for the same text
1.2 Special kinds of primary data
The TEI markup rules are provided in 22 modules defining more than 500 elements together with their attributes. For example:
core: elements common to all TEI documents, including elements for paragraphs, for denoting emphasized or foreign text, and for quotations
verse: verse structures to encode verse lines and line groups
drama: performance texts for printed dramatic texts, screen plays, or radio scripts
dictionaries: monolingual and multilingual dictionaries and glossaries
TEI also provides the procedures by which to choose, customize, and extend the modules for specific needs.
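As an illustration of verse markup, the following sketch reads a small TEI-style fragment with Python's standard library. The fragment is hypothetical and incomplete (a real TEI document also needs a teiHeader, and the verse text is invented); it uses the TEI elements lg (line group), l (verse line), emph, and foreign in the TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Hypothetical fragment: only the verse part is shown, so this is not
# a complete TEI document.
POEM = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <lg type="stanza">
      <l>First line of the <emph>first</emph> stanza</l>
      <l>Second line of the first stanza</l>
    </lg>
    <lg type="stanza">
      <l>Only line of the <foreign xml:lang="la">secunda</foreign> stanza</l>
    </lg>
  </body>
</text>"""

root = ET.fromstring(POEM)

# Count the verse lines in each line group.
stanzas = root.findall(".//tei:lg", NS)
lines_per_stanza = [len(lg.findall("tei:l", NS)) for lg in stanzas]

# Collect the text that the encoder marked as emphasized.
emphasized = [e.text for e in root.iter(f"{{{TEI}}}emph")]

print(lines_per_stanza, emphasized)
```

Another encoder could reasonably mark up the same lines differently, for example without stanza groups, which is the kind of markup decision the encoder has to make.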
1.2 Special kinds of primary data
Characteristic feature of many humanistic texts: structural complexity
• there can be many different kinds of markup decisions for the same text
• the hierarchic structure of XML documents is not always well suited; special solutions are needed, for example, for non-hierarchic, overlapping, and discontinuous structures
TEI examples: TEI By Example project
1.3 Metadata
Metadata concept
data about data
“structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource” (Understanding Metadata, at www.niso.org)
Another good introduction:
Introduction to Metadata, at the www.getty.edu site
1.3 Metadata
XML as metadata vs. XML for metadata
XML as metadata: XML markup in a document serves as metadata in relationship to the character data and unparsed entities (that together can be called the primary content of the document).
XML for metadata: when XML is used for metadata, the metadata may be related to any information object. The metadata may be provided as a separate XML document, or, if the information object is an XML document, the metadata may be embedded in the document.
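The two placements of metadata can be sketched as follows. All element names, attribute names, and the URI are hypothetical choices for this illustration, not part of any standard vocabulary.

```python
import xml.etree.ElementTree as ET

# Case 1: metadata embedded in the XML document it describes.
EMBEDDED = """<memo>
  <metadata>
    <author>A. Author</author>
    <created>2010-04-30</created>
  </metadata>
  <body>Meeting moved to Friday.</body>
</memo>"""

# Case 2: a separate metadata document about any information object,
# here an image (not an XML document) identified by a URI.
SEPARATE = """<record about="http://example.org/photos/42.jpg">
  <author>A. Author</author>
  <created>2010-04-30</created>
</record>"""


def read_metadata(container):
    """Return the child elements of a metadata container as a dict."""
    return {child.tag: child.text for child in container}


memo = ET.fromstring(EMBEDDED)
embedded_meta = read_metadata(memo.find("metadata"))

record = ET.fromstring(SEPARATE)
separate_meta = read_metadata(record)
described_object = record.get("about")

print(embedded_meta)
print(described_object, separate_meta)
```

In both cases the same reading code applies; what differs is whether the described object travels with its metadata or is only referred to.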
1.3 Metadata
Examples of metadata
• information about each document in an organization’s archive, including the author, date of creation, version, and keywords for each document
• access control, distribution information, and retention information for each active document in an organization
• the schemas for the XML documents in an organization
• the relational and other database schemas of the organization
• transaction logs for an organization, whether related to document updates or inventory control
• a dictionary of essential terms used in the organization
• an ontology (concepts and their relationships) of a particular domain
1.3 Metadata
Examples of the use of XML for metadata
• the meta element in an XHTML document provides metadata about the document
• METS (Metadata Encoding and Transmission Standard) and MODS (Metadata Object Description Schema) for the metadata of library objects
• LOM (Learning Object Metadata) for the management of learning objects
Examples can be found with Google, for example:
• filetype:xml mets
• filetype:xml mods
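Because an XHTML page is a well-formed XML document, its meta elements can be read with any XML parser. A minimal sketch, assuming a hypothetical page with invented author and keywords values:

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical XHTML page; the meta values are invented.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Course page</title>
    <meta name="author" content="A. Author"/>
    <meta name="keywords" content="XML, metadata"/>
  </head>
  <body><p>Content</p></body>
</html>"""

root = ET.fromstring(PAGE)

# Collect name/content pairs from all meta elements that have a name.
meta = {
    m.get("name"): m.get("content")
    for m in root.iter(f"{{{XHTML}}}meta")
    if m.get("name")
}

print(meta)
```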
1.3 Metadata
Why XML format?
• integration of data, systems, and services
• building Semantic Web
• long-term accessibility of information
• security and trust
1.3 Metadata
The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.
W3C Semantic Web Activity, http://www.w3.org/2001/sw/
1.3 Metadata
Semantic Web = Web resources + rich metadata
1.3 Metadata
SW architecture (figure: primary resources, metadata resources, applications, Semantic Web technologies)
1.3 Metadata
SW architecture (figure)
primary resources
metadata resources: DTDs, XML Schemata, RDF Schemata, RDF Repositories, Ontologies, Annotations
applications
Semantic Web technologies: URI, Unicode, XML, XML Namespaces, XML Schema, RDF, RDF Schema, XML-Signature, OWL, ...
1.3 Metadata
The Resource Description Framework (RDF) is a model and a language for representing information about resources that can be identified on the Web.
RDF is based on the idea of identifying things using URIs, and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources.
RDF Primer
1.3 Metadata
A statement about Frank Manola:
Frank Manola is an editor of RDF Primer
(resource: Frank Manola; property: is an editor of; value: RDF Primer)
A more formal presentation: (Frank Manola, editor, RDF Primer)
A still more formal presentation is needed for the Web.
1.3 Metadata
resource in RDF
Anything that has an identity given by a URI.
Resources are referred to by their URIs.
For example, to be able to make statements about a person whose name is Frank Manola, we need a URI for him.
PROBLEM: How do we identify individual people by URIs? So far, there is no universal URI identification mechanism for individuals.
1.3 Metadata
RDF graph
a set of (subject, predicate, object) triples, graphically described by nodes and directed arcs
subject (resource): URI reference (or blank (empty) node)
predicate (property): URI reference
object (property value): URI reference or literal (or blank (empty) node)
subject --predicate--> object
1.3 Metadata
• Let us identify Frank Manola by http://acm.org/people/fmanola
• Let us identify the property editor by http://erl.example.org/terms/editor
• Let us represent RDF Primer by the literal character string "RDF Primer"
http://acm.org/people/fmanola --http://erl.example.org/terms/editor--> "RDF Primer"
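The statement about Frank Manola can be sketched in code as a single triple in a set. This is a minimal data-structure illustration, not an RDF library: the names Triple and objects_of are hypothetical, and plain strings are used for all three positions, so the sketch does not distinguish URI references from literals.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# An RDF graph is a set of triples; here it holds the one statement.
graph = {
    Triple(
        "http://acm.org/people/fmanola",
        "http://erl.example.org/terms/editor",
        "RDF Primer",
    )
}


def objects_of(graph, subject, predicate):
    """Return all property values for a given resource and property."""
    return sorted(
        t.obj for t in graph
        if t.subject == subject and t.predicate == predicate
    )


edited = objects_of(
    graph,
    "http://acm.org/people/fmanola",
    "http://erl.example.org/terms/editor",
)
print(edited)
```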
1.3 Metadata
If we do not know a URI for Frank Manola but we know something about him, we can use a blank (empty) node to gather properties about him:
(blank node) --http://erl.example.org/terms/editor--> "RDF Primer"
(blank node) --http://erl.example.org/terms/fullName--> "Frank Manola"
(blank node) --http://erl.example.org/terms/homePage--> http://www.objs.com/manola.htm
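One way to write such a graph down as text is the line-based N-Triples notation, where URI references appear in angle brackets, literals in double quotes, and a node without a URI gets a local label of the form _:name. The sketch below is a hand-written serializer under simplifying assumptions: the helper names and the label person1 are hypothetical, and only backslashes and double quotes are escaped in literals.

```python
def uri(value):
    return ("uri", value)


def literal(value):
    return ("literal", value)


def blank(label):
    return ("blank", label)


def term_to_nt(term):
    """Render one term in N-Triples notation (minimal escaping only)."""
    kind, value = term
    if kind == "uri":
        return f"<{value}>"
    if kind == "blank":
        return f"_:{value}"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


TERMS = "http://erl.example.org/terms/"
person = blank("person1")  # local label, meaningful only in this graph

graph = [
    (person, uri(TERMS + "editor"), literal("RDF Primer")),
    (person, uri(TERMS + "fullName"), literal("Frank Manola")),
    (person, uri(TERMS + "homePage"), uri("http://www.objs.com/manola.htm")),
]

nt_lines = [" ".join(term_to_nt(t) for t in triple) + " ." for triple in graph]
print("\n".join(nt_lines))
```

The three lines share the same label, which is what ties the properties to one and the same unnamed resource.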
1.3 Metadata
If we know a URI for the RDF Primer, we can describe its properties:
(blank node) --http://erl.example.org/terms/editor--> http://www.w3.org/TR/rdf-primer/
(blank node) --http://erl.example.org/terms/fullName--> "Frank Manola"
(blank node) --http://erl.example.org/terms/homePage--> http://www.objs.com/manola.htm
http://www.w3.org/TR/rdf-primer/ --http://erl.example.org/terms/title--> "RDF Primer"
1.3 Metadata
Rather than inventing all URIs and the terms in them, we could use those already invented:
Dublin Core: http://purl.org/dc/elements/1.1/
Example namespace of the RDF/XML Syntax Specification: http://example.org/stuff/1.0/
(blank node) --http://example.org/stuff/1.0/editor--> http://www.w3.org/TR/rdf-primer/
(blank node) --http://example.org/stuff/1.0/fullName--> "Frank Manola"
(blank node) --http://example.org/stuff/1.0/homePage--> http://www.objs.com/manola.htm
http://www.w3.org/TR/rdf-primer/ --http://purl.org/dc/elements/1.1/title--> "RDF Primer"
1.3 Metadata
XML Syntax for RDF
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:erl="http://erl.example.org/terms/">
  <rdf:Description rdf:about="http://acm.org/people/fmanola">
    <erl:editor>RDF Primer</erl:editor>
  </rdf:Description>
</rdf:RDF>
http://acm.org/people/fmanola --http://erl.example.org/terms/editor--> "RDF Primer"
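To see how the XML syntax maps back to the triple, the RDF/XML document can be read with an ordinary XML parser. The function below is a sketch that handles only this simple pattern (rdf:Description elements with rdf:about and literal-valued property elements); it is not a full RDF/XML parser, and the name simple_triples is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:erl="http://erl.example.org/terms/">
  <rdf:Description rdf:about="http://acm.org/people/fmanola">
    <erl:editor>RDF Primer</erl:editor>
  </rdf:Description>
</rdf:RDF>"""


def simple_triples(xml_text):
    """Extract (subject, predicate, literal) triples from simple RDF/XML."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF}}}Description"):
        subject = description.get(f"{{{RDF}}}about")
        for prop in description:
            # ElementTree spells a qualified name as "{namespace}local";
            # the predicate URI is the namespace name followed by the
            # local name.
            namespace, local = prop.tag[1:].split("}", 1)
            triples.append((subject, namespace + local, prop.text))
    return triples


triples = simple_triples(DOC)
print(triples)
```

The prefix erl is only a shorthand within the document; the triple itself contains the full property URI.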
1.3 Metadata
Ontology
formally defined concepts (with their relationships) in a domain
provides a shared understanding of the domain
different degrees of formality; SW goal: machine-understandable ontologies
may include rules for reasoning
A major portion of Semantic Web metadata can be regarded as various kinds of ontologies.
Languages to define ontologies: OWL, RDF Schema
1.3 Metadata
Semantic Web Case Studies and Use Cases
http://www.w3.org/2001/sw/sweo/public/UseCases/
1.4 Data interchange
Examples
• data exported from a relational database to an object-oriented database
• an invoice sent from one company to another
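The invoice case can be sketched as two independent functions that share nothing but the XML message. The invoice structure (element and attribute names) and the sample values are hypothetical; real invoice interchange would follow an agreed sectoral or local schema.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def export_invoice(number, buyer, lines):
    """Sender side: serialize invoice data to an XML string."""
    invoice = ET.Element("invoice", {"number": number})
    ET.SubElement(invoice, "buyer").text = buyer
    for description, amount in lines:
        line = ET.SubElement(invoice, "line", {"amount": str(amount)})
        line.text = description
    return ET.tostring(invoice, encoding="unicode")


def import_invoice(xml_text):
    """Receiver side: parse the XML string and compute the total."""
    invoice = ET.fromstring(xml_text)
    total = sum(Decimal(line.get("amount")) for line in invoice.findall("line"))
    return invoice.get("number"), invoice.findtext("buyer"), total


message = export_invoice(
    "2010-042",
    "Example Oy",
    [("Paper", Decimal("120.50")), ("Ink", Decimal("30.00"))],
)
received = import_invoice(message)
print(message)
print(received)
```

The receiver needs no knowledge of the sender's internal data structures, only of the agreed message format.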
1.4 Data interchange
Why XML format?
• integration of systems by a common user interface
• integration of services by a portal
• data transmission between software systems within an organization
• data transmission between software systems among distinct organizations
2. Benefits
• Collaborative standardization
• XML family of languages
• Variety of software
• Application-independent data assets
• Web-enabling
• Interoperability
2. Benefits
Collaborative standardization
Universal
• rules for wide use in different sectors and domains
• e.g. URI, Unicode, XML, XML Names, XML Schema, XSLT, XQuery
• development by standards organizations like W3C, IETF, ISO, Unicode Consortium, IANA
Sectoral
• rules for the adoption of XML for the purposes of a specific sector or application domain
• e.g. electronic commerce, health care, finance
• development by industry organizations like OASIS, XBRL International, HL7, or by public sector organizations like the Office of Public Sector Information in Great Britain
Local
• rules and practices for the adoption and implementation of XML in a particular organization or group of organizations
• development by the organizations
2. Benefits
Rich variety of software enabled by
• SGML inheritance
• Open standards
• Strong support for programming, e.g. DOM, SAX
• Collaborative development
• Modular standard development
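The two programming interfaces named above can be compared on the same input. DOM builds the whole document tree in memory and lets the program navigate it; SAX reports parsing events to a handler as the document is read. The sketch uses the DOM and SAX implementations in Python's standard library; the document content and the class name ArticleHandler are hypothetical.

```python
import xml.dom.minidom
import xml.sax

DOC = "<archive><article>First</article><article>Second</article></archive>"

# DOM: parse everything into a tree, then query the tree.
dom = xml.dom.minidom.parseString(DOC)
dom_titles = [
    node.firstChild.data for node in dom.getElementsByTagName("article")
]


# SAX: react to start/end/character events while the parser streams.
class ArticleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # collects text while inside an article

    def startElement(self, name, attrs):
        if name == "article":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "article":
            self.titles.append("".join(self._buffer))
            self._buffer = None


handler = ArticleHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_titles = handler.titles

print(dom_titles, sax_titles)
```

Both approaches yield the same result here; SAX tends to suit large documents that should not be held in memory at once, while DOM is more convenient when the program needs to move around the tree.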
2. Benefits
Interoperability
ability to exchange information in an efficient and uniform manner across software systems, organizations, technical environments, business processes, and cultural environments with their different natural languages
3. Challenges
• Continuous changes in XML-related specifications and software
• Parallel development of related / competing specifications at W3C and industry sectors
• Often need to use (and depend on) universal or sectoral level specifications before they are finalized
• Standards once implemented for an environment require maintenance
3. Challenges
• XML standardization in an organization often requires effective collaboration; good communication skills needed
• Developing and finding agreements about ontologies and document structures may be extremely hard
• In document standardization the document production practices and tools may radically change
• Finding efficient solutions for persistent storage of big XML data repositories may be difficult
Thank you!