XML for Information Management – Day 5

Airi Salminen

University of Erlangen-Nuremberg, Computational Linguistics

Instructor: Professor Airi Salminen, http://users.jyu.fi/~airi/

26.4.-30.4.2010


Outline

1. Types of XML use
   1.1 Document format
   1.2 Special kinds of primary data
   1.3 Metadata
   1.4 Data interchange
2. Benefits
3. Challenges


1. Types of XML use

Format for information assets

documents and Web pages

special kinds of primary data

metadata

Format for data interchange


1. Types of XML use

Examples in public sector

• Document format: Finnish Parliamentary documents

• Metadata format: XML schemas of the British e-GIF service (http://www.govtalk.gov.uk/)

• Data interchange format: Finnish e-Government services Suomi.fi, Lomake.fi


1.1 Document format

Examples

• Operation and maintenance manual of a paper machine in a paper factory

• Parliamentary documents

• Articles of a news archive

• Office documents (e.g. reports, memos, meeting notices and minutes, slides)

R. Jelliffe, Comparing XML office document formats: HTML, ODF, WordML, FO, Word2007, 2006, http://www.oreillynet.com/xml/blog/2006/07/comparing_xml_office_document.html


1.1 Document format

OpenDocument Format (ODF), http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office

• Developed at OASIS (Organization for the Advancement of Structured Information Standards)

• Open XML-based file format for office applications, to be used for documents containing text, spreadsheets, charts, and graphical elements

• OpenDocument v1.0 has been approved as an ISO standard

• Specification defined by three RELAX NG schemas

Competing formats: the XML formats originating from Microsoft, called Office Open XML (OOXML), approved as an ISO standard in 2008.


1.1 Document format

The office standards ODF and OOXML contain a number of built-in schemas for office documents, but in the office environments supporting ODF and OOXML you may also define and use user-defined schemas.


1.1 Document format

Why XML format?

• long-term accessibility of information

• independence from particular software providers

• consistency and correctness in documents

• interoperability of systems

• information retrieval, multichannel publishing


1.2 Special kinds of primary data

Examples

• audio, video, graphics

• mathematics

• chemical data

• music

• geospatial data

• health information

• humanistic data


1.2 Special kinds of primary data

Why XML format?

• to support data management in special application domains with specialized data content

• interoperability across various domains, devices, and software systems


1.2 Special kinds of primary data

XML Schema

May be used to define rich constraints for elements and attributes.

Datatypes to constrain attribute values and character data content of elements.

Built-in datatypes and user-derived datatypes.

Built-in datatypes:

string, normalizedString, token, language

decimal, float, double, integer, nonPositiveInteger, …

duration, dateTime, time, date, gYearMonth, gDay, gMonth

NMTOKEN, NMTOKENS, ID, IDREFS, …

anyURI, QName, boolean, …
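
To make the idea of datatype constraints more concrete, the following Python sketch checks whether a string is a plausible lexical form for a few of the built-in datatypes listed above. It is not an XML Schema validator. The function name `is_valid_lexical` and the simplified rules (for example, dates without time zones) are assumptions made only for this illustration.

```python
import re
from datetime import date

# Simplified lexical checks for a handful of XML Schema built-in datatypes.
# Assumption: time zone suffixes and other corner cases are ignored.
_INTEGER = re.compile(r"[+-]?[0-9]+")
_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")
_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_valid_lexical(datatype: str, text: str) -> bool:
    """Return True if text looks like a value of the named datatype."""
    if datatype == "boolean":
        return text in {"true", "false", "1", "0"}
    if datatype == "integer":
        return _INTEGER.fullmatch(text) is not None
    if datatype == "nonPositiveInteger":
        return _INTEGER.fullmatch(text) is not None and int(text) <= 0
    if datatype == "decimal":
        return _DECIMAL.fullmatch(text) is not None
    if datatype == "date":
        # Only the plain YYYY-MM-DD form is accepted in this sketch.
        if _DATE.fullmatch(text) is None:
            return False
        try:
            date.fromisoformat(text)
        except ValueError:
            return False
        return True
    raise ValueError(f"datatype not covered by this sketch: {datatype}")
```

A schema processor applies checks of this kind to attribute values and element content automatically, once the schema declares which datatype applies where.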


1.2 Special kinds of primary data

Graphics: SVG

Mozilla Firefox and Opera provide capabilities for viewing separate SVG documents or SVG fragments on HTML/XHTML pages.

Multimedia & other time-dependent data

SMIL (Synchronized Multimedia Integration Language)

VoiceXML (Voice Extensible Markup Language)

SSML (Speech Synthesis Markup Language)

EMMA (Extensible MultiModal Annotation markup language)

TTML (Timed Text Markup Language)

A SMIL example is available at the w3schools.com site; Internet Explorer shows the animation.


1.2 Special kinds of primary data

Geospatial data

OGC (Open Geospatial Consortium) is the most important standardization organization for geographic XML applications

GML (Geography Markup Language)

KML (Keyhole Markup Language, developed and used by Google)

GeoXACML (Geospatial eXtensible Access Control Markup Language)

CityGML (City Geography Markup Language)


1.2 Special kinds of primary data

Humanistic data

most of the content consists of natural language text

content produced by writers, poets, researchers, teachers

published in books, journals, and other publications

analyzed by humanistic researchers (literature, linguistics, history, religion, philosophy, languages, etc.)


1.2 Special kinds of primary data

Electronic books

PDF is evidently the most popular distribution format today

HTML/XHTML is also used

EPUB is an XML-based standard developed by IDPF (International Digital Publishing Forum, www.idpf.org)

EPUB is intended to improve the interoperability of different reading devices and software products

Books in EPUB format are available, for example, at www.gutenberg.org


1.2 Special kinds of primary data

TEI (Text Encoding Initiative, www.tei-c.org)

intended to provide a framework for encoding “any genre of text from any period in any language”, primarily for humanistic research

versions P1-P3 are based on SGML, version P4 included rules both for SGML markup and XML markup, version P5 provides rules for XML markup only

markup decisions made by a text encoder who has to be an expert of the text domain

there can be very different kinds of markup decisions for the same text


1.2 Special kinds of primary data

The TEI markup rules are provided in 22 modules, which together define more than 500 elements and their attributes. Examples of modules:

core: elements common to all TEI documents, including elements for paragraphs, denoting emphasized or foreign text, and quotations

verse: verse structures to encode verse lines and line groups

drama: performance texts for printed dramatic texts, screen plays, or radio scripts

dictionaries: monolingual and multilingual dictionaries and glossaries

TEI also provides the procedures by which to choose, customize, and extend the modules for specific needs.


1.2 Special kinds of primary data

Characteristic feature in many humanistic texts: structural complexity

there can be many different kinds of markup decisions for the same text

the hierarchic structure of XML documents is not always well suited; special solutions needed, for example, for non-hierarchic, overlapping, and discontinuous structures

TEI examples: TEI By Example project


1.3 Metadata

Metadata concept

data about data

“structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource” (Understanding Metadata, at www.niso.org)

Another good introduction:

Introduction to Metadata, at www.getty.edu


1.3 Metadata

XML as metadata vs. XML for metadata

XML markup in a document serves as metadata in relationship to the character data and unparsed entities (that together can be called the primary content of the document).

When XML is used for metadata, then the metadata may be related to any information object. The metadata may be provided as a separate XML document, or, if the information object is an XML document, then the metadata may be embedded in the document.


1.3 Metadata

Examples of metadata

• information about each document in an organization’s archive, including the author, date of creation, version, and keywords for each document

• access control, distribution information, and retention information for each active document in an organization

• the schemas for the XML documents in an organization

• the relational and other database schemas of the organization

• transaction logs for an organization, whether related to document updates or inventory control

• a dictionary of essential terms used in the organization

• an ontology (concepts and their relationships) of a particular domain


1.3 Metadata

Examples of the use of XML for metadata

the meta element in an XHTML document provides metadata about the document

METS (Metadata Encoding and Transmission Standard) and MODS (Metadata Object Description Schema) for the metadata of library objects

LOM (Learning Object Metadata) for the management of learning objects

examples can be found with Google, for example with the queries:

• filetype:xml mets

• filetype:xml mods
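
As a small illustration of metadata embedded in a document, the following Python sketch reads the name/content pairs of the meta elements from an XHTML page using the standard library. The page text, the author name in it, and the helper name `read_meta` are made up for this example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A made-up XHTML page used only for this illustration.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Course notes</title>
    <meta name="author" content="Example Author"/>
    <meta name="keywords" content="XML, metadata"/>
  </head>
  <body><p>Primary content.</p></body>
</html>"""


def read_meta(xhtml_text: str) -> dict:
    """Collect the name/content pairs of all meta elements in the page."""
    root = ET.fromstring(xhtml_text)
    pairs = {}
    # ElementTree writes qualified names as "{namespace}local-name".
    for meta in root.iter(f"{{{XHTML_NS}}}meta"):
        name = meta.get("name")
        if name is not None:
            pairs[name] = meta.get("content", "")
    return pairs


metadata = read_meta(PAGE)
```

Here `metadata` holds the author and keywords, while the paragraph in the body remains the primary content of the document.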


1.3 Metadata

Why XML format?

• integration of data, systems, and services

• building Semantic Web

• long-term accessibility of information

• security and trust


1.3 Metadata

The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.

W3C Semantic Web Activity, http://www.w3.org/2001/sw/


1.3 Metadata

Semantic Web = Web resources + rich metadata


1.3 Metadata

SW architecture

[Figure: primary resources, metadata resources, applications, Semantic Web technologies]


1.3 Metadata

SW architecture

[Figure, with the following labels:]

primary resources

DTDs, XML Schemata, RDF Schemata, RDF Repositories, Ontologies, Annotations

applications

URI, Unicode, XML, XML Namespaces, XML Schema, RDF, RDF Schema, XML-Signature, OWL, ...


1.3 Metadata

The Resource Description Framework (RDF) is a model and a language for representing information about resources that can be identified on the Web.

RDF is based on the idea of identifying things using URIs, and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources.

RDF Primer


1.3 Metadata

A statement about Frank Manola:

Frank Manola is an editor of RDF Primer

The statement consists of a resource (Frank Manola), a property (editor), and a value (RDF Primer).

More formal presentation: (Frank Manola, editor, RDF Primer)

A still more formal presentation is needed for the Web.


1.3 Metadata

resource in RDF

Anything that has an identity given by a URI.

Resources are referred to by their URIs.

For example, to be able to make statements about a person whose name is Frank Manola, we need a URI for him.

PROBLEM: How do we identify individual people by URIs? So far, there is no universal URI identification mechanism for individuals.


1.3 Metadata

RDF graph

a set of (subject, predicate, object) triples, graphically described by nodes and directed arcs

subject (resource): URI reference (or empty node)

predicate (property): URI reference

object (property value): URI reference or literal (or empty node)

[Figure: a subject node connected to an object node by a directed arc labeled with the predicate]


1.3 Metadata

• Let us identify Frank Manola by http://acm.org/people/fmanola

• Let us identify the property editor by http://erl.example.org/terms/editor

• Let us represent RDF Primer by the literal character string "RDF Primer"

[Graph: http://acm.org/people/fmanola --[http://erl.example.org/terms/editor]--> "RDF Primer"]


1.3 Metadata

If we do not know a URI for Frank Manola but we know something about him, we can use an empty node to gather properties about him:

[Graph: an empty node with three outgoing arcs]

--[http://erl.example.org/terms/editor]--> "RDF Primer"

--[http://erl.example.org/terms/fullName]--> "Frank Manola"

--[http://erl.example.org/terms/homePage]--> http://www.objs.com/manola.htm
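
The triples and the empty node described above can be sketched as a small Python data structure. This is only an illustration of the model, not an RDF library: the class names (`URIRef`, `Literal`, `BlankNode`, `Triple`), the helper `describe`, and the node label `b0` are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URIRef:
    uri: str


@dataclass(frozen=True)
class Literal:
    text: str


@dataclass(frozen=True)
class BlankNode:
    # An empty node has no URI, only a label local to this graph.
    label: str


@dataclass(frozen=True)
class Triple:
    subject: Union[URIRef, BlankNode]
    predicate: URIRef
    object: Union[URIRef, Literal, BlankNode]


def describe(graph: set, subject) -> dict:
    """Return {predicate URI: object} for every triple about the subject."""
    return {t.predicate.uri: t.object for t in graph if t.subject == subject}


TERMS = "http://erl.example.org/terms/"
person = BlankNode("b0")  # stands for Frank Manola, whose URI we do not know

graph = {
    Triple(person, URIRef(TERMS + "editor"), Literal("RDF Primer")),
    Triple(person, URIRef(TERMS + "fullName"), Literal("Frank Manola")),
    Triple(person, URIRef(TERMS + "homePage"),
           URIRef("http://www.objs.com/manola.htm")),
}

properties = describe(graph, person)
```

Keeping URI references and literals as separate types mirrors the rule that a predicate must be a URI reference, while an object may be a URI reference, a literal, or an empty node.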


1.3 Metadata

If we know the URI for the RDF Primer, we can describe its properties:

[Graph. Nodes: an empty node, "Frank Manola", http://www.objs.com/manola.htm, http://www.w3.org/TR/rdf-primer/, "RDF Primer". Arcs: http://erl.example.org/terms/editor, http://erl.example.org/terms/fullName, http://erl.example.org/terms/homePage, http://erl.example.org/terms/title]


1.3 Metadata

Rather than inventing all URIs and the terms in them, we could use those already invented:

Dublin Core: http://purl.org/dc/elements/1.1/

Example namespace of the RDF/XML Syntax Specification: http://example.org/stuff/1.0/

[Graph. Nodes: an empty node, "Frank Manola", http://www.objs.com/manola.htm, http://www.w3.org/TR/rdf-primer/, "RDF Primer". Arcs: http://example.org/stuff/1.0/editor, http://example.org/stuff/1.0/fullName, http://example.org/stuff/1.0/homePage, http://purl.org/dc/elements/1.1/title]


1.3 Metadata

XML Syntax for RDF

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:erl="http://erl.example.org/terms/">
  <rdf:Description rdf:about="http://acm.org/people/fmanola">
    <erl:editor>RDF Primer</erl:editor>
  </rdf:Description>
</rdf:RDF>

[Graph: http://acm.org/people/fmanola --[http://erl.example.org/terms/editor]--> "RDF Primer"]
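
To show how the XML syntax maps back to a triple, the following Python sketch reads the RDF/XML document above with the standard library. It handles only this simplest form (rdf:Description elements with an rdf:about attribute and literal-valued property elements); the function name `simple_triples` is an assumption, and a real application would more likely use a dedicated RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:erl="http://erl.example.org/terms/">
  <rdf:Description rdf:about="http://acm.org/people/fmanola">
    <erl:editor>RDF Primer</erl:editor>
  </rdf:Description>
</rdf:RDF>"""


def simple_triples(rdf_xml: str) -> list:
    """Extract (subject, predicate, literal) triples from basic RDF/XML."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree tags look like "{namespace}local"; the predicate
            # URI is the namespace name followed by the local name.
            namespace, local = prop.tag[1:].split("}", 1)
            triples.append((subject, namespace + local, prop.text))
    return triples


triples = simple_triples(RDF_XML)
```

The predicate URI http://erl.example.org/terms/editor never appears literally in the document; it results from joining the namespace name bound to the erl prefix with the element's local name.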


1.3 Metadata

Ontology

formally defined concepts (with their relationships) of a domain

provides a shared understanding of the domain

different degrees of formality; SW goal: machine-understandable ontologies

may include rules for reasoning

A major portion of Semantic Web metadata can be regarded as various kinds of ontologies.

Languages to define ontologies: OWL, RDF Schema


1.3 Metadata

Semantic Web Case Studies and Use Cases

http://www.w3.org/2001/sw/sweo/public/UseCases/


1.4 Data interchange

Examples

• data exported from a relational database to an object-oriented database

• an invoice sent from a company to another company
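
The invoice example can be sketched in Python as follows: the sending system serializes an invoice to XML, and the receiving system parses the message without sharing any code with the sender. The element and attribute names (`invoice`, `seller`, `buyer`, `line`, `amount`) and the function names are invented for this illustration and do not follow any real invoice standard.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def invoice_to_xml(number: str, seller: str, buyer: str, lines: list) -> str:
    """Sender side: lines is a list of (description, amount) pairs."""
    invoice = ET.Element("invoice", {"number": number})
    ET.SubElement(invoice, "seller").text = seller
    ET.SubElement(invoice, "buyer").text = buyer
    for description, amount in lines:
        line = ET.SubElement(invoice, "line", {"amount": f"{amount:.2f}"})
        line.text = description
    return ET.tostring(invoice, encoding="unicode")


def total_from_xml(xml_text: str) -> Decimal:
    """Receiver side: sum the line amounts found in the message."""
    invoice = ET.fromstring(xml_text)
    return sum((Decimal(line.get("amount")) for line in invoice.findall("line")),
               Decimal("0"))


message = invoice_to_xml(
    "2010-042", "Company A", "Company B",
    [("Paper, 10 reams", Decimal("45.00")), ("Toner", Decimal("80.50"))],
)
total = total_from_xml(message)
```

The only thing the two sides need to agree on is the structure of the message, which in practice would be fixed by a shared schema.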


1.4 Data interchange

Why XML format?

• integration of systems by a common user interface

• integration of services by a portal

• data transmission between software systems within an organization

• data transmission between software systems among distinct organizations


2. Benefits

• Collaborative standardization

• XML family of languages

• Variety of software

• Application-independent data assets

• Web-enabling

• Interoperability


2. Benefits

Collaborative standardization

Universal

• rules for wide use for different sectors and domains

• e.g. URI, Unicode, XML, XML Names, XML Schema, XSLT, XQuery

• development by standards organizations like W3C, IETF, ISO, Unicode Consortium, IANA

Sectoral

• rules for the adoption of XML for the purposes of a specific sector or application domain

• e.g. electronic commerce, health care, finance

• development by industry organizations like OASIS, XBRL International, HL7, or by public sector organizations like Office of Public Sector Information in Great Britain

Local

• rules and practices for the adoption and implementation of XML in a particular organization or group of organizations

• development by the organizations


2. Benefits

Rich variety of software enabled by

• SGML inheritance

• Open standards

• Strong support for programming, e.g. DOM, SAX

• Collaborative development

• Modular standard development
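
The two programming interfaces named above differ in style, which the following Python sketch illustrates using the standard library modules `xml.dom.minidom` and `xml.sax`. The sample document and the function and class names are made up for this example.

```python
import xml.sax
from xml.dom import minidom

# A made-up memo used only for this illustration.
DOC = "<memo><to>Staff</to><to>Board</to><body>Meeting at 10.</body></memo>"


def recipients_dom(xml_text: str) -> list:
    """DOM: build the whole tree in memory, then navigate it."""
    document = minidom.parseString(xml_text)
    return [node.firstChild.data
            for node in document.getElementsByTagName("to")]


class RecipientHandler(xml.sax.ContentHandler):
    """SAX: react to parsing events as the document streams past."""

    def __init__(self):
        super().__init__()
        self.recipients = []
        self._inside_to = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "to":
            self._inside_to = True
            self._buffer = []

    def characters(self, content):
        # Character data may arrive in several chunks, so collect them.
        if self._inside_to:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "to":
            self.recipients.append("".join(self._buffer))
            self._inside_to = False


def recipients_sax(xml_text: str) -> list:
    handler = RecipientHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.recipients
```

Both functions return the same list of recipients. The DOM version is shorter, while the SAX version never holds the whole document in memory, which matters for large documents.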


2. Benefits

Interoperability

ability to exchange information in an efficient and uniform manner across software systems, organizations, technical environments, business processes, and cultural environments with their different natural languages


3. Challenges

• Continuous changes in XML-related specifications and software

• Parallel development of related / competing specifications at W3C and industry sectors

• Often need to use (and depend on) universal or sectoral level specifications before they are finalized

• Standards once implemented for an environment require maintenance


3. Challenges

• XML standardization in an organization often requires effective collaboration; good communication skills needed

• Developing and finding agreements about ontologies and document structures may be extremely hard

• In document standardization the document production practices and tools may radically change

• Finding efficient solutions for persistent storage of big XML data repositories may be difficult


Thank you!

