01 - Intro to XML 1-1 · - Multi-channel publication (on paper and on the web), search,...

Date post: 06-Aug-2020
Transcript
Page 1: 01 - Intro to XML 1-1 · - Multi-channel publication (on paper and on the web), search, classification, identification Further activities- Consolidation, comparison, language synchronization,

9/9/09

1

Introduction to XML for parliamentary documents (and all other kinds of documents, actually)

Prof. Fabio Vitali University of Bologna

Part 1 Next: Parliamentary activities 2/38

Purpose of these slides

●  Introduce the principal aspects of electronic management of documents
    -  What we actually mean by documents (the FRBR hierarchy)
    -  What are the components of documents
    -  What do we mean by data and metadata about documents
●  Introduce some technologies related to electronic management of documents…
    -  XML
    -  DTDs
    -  XML Schema
    -  XSLT
    -  RDF and OWL
●  … all somehow connected and related to parliamentary documents (but not necessarily only to them)

Next: Computer support for parliamentary activities 3/38

Parliamentary activities

●  A complex production system that generates documents with different legal status:
    -  Bills and acts to become the law of a country
    -  Debate records (or hansards) to become a lasting log of the activities of the parliament
    -  Daily and weekly announcements, tablings and reports to organize and master the internal logistics
●  Each document is of the utmost importance and must be printed in large quantities and/or made available to a wide part of the population, daily and in a very limited amount of time

Next: How the Web can help 4/38

Computer support for parliamentary activities

●  Support for generating documents
    -  Drafting activities, record keeping, translation into national languages, etc.
●  Support for workflow
    -  Management of documents across lifecycle, storage, security, timely involvement of relevant individuals and offices
●  Support for citizens’ access
    -  Multi-channel publication (on paper and on the web), search, classification, identification
●  Further activities
    -  Consolidation, comparison, language synchronization, etc.

Next: XML 5/38

How the Web can help

●  The Web was born as a publishing medium
●  HTML helped make it a big success
●  HTML is constrained by its own simplicity
    -  Excessive reliance on typographic rather than semantic description
    -  Few rules, and even those are not strictly enforced
●  A new language, called XML, was invented to solve that
    -  Clear differentiation between appearance and meaning
    -  Strong syntactic rules, strictly enforced, to guarantee uniformity, homogeneity and sophisticated applications

Next: Parliamentary documents and XML 6/38

XML

●  XML (Extensible Markup Language) is a W3C standard that is very widely adopted.

●  XML is pure syntax, without pre-defined semantics. This allows document designers to provide their own semantics.

●  Thanks to the associated languages (DTD, XSLT, RDF) we can create sophisticated applications with great flexibility of use.

●  XML makes it possible to create markup languages that are readable, generic, structured and hierarchical.

Next: Why is XML good? 7/38

Parliamentary documents and XML

●  XML is ideal for representing parliamentary documents (and especially bills and acts):

-  They have a well-defined structure, which is systematic and standardized

-  There are required and optional parts according to rules and tradition

-  There are containment constraints that determine the global correctness of the document

-  There are references to other texts (schedules, other acts, etc.) that can fruitfully be used to create a hypertext network.

Next: Documents 8/38

Why is XML good?

[Figure: diagram with the labels "Energy / Information", "Conversion is difficult" and "Conversion is very easy"]

Documents

Next: 3 problems (2) 10/38

3 problems

●  When are two documents the same document?

●  What is important to record of a document?

●  How do we refer precisely to the normative content of the documents and of their parts?

Next: 3 problems (2) 11/38

3 problems (2)

●  When are two documents the same document?
    -  When they are different physical copies of the same document (two identical books)
    -  When they are different ways by which the same words appear (a MS Word file and its printout on paper)
    -  When they are two different sets of sentences with the same name and purpose (two versions of the same act)
●  What is important to record of a document?
●  How do we refer precisely to the normative content of the documents and of their parts?

Next: 3 problems (2) 12/38

3 problems (2)

●  When are two documents the same document?
●  What is important to record of a document?
    -  The words and punctuation it is composed of.
    -  The way in which it is shown on the page (pagination, typography, colors, margins and fonts)
    -  The conceptual role of each fragment (this sentence is a title, this is a clause, this is a reference, etc.)
●  How do we refer precisely to the normative content of the documents and of their parts?

Next: 3 solutions 13/38

3 problems (2)

●  When are two documents the same document?
●  What is important to record of a document?
●  How do we refer precisely to the normative content of the documents and of their parts?
    -  The meaning
    -  The words
    -  The name

Next: The IFLA FRBR hierarchy (1) 14/38

3 solutions

●  When are two documents the same document?
    -  The IFLA FRBR hierarchy: from abstract ideas to physical files
        ●  Work, Expression, Manifestation, Item
●  What is important to record of a document?
    -  The SGML components: from meaning to typography
        ●  Content, presentation, structure
●  How do we refer precisely to the normative content of the documents and of their parts?
    -  The semantic web approach: applying semantics where it fits
        ●  Data, metadata, ontology

Next: The IFLA FRBR hierarchy (2) 15/38

The IFLA FRBR hierarchy (1)

●  Work: a distinct intellectual creation.
●  Expression: the specific form in which a work is realized
    -  In our model, all variants and versions of a text that incorporate amendments and updates to an earlier version are considered expressions of the same work.
●  Manifestation: the representation of an expression according to the requirements of a medium
●  Item: a single exemplar of a manifestation
    -  In our model, a manifestation is a representation of an expression as an electronic document in a specific format
    -  All copies of the same (identical) manifestation are items. All items are accessible in a specific position on a specific computer.

Next: The IFLA FRBR hierarchy (3) 16/38

The IFLA FRBR hierarchy (2)

●  Work:
    -  The play “Hamlet” by William Shakespeare
    -  The Italian act #3 (5 January 2001)
●  Expression:
    -  The first quarto of “Hamlet” (1603);
    -  the first folio of “Hamlet” (1623);
    -  the movie version of “Hamlet” by Kenneth Branagh (1996)
    -  The original version of Italian act 3/2001;
    -  the amended version of Italian act 3/2001 as of 19/12/2003

Next: The SGML components (1) 17/38

The IFLA FRBR hierarchy (3)

●  Manifestation:
    -  One of the printed versions of the first folio version of “Hamlet” (e.g.: Penguin Books, 1994)
    -  One of the computer versions of “Hamlet” (e.g., Project Gutenberg)
    -  The NIR XML version of the amended version of Italian Act 3/2001 as of 19/12/2003
    -  The printed version of the original version of Italian Act 3/2001 in the Italian Gazette #2 (2001)
●  Item:
    -  My own copy of “Hamlet” by Penguin Books; the copy of “Hamlet” on Project Gutenberg’s own site
    -  The copy of the NIR XML version of Italian Act 3/2001 on my computer. The one I copy onto your computers.
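
The four FRBR levels nest naturally, so they can be sketched as a small data model. The following is only an illustration: the class and field names, and the item locations, are assumptions made for this example and are not part of FRBR or of the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    location: str  # where this particular copy lives (hypothetical values below)


@dataclass
class Manifestation:
    medium: str  # format or medium in which an expression is represented
    items: list[Item] = field(default_factory=list)


@dataclass
class Expression:
    label: str  # one specific version of the work
    manifestations: list[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str  # the abstract intellectual creation
    expressions: list[Expression] = field(default_factory=list)


def count_items(work: Work) -> int:
    """Count every copy (item) reachable from a work."""
    return sum(len(m.items) for e in work.expressions for m in e.manifestations)


act = Work(
    "Italian Act 3/2001",
    [
        Expression("original version", [
            Manifestation("printed gazette", [Item("a library shelf")]),
        ]),
        Expression("amended version as of 19/12/2003", [
            Manifestation("NIR XML", [
                Item("a file on my computer"),
                Item("a file on your computer"),
            ]),
        ]),
    ],
)
print(count_items(act))
```

One work, two expressions, two manifestations and three items: a reference to "the act" or to "the amended version" points at one of the upper levels, never directly at a file.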

Next: The SGML components (2) 18/38

The SGML components (1)

●  Content
    -  What exactly was written in the document.
    -  The content is composed of words, punctuation, sentences, images, paragraphs and so on.
●  Structure
    -  How the content is organized
    -  All documents have an internal organization, composed of subdivisions, hierarchies, preambles and conclusions, attachments, and so on. Within a paragraph, the structure also covers all parts that have a relevance (e.g. references, quotations, etc.)
●  Presentation
    -  The typographical choices to present a document on screen or on paper.

Next: The SGML components (3) 19/38

The SGML components (2)

●  The structure adds meaning to pieces of content.
    -  The text “Interpretation” assumes meaning once we know it is the title of article #2 of the Italian Act 3/2001
●  The structure connects the presentation to the content
    -  Once we know that the text “Interpretation” is the title of an article, we can apply the typographical choices associated with article titles.
●  The structure can be used to test the correctness of a document
    -  We can deduce that a document is incorrect if there is no title associated with an article.
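
The last point can be tried out with a few lines of Python. This is a minimal sketch under stated assumptions: the element names `act`, `article`, `title` and `clause` and the sample text are invented for the example, and the rule "every article has a title" is checked by hand rather than by a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the third article has no title.
SAMPLE = """
<act>
  <article id="art1"><title>Short title.</title><clause>Text of the clause.</clause></article>
  <article id="art2"><title>Interpretation.</title><clause>Text of the clause.</clause></article>
  <article id="art3"><clause>Text of the clause.</clause></article>
</act>
"""


def articles_without_title(xml_text: str) -> list[str]:
    """Return the ids of all <article> elements that lack a <title> child."""
    root = ET.fromstring(xml_text)
    return [
        article.get("id", "?")
        for article in root.iter("article")
        if article.find("title") is None
    ]


missing = articles_without_title(SAMPLE)
print(missing)
```

The check is only possible because the structure is explicit: in a plain-text or purely typographic version of the same act there would be no reliable way to tell which line is an article title.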

Next: The Semantic Web approach (1) 20/38

The SGML components (3)

●  The content itself can be divided into categories:
    -  Pure content
        ●  Appears in the document because it is instrumental to the message conveyed by the document. For instance, the text “THE RETIREMENT BENEFITS AUTHORITY”
        ●  This is what we really are interested in
    -  Structural content
        ●  Appears because it marks the beginning or the end of a structure. For instance, the text “Part II”
        ●  This can be used for deducing information about the structure
    -  Presentation-oriented content
        ●  Appears because it is dictated by the presentation choices of the document. For instance, page numbers and repeating headers.
        ●  This can be safely ignored and thrown away.

Next: The semantic web approach (2) 21/38

The Semantic Web approach (1)

●  Data:
    -  The actual text as provided initially by the author of the document
●  Metadata:
    -  Any consideration or comment or additional information that can be expressed on the content and on the document.
    -  Metadata is generated either by human intervention, or through automated processes.
●  Ontology (in short):
    -  A representation of the conceptual model that shapes all metadata associated with a document.

Next: Markup 22/38

The semantic web approach (2)

●  Authors’ contribution: data
    -  The words and punctuation and breaks, exactly as they were written and accepted by the original author (with legislation, the legislative body)
●  Editors’ contribution: metadata
    -  Publication data. Lifecycle information. Footnotes. Analysis of provisions.
    -  Metadata is useless unless it is provided following a precise schema, called an ontology.
        ●  In a way, editors are the authors of the metadata
        ●  Put another way, metadata is information about a document that was not provided by its authors.

Next: XML Markup (1) 23/38

Markup

●  We call markup the additions to a written text that can let us use applications to work on the text:

-  Structural markup -  Descriptive markup -  Presentation markup

●  With XML, we add markup to the text of a document so that further applications can work on it.

●  XML uses a special syntax to add markup and to distinguish it from the text

Next: XML Markup (2) 24/38

XML Markup (1)

●  XML markup clearly distinguishes elements, text (or #PCDATA) and attributes.
●  An element is delimited by a start tag and an end tag, which are distinguishable through angle brackets:
    -  <title>Interpretation.</title>
●  The content of an element can be
    -  Just text (simple text elements)
    -  Other elements (structural elements)
    -  A mix of text and other elements (mixed content elements)
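
The three kinds of content can be told apart programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `content_kind` and the sample `section` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def content_kind(element: ET.Element) -> str:
    """Classify an element by what it contains."""
    has_children = len(element) > 0
    # Text may sit before the first child (.text) or after any child (.tail).
    texts = [element.text] + [child.tail for child in element]
    has_text = any(t and t.strip() for t in texts)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "structural"
    return "simple text"  # an empty element also ends up here


section = ET.fromstring(
    "<section>"
    "<title>Interpretation.</title>"
    "<clause>See <ref>section 2</ref> of this act.</clause>"
    "</section>"
)
kinds = {child.tag: content_kind(child) for child in section}
print(content_kind(section), kinds)
```

Here `section` contains only other elements, `title` contains only text, and `clause` mixes text with a `ref` element.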

Next: Naming documents and fragments 25/38

XML Markup (2)

●  Within the start tag we can sometimes find attributes, i.e. additional information about the element
    ●  <act contains="SingleVersion"> … </act>
    -  A special attribute is "href", which indicates the destination of a reference
        ●  As in <ref href="#sec2">section 2</ref> of this act.
    -  Another special attribute is "id", which provides a reliable name for the element to be used in references
        ●  <clause id="sec1-cla1"> … </clause>
        ●  <section id="sec2"> … </section>
●  In a way, metadata is information about the document, while the attribute is information about the element
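
To see how "href" and "id" work together, the following sketch looks up the element that each internal reference points to. The sample act reuses the attribute values shown above, but the clause text and the function name `resolve_refs` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical act assembled from the fragments shown on the slide.
ACT = """
<act contains="SingleVersion">
  <section id="sec1">
    <clause id="sec1-cla1">As in <ref href="#sec2">section 2</ref> of this act.</clause>
  </section>
  <section id="sec2">
    <clause id="sec2-cla1">Text of the clause.</clause>
  </section>
</act>
"""


def resolve_refs(xml_text: str) -> dict:
    """Map each internal href to the tag of the element it names, or None."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    resolved = {}
    for ref in root.iter("ref"):
        href = ref.get("href", "")
        target = by_id.get(href[1:]) if href.startswith("#") else None
        resolved[href] = target.tag if target is not None else None
    return resolved


print(resolve_refs(ACT))
```

A reference whose "href" matches no "id" resolves to `None`, which is one simple way to detect dangling references inside a document.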

Next: Naming documents and fragments (2) 26/38

Naming documents and fragments

●  Uniform Resource Identifiers
    -  These are used throughout the World Wide Web to indicate resources.
    -  The best known are the URLs (Uniform Resource Locators) that are used to navigate on the web
        ●  http://www.akomantoso.org/09-examples
●  Fragment Identifiers
    -  Within a document, one can point to a specific part of the text through the fragment identifier
        ●  http://www.akomantoso.org/09-examples#part3
    -  This corresponds to an element whose attribute id is “part3”
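
The split between document address and fragment identifier can be shown with the Python standard library. The URL is the one from the slide; the small `DOCUMENT` string standing in for the resource it names is an assumption made for this example.

```python
from urllib.parse import urldefrag
import xml.etree.ElementTree as ET

# Separate the address of the document from the fragment identifier.
url, fragment = urldefrag("http://www.akomantoso.org/09-examples#part3")

# Hypothetical content of the document, used here instead of a real download.
DOCUMENT = (
    '<act><part id="part1"/><part id="part2"/>'
    '<part id="part3"><title>Final provisions</title></part></act>'
)
root = ET.fromstring(DOCUMENT)

# The fragment identifier names the element whose id attribute equals it.
target = root.find(f".//*[@id='{fragment}']")
print(url, fragment, target.tag)
```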

Next: Markups and languages 27/38

Naming documents and fragments (2)

●  In our case the situation is more complex. Works, expressions and manifestations are not physical resources, but abstract entities.

●  Yet, references are rarely (or never) to items, but to those concepts

●  So works, expressions and manifestations must have their own URI, which is not a URL (i.e., it does not correspond to a physical address on a computer)

●  Finding the URL of the item that best represents the manifestation we are looking for is called URI resolution.
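
URI resolution can be pictured as a lookup from an abstract identifier to the addresses of concrete items. The sketch below is purely illustrative: the URI syntax, the catalogue and the example.org addresses are all hypothetical and do not reproduce any actual naming convention.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical catalogue: abstract manifestation URI -> URLs of known items.
CATALOGUE = {
    "/it/act/2001/3/2003-12-19/nir.xml": [
        "http://mirror.example.org/acts/2001-3-v2003.xml",
        "http://archive.example.org/it/2001/act3/20031219.xml",
    ],
}


def resolve(uri: str, preferred_host: Optional[str] = None) -> Optional[str]:
    """Return the URL of the item judged to best represent the given URI."""
    urls = CATALOGUE.get(uri, [])
    if not urls:
        return None  # nothing is known about this URI
    if preferred_host is not None:
        for candidate in urls:
            if urlparse(candidate).hostname == preferred_host:
                return candidate
    return urls[0]  # fall back to the first known item


uri = "/it/act/2001/3/2003-12-19/nir.xml"
print(resolve(uri))
print(resolve(uri, preferred_host="archive.example.org"))
```

The same abstract URI can resolve to different URLs depending on policy (here, a preferred host), which is why references are better made to the abstract entity than to a specific file.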

Markups and languages

Next: Structured and hierarchical markup 29/38

Procedural and descriptive markup

●  With procedural markup we precisely indicate the task to apply to each fragment of text in order to, say, display the document.

●  We indicate bold, italic, font name, font size, margins, etc.

●  Basically, the actual usage determines the markup inserted in the text

●  With descriptive markup we precisely identify the structural or semantic roles of each text fragment.

●  Rather than bold or font size, we indicate aspects such as heading, caption, quotation, paragraph, reference, etc.

●  Basically, since structural and semantic roles are independent of usage, I fill the document with persistent information.

Next: Markup meta-language 30/38

Structured and hierarchical markup

●  Markup can be used to identify and exploit structures, i.e. organization of content in connected fragments. It is possible to identify rules to define a concept of correctness of text.

●  Structures can be suggested (descriptive markup) or required (prescriptive markup). Documents are correct (valid) if they adhere to the rules specified.

●  Some structures can be hierarchical. Legislative documents are often a hierarchy of containers.

●  Capturing correctly the hierarchy of containment is an important characteristic for markup languages.

Next: Document Type Definitions and XML schema languages 31/38

Markup meta-language

●  A meta-language is a language to define languages, a grammar to build new languages.
●  XML is not a markup language, but a language used to create markup languages.
●  XML does not provide suggestions on how to define specific aspects of a document: bold or italic or reference or paragraph. Rather, it provides a grammar with which such aspects can be defined in a new language.

Next: DTDs 32/38

Document Type Definitions and XML schema languages

●  The DTD or the XML Schema (XSD) are documents that describe an XML-based language.

●  They are the necessary step between the meta-language and language.

●  A schema document contains the list of allowed elements, attributes and repeatable document fragments (entities)

●  A schema document further contains the set of all constraints that elements and attributes must satisfy.

●  Constraints are expressed in terms of presence, repeatability and order.
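
Constraints on presence, repeatability and order behave much like regular expressions over the names of an element's children. The sketch below mimics that idea in Python; it is not a DTD or XSD validator, and the content models and element names are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models, written as regular expressions over the
# sequence of child element names (each name followed by one space).
CONTENT_MODELS = {
    "act": r"(title )(preamble )?(section )+",  # title, optional preamble, 1+ sections
    "section": r"(title )(clause )+",           # title first, then 1+ clauses
}


def violations(root: ET.Element) -> list[str]:
    """Return the id (or tag) of every element that breaks its content model."""
    bad = []
    for element in root.iter():
        model = CONTENT_MODELS.get(element.tag)
        if model is None:
            continue  # no constraint declared for this element
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            bad.append(element.get("id", element.tag))
    return bad


doc = ET.fromstring(
    "<act><title>An act</title>"
    '<section id="sec1"><title>T</title><clause>c</clause><clause>c</clause></section>'
    '<section id="sec2"><clause>c</clause><title>T</title></section>'
    "</act>"
)
print(violations(doc))
```

In the sample, `sec2` has the right children in the wrong order, so it is reported even though nothing is missing.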

Next: XML Schema 33/38

DTDs

●  DTD is the most basic validation language for XML documents.
    -  A W3C standard. Indeed, part of the XML language definition itself.
    -  Uses its own (odd) syntax
    -  Compact, easy to learn and manage
    -  Can stay with the XML document or be referred to by the XML document
    -  Adequately expressive on structures, less so on data content
    -  Universally known and used. All tools support it.

Next: Displaying XML documents: XSLT 34/38

XML Schema

●  XML Schema is another validation language:
    -  Also a W3C standard, but independent of XML (and independently evolving… version 1.1 is to be standardized later this year)
    -  Uses XML-based syntax
    -  Much longer and more precise, but harder to read and use
    -  Needs to stay outside of the XML document
    -  More precise both for structures and data content
        ●  You can require a date fragment to actually contain a valid date
    -  Also widely known, but fights against a number of competitors, among which Relax NG, an ISO standard.
    -  Aimed at cross-pollination between information engineering and database structuring.
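
The date example is the kind of data-content check that goes beyond structure. The sketch below imitates it by hand in Python rather than with a schema processor, assuming dates are written as YYYY-MM-DD inside a hypothetical `date` element.

```python
from datetime import date
import xml.etree.ElementTree as ET


def invalid_dates(xml_text: str, tag: str = "date") -> list[str]:
    """Return the text of every <date> element that is not a real calendar date."""
    root = ET.fromstring(xml_text)
    bad = []
    for element in root.iter(tag):
        text = (element.text or "").strip()
        try:
            date.fromisoformat(text)  # rejects malformed and impossible dates
        except ValueError:
            bad.append(text)
    return bad


SAMPLE = (
    "<act>"
    "<date>2001-01-05</date>"
    "<date>2001-02-30</date>"
    "<date>5 January 2001</date>"
    "</act>"
)
print(invalid_dates(SAMPLE))
```

A structure-only check would accept all three elements, since each is a well-placed `date`; only a data-aware check notices that 30 February does not exist and that the third value is not in the expected format.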

Next: Metadata and the Semantic Web 35/38

Displaying XML documents: XSLT

●  Displaying XML documents is a downstream activity: it is very easy

●  XSLT (Extensible Stylesheet Language Transformations) is used to generate displayable versions of XML documents.

●  XSLT is very flexible, and the same XML document can use many different XSLT stylesheets for different media and with different graphical layouts and typographical characteristics.

●  XSLT can be used for generating both online and print versions of the same document.
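
The principle of one source and many renderings can be illustrated without XSLT itself. The sketch below is a plain Python stand-in, not an XSLT stylesheet: two hypothetical functions, `to_html` and `to_plain_text`, each play the role of a stylesheet for a different medium, applied to an invented sample section.

```python
import xml.etree.ElementTree as ET
from html import escape

section = ET.fromstring(
    '<section id="sec2">'
    "<title>Interpretation.</title>"
    "<clause>In this act, words have their ordinary meaning.</clause>"
    "</section>"
)


def to_html(sec: ET.Element) -> str:
    """Render for the web: the title becomes a heading, clauses become paragraphs."""
    parts = [f"<h2>{escape(sec.findtext('title', ''))}</h2>"]
    parts += [f"<p>{escape(c.text or '')}</p>" for c in sec.iter("clause")]
    return "\n".join(parts)


def to_plain_text(sec: ET.Element) -> str:
    """Render for plain text: upper-case underlined title, indented clauses."""
    title = sec.findtext("title", "")
    lines = [title.upper(), "-" * len(title)]
    lines += [f"  {c.text or ''}" for c in sec.iter("clause")]
    return "\n".join(lines)


print(to_html(section))
print(to_plain_text(section))
```

Both outputs are derived from the same descriptive markup; neither the source document nor its structure has to change when a new output format is added.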

Next: Next 36/38

Metadata and the Semantic Web

●  Traditional Web technologies have only dealt with display on-screen (and, partially, on paper).

●  Metadata are information stored about documents, and can be used for proper cataloguing, classification, search, sophisticated applications.

●  The Semantic Web
    -  RDF, OWL, Ontologies, Topic Maps, etc.
    -  Connected initiatives to provide web applications with the capabilities to reason about, rather than just display, documents
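
A rough feel for "reasoning about" documents comes from storing metadata as subject, predicate, object statements, in the spirit of RDF. The sketch below uses plain Python tuples rather than an RDF library, and every identifier and property name in it is hypothetical.

```python
# Hypothetical metadata statements about an act and one of its versions.
TRIPLES = [
    ("act:3-2001", "rdf:type", "ex:Act"),
    ("act:3-2001", "ex:dated", "2001-01-05"),
    ("act:3-2001@2003-12-19", "ex:isVersionOf", "act:3-2001"),
    ("act:3-2001@2003-12-19", "ex:validFrom", "2003-12-19"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching the given pattern; None matches anything."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Which resources are recorded as versions of act 3/2001?
versions = [s for s, _, _ in match(TRIPLES, p="ex:isVersionOf", o="act:3-2001")]
print(versions)
```

The query works on the metadata alone, without opening or displaying the documents, which is the shift the Semantic Web technologies aim for.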

Next: Conclusions 37/38

Next

●  After the break we shall discuss
    -  The syntax of DTDs
    -  The basic ideas of XML Schema
    -  The fundamental concepts of XSLT
    -  A few points about metadata, metadata schemas, and ontologies

End of presentation 38/38

Conclusions

●  Markup languages are necessary for enriching data with information about the usages and the applications that can use the data
●  Descriptive markup is the best starting point for the creation of new markup languages.
●  XML is best among markup languages for several reasons:
    -  It is a non-proprietary, widely accepted standard
    -  It is structured, hierarchical, descriptive
    -  It allows both prescriptive and descriptive approaches
    -  Tools exist in all operating systems and computer architectures.

