XML Concepts
Prof. Andrea Omicini
DEIS, Ingegneria Due
Alma Mater Studiorum, Università di Bologna a Cesena
andrea.omicini@unibo.it
Slides: lia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf

Outline
  Introducing XML
  XML Fundamentals
  Document Type Definitions (DTDs)
  Namespaces
  Internationalisation
  XML & CSS
  DOM & SAX

Introducing XML

What is XML?
A W3C Standard
  http://www.w3.org/XML/
A mark-up language for text documents
  derived from SGML (Standard Generalized Markup Language)
    ISO 8879, http://www.iso.ch/cate/d16387.html
eXtensible Markup Language
A meta-markup language
  to define markup languages
  such as XHTML, XSLT, XML Schema…
A formally-defined, text-based language
  verifiable for well-formedness and validity
  usable across platforms and technologies

What XML is not
XML is not
  a programming language
  a network-transport protocol
  a document presentation language
  a database (manager)
It can be used (and actually is) in all of those contexts, but it remains a markup language

Why Markup?
Markup
  encoding embodied in the document, specifying document properties as well as properties of the information contained
    for instance, formatting instructions
    more generally, structural / semantic information
  knowledge vs. data
Marks / Markups
  tags used to qualify / label text chunks
  e.g., HTML tags
XML example
  <student>
    <studentname>
      <name>Carlo</name>
      <surname>Nervo</surname>
    </studentname>
    <studentnumber>0000145678</studentnumber>
  </student>

XML: X for eXtensible
Basic idea of XML
  a simple meta-language for humans and automata
  to build electronic documents
  allowing users to define ad hoc markup languages
Then
  XML is quite free in general
  it can be “extended”
    actually, specialised
    to define more specific, ad hoc markup languages
No predefined XML markups, as happens instead in HTML
  they need to be defined

Hey, too many…
Application domains are more and more
  numerous
  complex
  specific
Special / specialised languages as the engineer’s tools
  to represent, denote & express behaviours and computations
Engineers working with computational / ICT systems will be called to use a number of different artificial languages, but also
  to know and understand computational models and paradigms
  to select languages and paradigms
  to define and build new languages

XML Applications
XML per se is “small” & simple
  languages defined via XML are instead many and complex
XML Applications
  XML-defined markup languages
    defined through a precise syntax
    DTD or XML Schema
  they may be either standard or custom
Most standard XML applications are W3C
  such as
    XSLT
    XML Schema
    XHTML

XML for Portable…
Cross-platform, long-term data format
  passing XML data through space and time
  along with Unicode and text-based standard formats
Text, text, text
  both data and markup
  all in the XML file
XML document structure
  simple & clear
  easy to parse
  well-documented
That is why XML is already everywhere

How XML Looks Like
  <?xml version="1.0" encoding="utf-8"?>
  <docroot>
    <head>
      <title>This is my document</title>
    </head>
    <body>
      <p>A list of things I like</p>
      <list>
        <item>weekends</item>
        <item>good beer</item>
        <item>midnight snacks</item>
        <item>ice cream
          <list>
            <item>chocolate</item>
            <item>cookie dough</item>
            <item>white russian</item>
          </list>
        </item>
        <item>shade trees</item>
      </list>
    </body>
  </docroot>

How XML Looks Like

How to Work with XML
XML is text
  so any text editor is perfectly fine
A number of XML editors are around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works

What is an XML Document?
It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text
  <student>
    <studentname>
      <name>Carlo</name>
      <surname>Nervo</surname>
    </studentname>
    <studentnumber>0000145678</studentnumber>
    <course>2036</course>
  </student>

How does XML Work?
Who handles XML documents
  after they have been produced?
  how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets

Where is XML?
Everywhere, already

Some History of XML
A lot is still to be written…
SGML is where it comes from
  HTML was the first successful application of SGML
    but had obvious limitations
  too complex
    more than 150 pages
    never implemented fully
  too complex for the Internet
SGML “Lite” (1996, Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

XML Fundamentals

A Simple XML Document
  <player>
    Carlo Nervo
  </player>

XML Document & …
This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

XML Elements & …
The document contains a single element
  of type player
Such an element is delimited by the tag player
  between the start tag <player> and the end tag </player>
In between the tags lies the element’s content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by the end tag </Player>
  NOTE: thus, pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

                 player
    name    surname    team       team
    Carlo   Nervo      Bologna    Mantova

An XML Document is a Tree
An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
    each element, except the root, has exactly one parent
    child elements of the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed
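The nesting rule above can be tried out with any XML parser. The following is only a minimal sketch, assuming the JAXP parser bundled with the JDK; the class and method names (WellFormedSketch, isWellFormed) are made up for illustration and are not part of the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not from the slides, just an illustration.
class WellFormedSketch {
    // Returns true when the given text parses as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a well-formedness error
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem, not an XML problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<em><b>Right</b> XML</em>"));
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>"));
    }
}
```

A non-validating parser is enough here: well-formedness is checked by every XML parser, with or without a DTD.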

Narrative-Organised XML Documents
  <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.

    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

    …
  </biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and…
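As an illustration of attributes qualifying nodes without changing the tree, the sketch below reads the current attribute of the team elements from the player example. It assumes the JDK's built-in DOM parser; AttributeSketch and currentTeam are hypothetical names introduced only for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not from the slides, just an illustration.
class AttributeSketch {
    // Returns the text of the first <team> whose current attribute is "yes", or null if none.
    static String currentTeam(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList teams = doc.getElementsByTagName("team");
            for (int i = 0; i < teams.getLength(); i++) {
                Element team = (Element) teams.item(i);
                // getAttribute returns "" when the attribute is absent
                if ("yes".equals(team.getAttribute("current"))) {
                    return team.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Single quotes and spaces around = are accepted alternative forms.
        String xml = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current = 'yes'>Bologna</team>"
                + "<team current=\"no\">Mantova</team></player>";
        System.out.println(currentTeam(xml));
    }
}
```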

Using Elements or Attributes?
Attributes are for meta-data about the element, and content is the information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any…
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>

XML Names
XML Names are used, with the same rules, for the names of elements, attributes, and some other constructs
  to increase efficiency and reduce complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:), which is reserved for namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore

Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse start again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

Entity References
&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
    used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we’ll see how soon, for instance through DTDs

Pre-defined XML Entities
  Markup    Entity    Description
  &lt;      <         less-than
  &gt;      >         greater-than
  &amp;     &         ampersand
  &quot;    "         double quote
  &apos;    '         single quote
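The five pre-defined entities can be applied mechanically when producing XML from arbitrary text. The sketch below is one possible way to do it, under the assumption that the JDK's built-in parser is used for the round trip; EscapeSketch, escape and parsedContent are illustrative names, not an API from the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: not from the slides, just an illustration.
class EscapeSketch {
    // Replaces the five special characters with the pre-defined entity references.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Wraps already-escaped text in an element, parses it, and returns the parsed character data.
    static String parsedContent(String escaped) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<superheroes>" + escaped + "</superheroes>")));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(escape("Batman & Robin"));
        System.out.println(parsedContent("Batman &amp; Robin"));
    }
}
```

Escaping and then parsing gives back the original text, which is the behaviour the Batman & Robin example describes.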

CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser “do not parse this”
  good, for instance, to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
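To see that a CDATA section and escaped text yield the same character data, the following sketch parses both forms and compares the results. It assumes the JDK's DOM parser; CdataSketch and content are hypothetical names chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: not from the slides, just an illustration.
class CdataSketch {
    // Returns the character data of the root element, whatever markup produced it.
    static String content(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String viaCdata = content("<code><![CDATA[if (a < b && c) {}]]></code>");
        String viaEntities = content("<code>if (a &lt; b &amp;&amp; c) {}</code>");
        System.out.println(viaCdata.equals(viaEntities));
    }
}
```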

Comments
Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or another comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give information to computational agents: processing instructions

XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions serve this very purpose
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root

The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but, if present, it must be the very first thing in the document
    not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the encoding of the text (UTF-8, a Unicode encoding, in the example)
  optional, default Unicode
standalone="yes" means that the document has no external DTD
  optional, default no

Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around

DTD

Flexibility or…
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like “a football player should have at least one team”
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional

Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…
SGML-based
  syntax a bit awkward
  but, after all, easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
We do not go too deep into DTD syntax
  we just look at the example above and comment

DTD Declaration
In the example, the DTD is declared as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

DTD Declarations
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
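How a validating parser applies these declarations can be sketched as follows. This is only an illustration under some assumptions: it uses the JDK's JAXP parser with validation switched on, it names the DOCTYPE player so that it matches the root element (which validity requires), and DtdValidationSketch and validityErrors are made-up names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper: not from the slides, just an illustration.
class DtdValidationSketch {
    // Internal DTD subset for the player example (DOCTYPE name assumed to be the root element name).
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Parses with DTD validation on and returns the validity errors (empty list = valid).
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity violations are reported as recoverable errors.
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                // Well-formedness violations are fatal.
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String noTeam = DTD + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(validityErrors(noTeam)); // at least one error: team+ requires a team
    }
}
```

Collecting errors instead of stopping at the first one reflects the point that validity, unlike well-formedness, is not mandatory: the parser can go on and the application decides what to do.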

Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements

Attribute Declarations
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal, quoted string

Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS

Other DTD Declarations
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually?
We stop here
  more only for those who need it

Namespaces

What are Namespaces?
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespaces
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not…

Setting Default Namespaces
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need for the svg prefix to be made explicit
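That the URI, not the prefix, identifies a namespace can be checked with a namespace-aware parser. The sketch below assumes the JDK's DOM parser with namespace awareness enabled; NamespaceSketch and describeRoot are illustrative names only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: not from the slides, just an illustration.
class NamespaceSketch {
    // Returns { namespace URI, local part, prefix } of the root element.
    static String[] describeRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return new String[] { root.getNamespaceURI(), root.getLocalName(), root.getPrefix() };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Default namespace: no prefix on the element.
        String[] byDefault = describeRoot("<svg xmlns=\"http://www.w3.org/2000/svg\"/>");
        // Same namespace, bound to an arbitrary prefix.
        String[] byPrefix = describeRoot("<s:svg xmlns:s=\"http://www.w3.org/2000/svg\"/>");
        System.out.println(byDefault[0].equals(byPrefix[0]) && byDefault[1].equals(byPrefix[1]));
    }
}
```

Both documents denote the same element: same namespace URI, same local part, only the prefix differs.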

Internationalisation

What does “Text” Mean?
“Text” can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so its encoding should be declared
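The difference between a character set and its encodings shows up as soon as the same text is turned into bytes. A small sketch, using only the standard java.nio.charset classes; EncodingSketch and byteCount are names invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: not from the slides, just an illustration.
class EncodingSketch {
    // Number of bytes needed to represent the text in the given encoding.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String aGrave = "\u00e0"; // the final letter of "Università", code point U+00E0
        System.out.println(byteCount(aGrave, StandardCharsets.UTF_8));      // Unicode, 8-bit units
        System.out.println(byteCount(aGrave, StandardCharsets.UTF_16BE));   // Unicode, 16-bit units, no byte-order mark
        System.out.println(byteCount(aGrave, StandardCharsets.ISO_8859_1)); // Latin-1
    }
}
```

The same code point takes a different number of bytes in each encoding, which is why a parser needs to be told which one a document uses.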

The XML Encoding Declaration
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is then a text declaration, no longer an XML declaration

Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there either, such as for user-defined tags
    prefix x-

Encoding for Portability
Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining the presentation style for the XML document tags
  tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises It

DOM & SAX

Manipulating XML Documents
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
  access / delete / modify
  parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model
http://www.w3.org/DOM/
  standard W3C, as usual
“The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents.”
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes they should / could have

Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java

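A Simple Java DOM Example
  // DOMParser here is presumably the Xerces class org.apache.xerces.parsers.DOMParser;
  // Document and Node come from org.w3c.dom; print(n, out) is a helper not shown on the slide.
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // scan the children of the root for the first "recipe" node
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }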
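The same traversal can be tried without any external library. The following is a sketch rather than the slide's program: it swaps the DOMParser class for the JAXP DocumentBuilder bundled with the JDK, keeps the sibling-scanning loop of the slide, and uses a made-up recipe collection as input; DomSketch, parse and firstChildNamed are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: not from the slides, just an illustration.
class DomSketch {
    // Loads the whole document in memory as a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Scans the children of the root element and returns the first one with the given name, or null.
    static Node firstChildNamed(Document doc, String name) {
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals(name)) {
            n = n.getNextSibling();
        }
        return n;
    }

    public static void main(String[] args) {
        // Sample data invented for the example.
        Document doc = parse("<collection><description>Some recipes</description>"
                + "<recipe>Pasta</recipe><recipe>Pizza</recipe></collection>");
        Node recipe = firstChildNamed(doc, "recipe");
        System.out.println(recipe == null ? "no recipe" : recipe.getTextContent());
    }
}
```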

Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn’t it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM is
  in terms of standardisation
  as well as of tools
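The event-based model can be made concrete by logging the events a SAX parser generates while the document flows through it. A minimal sketch, assuming the SAX parser shipped with the JDK; SaxSketch and events are illustrative names, and the log format is arbitrary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not from the slides, just an illustration.
class SaxSketch {
    // Returns the sequence of SAX events generated while reading the document.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver text in several chunks; white-space-only chunks are skipped here.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: the handler only sees one event at a time, so memory use does not grow with the size of the document unless the application itself keeps data around (as the log does here).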

Page 2: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Introducing XMLXML FundamentalsDocument Types Definitions (DTDs)NamespacesInternationalisationXML amp CSSDOM amp SAX

Outline

2

Introducing XML

3

What is XMLA W3C Standard

httpwwww3orgXMLA mark-up language for text documents

derived from SGML (Standard General Markup Language)

ISO 8879 httpwwwisochcated16387html

eXtensible Markup LanguageA meta-markup language

to define markup languagessuch as XHTML XSLT XML Schemahellip

A formally-defined text-based languageverifiable for well-formedness and validityusable across platform and technologies

4

What XML is not

XML is nota programming languagea network-transport protocola document presentation languagea database (manager)

It can be used (and it is actually) in all of those contexts but it remains a markup language

5

Why Markup Markup

encoding embodied in the document specifying document properties as well as properties of information contained

for instance formatting instructionsmore generally structural semantic information

knowledge vs dataMarks Markups

tag used to qualify label text chunkseg HTML tags

XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt

6

XML X for Basic idea of XML

a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages

ThenXML is quite free in generalit can be ldquoextended

actually specialisedto define more specific ad hoc markup languages

No predefined XML markups as it happens instead in HTML

they need to be defined7

Hey too many Application domains are more and more

numerouscomplexspecific

Special specialised languages as the engineers tools

to represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

8

XML ApplicationsXML per se is ldquosmallrdquo amp simple

languages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like

12

How to Work with

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML It can be

A text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

14

How does XML Who handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets

15

Where is XML

Everywhere already

16

Some History of XML Lot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGML

but had obvious limitationstoo complex

more than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

18

A Simple XML

ltplayergt Carlo Nervoltplayergt

19

XML Document amp

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervo

tags are markupthe most common form of markup but there are other kinds

content is character dataincluding the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

22

XML Trees: A Simple Example

player
  name: Carlo
  surname: Nervo
  team: Bologna
  team: Mantova

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

23

An XML Document is a Tree

An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have zero or more child elements
    each element except the root has exactly one parent
    child elements of the same parent are siblings
    leaves are either content or empty elements

Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect: overlapping is not allowed

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
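The tree vocabulary above (root, parent, children, siblings, leaves) can be tried out with a few lines of code. The following is only a minimal sketch, assuming Python and its standard `xml.etree.ElementTree` module, neither of which is part of the original slides; the helper names `parent_map` and `leaf_tags` are made up for illustration.

```python
import xml.etree.ElementTree as ET

PLAYER = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>"""

def parent_map(root):
    """Map every non-root element to its (single) parent element."""
    return {child: parent for parent in root.iter() for child in parent}

def leaf_tags(root):
    """Tags of the elements that have no child elements."""
    return [el.tag for el in root.iter() if len(el) == 0]

root = ET.fromstring(PLAYER)       # the root (document) element
parents = parent_map(root)
siblings = [child.tag for child in root]   # children of the same parent
```

Here `root` is the only element missing from `parents`, and every element in `siblings` has `root` as its parent.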

24

Narrative-Organised XML

  <biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.

  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

  …

  </biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated with an element

An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before and after the equals (=) sign
  only one attribute with a given name is allowed per element

Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and …

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
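As an illustration of attributes as name-value pairs attached to a node, here is a small sketch, again assuming Python's standard `xml.etree.ElementTree` (an assumption of this example, not something the slides use). The second `team` is deliberately written with single quotes and spaces around the equals sign to show that the alternative forms are equivalent; `teams` and `current_team` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

PLAYER = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current = 'no'>Mantova</team>
</player>"""

def teams(xml_text):
    """Map each team's text content to the value of its 'current' attribute."""
    root = ET.fromstring(xml_text)
    return {team.text: team.get("current") for team in root.findall("team")}

def current_team(xml_text):
    """Return the content of the first team whose 'current' attribute is 'yes'."""
    root = ET.fromstring(xml_text)
    for team in root.findall("team"):
        if team.get("current") == "yes":
            return team.text
    return None
```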

26

Using Elements or Attributes?

Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two

Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any …

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>

27

XML Names

XML Names are used, with the same rules, for the names of elements, attributes and some other constructs
  to increase efficiency and reduce complexity

An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:), which is reserved for namespaces

An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore
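The naming rules above can be approximated with a regular expression. This is a rough sketch, assuming Python's `re` module: it uses the Unicode-aware `\w` class as a stand-in for "letters, ideographs, digits and underscore", so it is only an approximation of the exact character ranges in the XML specification, and it rejects the colon on purpose since that is reserved for namespaces. The function name `looks_like_xml_name` is hypothetical.

```python
import re

# First character: a letter/ideograph or underscore (a word character that is not a digit).
# Following characters: letters, digits, underscore, period or hyphen.
_NAME = re.compile(r"[^\W\d][\w.\-]*")

def looks_like_xml_name(candidate):
    """Approximate check of the slide's rules for XML names (not the full spec)."""
    return _NAME.fullmatch(candidate) is not None
```

For example, `first_name` and `a.b-c` pass, while `1team`, `-x` and `foot ball` do not.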

28

Parsed Character Data

An XML parser interprets the character sequences it is fed with, trying to work out the tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?

All characters are interpreted as character data to be parsed
  unless the escape character & is encountered
  character data to parse starts again after the ; character

E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
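The round trip between escaped markup and parsed character data can be checked directly. The sketch below assumes Python's standard library (`xml.etree.ElementTree` for parsing, `xml.sax.saxutils.escape` for escaping); the helper names `element_text` and `wrap` are invented for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def element_text(xml_text):
    """Parse a one-element document and return its parsed character data."""
    return ET.fromstring(xml_text).text

def wrap(tag, raw_text):
    """Build an element whose content is raw_text, escaping &, < and >."""
    return "<{0}>{1}</{0}>".format(tag, escape(raw_text))

superheroes = element_text("<superheroes>Batman &amp; Robin</superheroes>")
script = wrap("script", "if (a < b && b < c) run();")
```

Parsing `script` again gives back the original JavaScript-like text, which is the point of escaping: the parser never sees a bare `<` or `&`.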

29

Entity References

&entity; reference
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference

Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

30

Pre-defined XML Entities

Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote

31

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show

CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its closing delimiter ]]>

After parsing, there is no way to tell whether a text came from a CDATA section or not
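The last point can be verified with a short experiment: the same text written once with entity references and once inside a CDATA section parses to identical character data. This is a minimal sketch assuming Python's `xml.etree.ElementTree`; the element name `code` and the helper `text_of` are illustrative only.

```python
import xml.etree.ElementTree as ET

ESCAPED = "<code>if (a &lt; b &amp;&amp; b &lt; c) run();</code>"
CDATA = "<code><![CDATA[if (a < b && b < c) run();]]></code>"

def text_of(xml_text):
    """Return the character data of a one-element document after parsing."""
    return ET.fromstring(xml_text).text
```

Both `text_of(ESCAPED)` and `text_of(CDATA)` yield the same string, so an application working on the parsed tree cannot tell which form was used in the source.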

32

Comments

Easy
  <!-- Comment -->

It cannot contain --, nor can it end with --->

Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment

Parsers may either drop or keep them at will

Comments are meant to improve human legibility of XML docs
  to give information to computational agents, use processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process

Processing instructions have this very purpose
  <?target … ?>

The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>

A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root
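To see comments and processing instructions as markup that sits outside the element tree, one can list the top-level nodes of a parsed document. The sketch below assumes Python's `xml.dom.minidom`, which happens to keep both kinds of node (other parsers may drop comments, as noted above); the style sheet name `player.css` and the function `top_level_markup` are hypothetical.

```python
from xml.dom import Node, minidom

DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="player.css"?>
<!-- a comment before the root element -->
<player>Carlo Nervo</player>"""

def top_level_markup(xml_text):
    """List the processing instructions, comments and elements at document level."""
    doc = minidom.parseString(xml_text)
    found = []
    for node in doc.childNodes:
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            found.append(("pi", node.target, node.data))
        elif node.nodeType == Node.COMMENT_NODE:
            found.append(("comment", node.data))
        elif node.nodeType == Node.ELEMENT_NODE:
            found.append(("element", node.tagName))
    return found
```

The processing instruction is reported with its target (`xml-stylesheet`) and its data as one opaque string: interpreting that string is left to the target application.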

34

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration

It is optional
  but if present, it must be the very first thing in the document
    not even comments are allowed before it

  <?xml version="1.0" encoding="utf-8" standalone="no"?>

Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 when a byte-order mark is present)
Standalone set to yes means that the document has no external DTD
  optional, default no

35

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …

Tools on the Web
  Just look around
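Any non-validating parser can serve as a well-formedness checker, since it must reject documents that break these rules. The following sketch assumes Python's `xml.etree.ElementTree`; the function `first_error` and the sample documents in `BROKEN` are made up for illustration, one per rule listed above.

```python
import xml.etree.ElementTree as ET

def first_error(xml_text):
    """Return None if xml_text is well-formed, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None

# One deliberately broken document per rule.
BROKEN = {
    "mismatched tags": "<player><name>Carlo</surname></player>",
    "overlapping elements": "<p><em><b>Wrong </em> XML</b></p>",
    "two root elements": "<player/><player/>",
    "unquoted attribute value": "<team current=yes>Bologna</team>",
    "duplicate attribute": '<team current="yes" current="no"/>',
    "unescaped ampersand": "<duo>Batman & Robin</duo>",
    "unescaped less-than": "<expr>a < b</expr>",
}

report = {rule: first_error(doc) for rule, doc in BROKEN.items()}
```

Every entry in `report` carries an error message, while `first_error("<player> Carlo Nervo</player>")` returns `None`.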

36

DTD

37

Flexibility or

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team

Document Type Definition (DTD)
  to define which XML documents are valid

Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional

38

Validation

A valid XML document includes a DTD that the document satisfies

Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples

Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid

Many things a DTD does not say
  we stick with what we can specify

39

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement

Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

40

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example and comment on it

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

41

DTD Declaration

The DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

42

DTD Declarations

So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

44

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
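DTD content models read much like regular expressions over child element names, and that analogy can be made concrete. The sketch below is not a DTD validator: it assumes Python's `re` and `xml.etree.ElementTree`, hand-translates the single content model `(name, surname, team+)` into a regular expression, and ignores text content and attributes. `PLAYER_MODEL` and `matches_player_model` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the content model (name, surname, team+):
# each child tag is written followed by one space, so the DTD ',' becomes
# plain concatenation and the '+' suffix carries over unchanged.
PLAYER_MODEL = re.compile(r"name surname (team )+")

def matches_player_model(xml_text):
    """Check the sequence of the root's child element names against the model."""
    root = ET.fromstring(xml_text)
    children = "".join(child.tag + " " for child in root)
    return PLAYER_MODEL.fullmatch(children) is not None

GOOD = ("<player><name>Carlo</name><surname>Nervo</surname>"
        "<team>Bologna</team><team>Mantova</team></player>")
NO_TEAM = "<player><name>Carlo</name><surname>Nervo</surname></player>"
WRONG_ORDER = ("<player><surname>Nervo</surname><name>Carlo</name>"
               "<team>Bologna</team></player>")
```

With the same encoding, `|` would map to regular-expression alternation, and the `*` and `?` suffixes would map to themselves.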

45

Attribute Declarations

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

46

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

47

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS

48

Other DTD

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

49

Namespaces

50

What are Namespaces for

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished

Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled

Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

51

Syntax for Namespaces

Qualified names
  prefix:local_part

Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template

Used for both element and attribute names

52

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">

URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not …

53

Setting Default Namespaces

xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>

All the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
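That the URI, and not the prefix, identifies a namespace can be shown by parsing the same content written with an explicit prefix, with a different prefix, and with a default namespace. This is a small sketch assuming Python's `xml.etree.ElementTree`, which reports each element name in the expanded form `{namespace-URI}local_part`; the `rect` child and the helper `expanded_names` are illustrative choices.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

PREFIXED = '<svg:svg xmlns:svg="%s"><svg:rect/></svg:svg>' % SVG_NS
OTHER_PREFIX = '<g:svg xmlns:g="%s"><g:rect/></g:svg>' % SVG_NS
DEFAULTED = '<svg xmlns="%s"><rect/></svg>' % SVG_NS
NO_NAMESPACE = "<svg><rect/></svg>"

def expanded_names(xml_text):
    """Return every element name as {namespace-URI}local_part, in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]
```

The first three documents yield the same expanded names, while `NO_NAMESPACE` yields plain `svg` and `rect`, which a namespace-aware application would treat as different elements.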

54

Internationalisation

55

What does Text

"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
    ASCII being the most (in)famous, now Unicode

A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text document
  so encoding should be declared

56

The XML Encoding Declaration

Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>

Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)

See also XML-Defined Character Sets
  Unicode and ISO are the most used families

Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
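The effect of the encoding declaration can be observed by serialising the same document in two encodings: the bytes differ, but a parser that honours the declaration recovers the same characters. The sketch assumes Python's `xml.etree.ElementTree`; the element name `university` and the two helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

def encode_document(text, encoding):
    """Serialise a one-element document to bytes, declaring the encoding used."""
    xml_text = '<?xml version="1.0" encoding="%s"?><university>%s</university>' % (
        encoding, text)
    return xml_text.encode(encoding)

def decode_document(data):
    """Parse the bytes; the parser reads the declared encoding by itself."""
    return ET.fromstring(data).text

as_utf8 = encode_document("Università", "utf-8")
as_latin1 = encode_document("Università", "iso-8859-1")
```

The character `à` is code point 224 in both cases; UTF-8 maps it onto two bytes and ISO-8859-1 onto one, which is why the declaration has to travel with the document.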

57

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc

How to determine the language of a subpart?
  for multi-lingual docs

xml:lang attribute
  can be associated to any element
  determines the language of the element

Values are to be found in ISO 639
  standard: two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-

58
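A tool such as the spell-checker mentioned above needs the effective language of each element. The sketch below assumes Python's `xml.etree.ElementTree`, where the `xml:lang` attribute appears under the reserved XML namespace URI. It also relies on a rule from the XML specification that the slides do not spell out: `xml:lang` applies to the whole content of an element unless a descendant overrides it. The sample document and the function `languages` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound by definition to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = ('<doc xml:lang="en"><p>Hello</p>'
       '<p xml:lang="it">Ciao <em>mondo</em></p></doc>')

def languages(xml_text, default=None):
    """Return (tag, effective language) pairs in document order."""
    def walk(element, inherited):
        lang = element.get(XML_LANG, inherited)   # own value, else the parent's
        yield element.tag, lang
        for child in element:
            yield from walk(child, lang)
    return list(walk(ET.fromstring(xml_text), default))
```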

Encoding for Portability

Working around encoding is not simply an "internationalisation" issue
  it is also about portability

When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch

XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

59

XML & CSS

60

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document

XML, however, can be transformed
  to become easier to handle by standard browsers

Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL

In the following, we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents

Standard W3C
  http://w3c.org/Style/CSS

Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure

In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

62

CSS An Example

63

XML + CSS

Any XML document can be prepared for browser visualisation via CSS

Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet

Processing directive
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css" ?>

CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM & SAX

68

Manipulating XML

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios

Most often we need to manipulate
  access, delete, modify
parts of an XML document
  which may or may not be an XML file

This is typically done through programming languages of many sorts
  through ad hoc APIs

The most used, hated, deprecated, widespread are DOM and SAX

69

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual

"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"

It applies to HTML as well as XML

It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages

There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

70

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript

DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption

It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

71

DOM Nodes

An XML document is a tree

The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.

The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation

It also defines which kinds of child nodes each of them should or could have

72

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type

There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes

Then, any specific kind of node has its own specific properties and methods

These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java

73
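Since DOM is language-neutral, the general node properties and methods named above can be tried in any binding. The following sketch assumes Python's `xml.dom.minidom`, one such binding and not the Java API the slides refer to; `add_team` and `describe` are hypothetical helpers.

```python
from xml.dom import minidom

def add_team(xml_text, team_name, current):
    """Parse a player document and append a new team element to the root."""
    doc = minidom.parseString(xml_text)
    team = doc.createElement("team")
    team.setAttribute("current", current)
    team.appendChild(doc.createTextNode(team_name))
    doc.documentElement.appendChild(team)   # placed after the existing children
    return doc

def describe(node):
    """Every DOM node has a name, a value and a type."""
    return node.nodeName, node.nodeValue, node.nodeType

doc = add_team("<player><name>Carlo</name></player>", "Bologna", "yes")
new_team = doc.documentElement.lastChild
```

An element node has its tag as name and no value, while its text child has the conventional name `#text` and the character data as value.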

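A Simple Java DOM Example

  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;  // assumed: the slide does not show which DOMParser is used
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // move to the first child of the root element that is named "recipe"
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }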

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?

This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

75

Simple API for XML

Differently from DOM, SAX is event-based

It sees the document not as a tree but as a text doc
  flowing through the SAX parser
  and generating events such as document started or ended, element started or ended, character content, etc.

A very simple model
  good for simple applications
  and also to avoid memory abuse

Not as well supported as DOM is
  in terms of standardisation
  as well as of tools
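The event-based model can be illustrated with a handler that simply records the events it receives. This sketch assumes Python's standard `xml.sax` module, one binding of SAX among many; the class `EventLog` and the function `sax_events` are invented for this example, and white-space-only character events are dropped to keep the log short.

```python
import xml.sax

class EventLog(xml.sax.ContentHandler):
    """Record each SAX event as a short string; no tree is ever built."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        if content.strip():
            self.events.append("text " + content.strip())

def sax_events(xml_text):
    """Feed the document through the SAX parser and return the recorded events."""
    handler = EventLog()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events

events = sax_events("<player><name>Carlo</name></player>")
```

The handler sees the document as a stream: an application that only needs, say, the first `name` can act on that event and ignore the rest, without holding the whole document in memory.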

76

How does XML Who handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets

15

Where is XML

Everywhere already

16

Some History of XML Lot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGML

but had obvious limitationstoo complex

more than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

18

A Simple XML

ltplayergt Carlo Nervoltplayergt

19

XML Document amp

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervo

tags are markupthe most common form of markup but there are other kinds

content is character dataincluding the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed

24

Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML AttributesElements can be labelled by attributes

attributes are specified in the start tagand in the only tag of empty elements

any number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=value

alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML document

but they are qualifiers for the nodes and

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the two

Element-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore

28

Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsed

unless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

CommentsEasy

lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
    everything not permitted is forbidden
    that is, a DTD specifies only what is allowed
Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
There are many things a DTD cannot express
    we stick with what we can specify
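To make the distinction between well-formed and valid concrete, here is a minimal sketch of a validating parse with the JDK. It assumes a DTD along the lines of the football player example used in this section; the class name DtdValidationDemo and the helper validityErrors are invented for the illustration. Validity problems arrive as recoverable errors, so the handler collects them instead of stopping.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DtdValidationDemo {
    // Internal DTD subset; the DOCTYPE name matches the root element
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Returns the validity errors reported for the given player element (empty list: valid). */
    static List<String> validityErrors(String playerElement) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask for DTD validation, not just well-formedness
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + playerElement)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed, so validity cannot be checked", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validityErrors(
            "<player><name>Carlo</name><surname>Nervo</surname>"
          + "<team current=\"yes\">Bologna</team></player>"));
        // No team at all: well-formed, but not valid
        System.out.println(validityErrors(
            "<player><name>Carlo</name><surname>Nervo</surname></player>"));
    }
}
```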

DTD is…

SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD

We do not go too deep into DTD syntax
    we just look at the example below and comment on it

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declaration

In the example above the DTD is declared as internal
    but it could be declared separately
        <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
        <!DOCTYPE player SYSTEM "http://…">
In both cases the name after DOCTYPE must match the root element of the document

DTD Declarations

So you may
    define your own DTD, and
        either include it in your XML document
        or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
        like a working group you belong to, or a standardisation body of any sort
        by referring to that externally-defined syntax in your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams
    in that precise order
    and name, surname and team just contain parsed character data (#PCDATA)

    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>

Some Syntax

, is for sequence
    to define ordered lists
| is for choice
    to provide for alternatives
Suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
Parentheses for grouping
    at any level of nesting
    operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements
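The operators above can be tried out by validating a few documents against one content model. In this sketch the nickname and national elements, like the class name ContentModelDemo, are hypothetical additions to the player example, introduced only to exercise ?, | and *.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class ContentModelDemo {
    // Sequence (,), optional item (?), and a repeated (*) choice (|) inside parentheses
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, nickname?, (team | national)*)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT nickname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ELEMENT national EMPTY>\n"
        + "]>\n";

    /** True when the given player element is valid against the DTD above. */
    static boolean isValid(String playerElement) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Treat validity errors like fatal ones: stop at the first problem
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + playerElement)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base = "<name>Carlo</name><surname>Nervo</surname>";
        System.out.println(isValid("<player>" + base + "</player>"));
        System.out.println(isValid("<player>" + base
                + "<team>Bologna</team><national/><team>Mantova</team></player>"));
        System.out.println(isValid("<player><surname>Nervo</surname><name>Carlo</name></player>"));
    }
}
```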

Attribute

A team element has a current attribute
    which is mandatory (#REQUIRED)
        #IMPLIED would say optional instead
    and can be either yes or no
        enumeration as an attribute type

    <!ATTLIST team current (yes | no) #REQUIRED>

Attribute Defaults

#IMPLIED
    the attribute is optional
#REQUIRED
    the attribute is mandatory
#FIXED
    whether it is explicitly specified or not, the attribute always has the given value
literal
    the default value is the literal quoted string
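A parser that reads the DTD fills in defaulted and fixed attributes on its own, which can be observed through DOM. The sketch below is illustrative only: the sport and since attributes and the class name AttributeDefaultsDemo are assumptions, not part of the original example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class AttributeDefaultsDemo {
    static final String XML =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (team+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\"\n"               // literal default
        + "                 sport   CDATA      #FIXED \"football\"\n"  // fixed value
        + "                 since   CDATA      #IMPLIED>\n"            // optional, no default
        + "]>\n"
        + "<player><team current=\"yes\">Bologna</team><team>Mantova</team></player>";

    /** Parses the sample and returns the team element at the given position. */
    static Element team(int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return (Element) doc.getElementsByTagName("team").item(index);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element mantova = team(1);
        // Not written in the document, supplied from the DTD
        System.out.println(mantova.getAttribute("current"));
        System.out.println(mantova.getAttribute("sport"));
        // getSpecified() tells written attributes apart from defaulted ones
        System.out.println(mantova.getAttributeNode("current").getSpecified());
        System.out.println(mantova.hasAttribute("since"));
    }
}
```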

Attribute Types

CDATA
    any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
    like an XML name, but any name character is accepted as the first character
    the plural form accepts more than one token, separated by white space
ENTITY, ENTITIES
    name(s) of unparsed entities declared in the DTD
ID
    an XML name, unique in the document, working as an identifier
IDREF, IDREFS
    reference(s) to the ID of some element in the document

Other DTD Declarations

ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
    who cares, actually
We stop here
    more only for those who need it
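The ENTITY declaration shown on the slide points at an external resource. To keep things self-contained, the sketch below uses an internal entity instead, which is an assumption of this example, and shows that the parser replaces the reference with its declared value. The page and foot elements and the class name EntityDemo are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntityDemo {
    // An internal general entity, declared once and referenced as &footer;
    static final String XML =
          "<!DOCTYPE page [\n"
        + "  <!ENTITY footer \"Alma Mater Studiorum\">\n"
        + "]>\n"
        + "<page><body>XML Concepts</body><foot>&footer;, 2006-2007</foot></page>";

    /** Returns the text content of the first element with the given tag name. */
    static String textOf(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tagName).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The entity reference has been expanded by the parser
        System.out.println(textOf(XML, "foot"));
    }
}
```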

Namespaces

What are

Distinguish
    different XML applications may use the same names
        at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
Group
    names of elements and attributes of the same XML application can be grouped together
        to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names

Syntax for

Qualified names
    prefix:local_part
Examples of qualified names
    or QNames, or raw names
    rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes

Example
    a large firm could have a number of namespaces for different purposes
        <company xmlns:local="http://www.company.it/xml"
                 xmlns:euro="http://www.company.eu/xml"
                 xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world as prefixes everywhere inside the company element
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
Namespace URIs are what is standardised, not prefixes
    but by convention svg, rdf and other well-known prefixes are usually kept as they are
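That the URI, not the prefix, identifies a namespace can be checked with a namespace-aware parser. In this sketch the office element, the alternative prefix it, and the class name NamespacePrefixDemo are invented for illustration; the namespace URIs are the ones from the company example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NamespacePrefixDemo {
    static final String IT_NS = "http://www.company.it/xml";
    static final String COM_NS = "http://www.company.com/xml";

    static final String SAMPLE =
          "<company xmlns:local=\"" + IT_NS + "\" xmlns:world=\"" + COM_NS + "\">"
        + "<local:office>Cesena</local:office>"
        + "<world:office>Boston</world:office>"
        + "</company>";

    /** Returns the text of every office element that belongs to the given namespace URI. */
    static List<String> officesIn(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList offices = doc.getElementsByTagNameNS(namespaceUri, "office");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < offices.getLength(); i++) {
                result.add(offices.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(officesIn(SAMPLE, IT_NS));
        // Same namespace, different prefix: the lookup by URI still matches
        String renamed = "<company xmlns:it=\"" + IT_NS + "\"><it:office>Cesena</it:office></company>";
        System.out.println(officesIn(renamed, IT_NS));
    }
}
```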

Setting Default Namespaces

xmlns attribute
    alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
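A small check of what a default namespace does and does not cover, again with the JDK's namespace-aware DOM parser. The rect child and the numeric width and height values are placeholders chosen for this sketch, as is the class name DefaultNamespaceDemo.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class DefaultNamespaceDemo {
    static final String SVG_NS = "http://www.w3.org/2000/svg";

    static final String XML =
          "<svg xmlns=\"" + SVG_NS + "\" width=\"100\" height=\"50\">"
        + "<rect width=\"10\" height=\"10\"/>"
        + "</svg>";

    /** Parses the sample with namespace support and returns the svg root element. */
    static Element root() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element svg = root();
        Element rect = (Element) svg.getFirstChild();
        System.out.println(svg.getNamespaceURI());   // the default namespace
        System.out.println(svg.getPrefix());         // no prefix was needed
        System.out.println(rect.getNamespaceURI());  // inherited by the child element
        // Unprefixed attributes are not placed in the default namespace
        System.out.println(svg.getAttributeNode("width").getNamespaceURI());
    }
}
```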

Internationalisation

What does Text

"Text" can be encoded according to so many different alphabets
    a character set is a mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
    so its encoding should be declared
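The difference between a code point and its encodings is easy to see on a single accented letter. A minimal sketch follows; the class name CodePointsAndBytesDemo and the choice of the letter are arbitrary, and UTF-16BE is used rather than plain UTF-16 so that no byte-order mark is added.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CodePointsAndBytesDemo {
    /** Encodes the text with the given charset and renders the bytes as hex pairs. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8"; // LATIN SMALL LETTER E WITH GRAVE
        // One character, one code point...
        System.out.printf("code point U+%04X%n", eGrave.codePointAt(0));
        // ...but different byte sequences depending on the encoding
        System.out.println("UTF-8      " + hex(eGrave, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE   " + hex(eGrave, StandardCharsets.UTF_16BE));
        System.out.println("ISO-8859-1 " + hex(eGrave, StandardCharsets.ISO_8859_1));
    }
}
```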

The XML Encoding

Part of the XML Declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
    Unicode and ISO are the most used families
Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
        it is then a text declaration, no longer an XML declaration
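Why the encoding declaration matters shows up as soon as a document is not stored in UTF-8. The sketch below feeds the JDK parser the same Latin-1 bytes with and without the declaration; the city element, its content and the class name EncodingDeclarationDemo are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class EncodingDeclarationDemo {
    /** Returns the text of the root element, or null when the parser rejects the bytes. */
    static String rootText(byte[] document) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new ByteArrayInputStream(document))
                    .getDocumentElement()
                    .getTextContent();
        } catch (SAXException | IOException e) {
            return null; // the bytes could not be decoded or parsed
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stores the text as Latin-1 bytes, one byte per character. */
    static byte[] latin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String body = "<city>Forl\u00EC</city>";
        // Declared encoding: the parser decodes the accented letter correctly
        System.out.println(rootText(latin1("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)));
        // Same bytes, no declaration: the parser assumes UTF-8 and the byte sequence is invalid
        System.out.println(rootText(latin1(body)));
    }
}
```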

Multi-Lingual

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
    for multi-lingual docs
xml:lang attribute
    can be associated to any element
    determines the language of the element and of its content, unless a nested element overrides it
Values are to be found in ISO 639
    standard two letters for each language known
    if not there, IANA
        prefix i-
        such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
        prefix x-
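A tool such as the spell-checker mentioned above needs the language in effect for a given element, which means looking at the element itself and then at its ancestors. A minimal sketch of that lookup, assuming the invented biography sample and class name XmlLangDemo:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XmlLangDemo {
    static final String XML =
          "<biography xml:lang=\"en\">"
        + "<p>He finally moved to Bologna</p>"
        + "<motto xml:lang=\"it\">Forza Bologna</motto>"
        + "</biography>";

    /** Returns the nearest xml:lang value on the element or its ancestors, or "" if none is set. */
    static String languageOf(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute("xml:lang")) {
                return e.getAttribute("xml:lang");
            }
        }
        return "";
    }

    /** Convenience wrapper: language in effect for the first element with the given tag name. */
    static String languageOfFirst(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return languageOf((Element) doc.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(languageOfFirst(XML, "p"));     // inherited from biography
        System.out.println(languageOfFirst(XML, "motto")); // overridden locally
    }
}
```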

Encoding for

Working around encoding is not simply an "internationalisation" issue
    it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
    which are often not easy to catch
XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
    when trying to visualise an XML document
XML, however, can be transformed
    to become easier to handle by standard browsers
Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
In the following, we explore the XML + CSS approach

Cascading Style Sheets

Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
    http://www.w3.org/Style/CSS/
Goals
    describing how to present elements of a document
        spanning over a range of different media
    separating style description from content and structure
In this course, we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
Processing instruction
    to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML

Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access, delete, modify)
    parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
    through ad hoc APIs
The most used, hated, deprecated, widespread are
    DOM and SAX

Document Object Model

http://www.w3.org/DOM/
    standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
        primitive navigation and manipulation of XML trees
        other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
        adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes each of them should or could have
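To see several of these node kinds side by side, the sketch below walks a small document and prints one line per node. The sample document, the audit processing instruction and the class name NodeKindsDemo are made up for this illustration; only some of the node types are labelled explicitly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NodeKindsDemo {
    static final String XML =
          "<!-- squad --><player><name>Carlo</name><?audit ok?>"
        + "<team current=\"yes\"><![CDATA[Bologna & co]]></team></player>";

    /** Human-readable label for the most common DOM node types. */
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.DOCUMENT_TYPE_NODE: return "doc type";
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE: return "text";
            case Node.CDATA_SECTION_NODE: return "CDATA section";
            case Node.COMMENT_NODE: return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default: return "other";
        }
    }

    /** Depth-first walk; attributes are not children, so they do not show up here. */
    static void walk(Node n, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        out.add(line.append(kind(n)).append(": ").append(n.getNodeName()).toString());
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    static List<String> outline(String xml) {
        try {
            List<String> out = new ArrayList<>();
            walk(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))), 0, out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        outline(XML).forEach(System.out::println);
    }
}
```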

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
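In the Java binding these properties become getter methods. The following sketch, with the assumed class name NodeApiDemo and helper addTeam, reads the name, value, type and attributes of a node and then appends a new child, using only the org.w3c.dom API that ships with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

final class NodeApiDemo {
    /** Parses a small player document to work on. */
    static Document sample() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
                    new InputSource(new StringReader(
                            "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>")));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a new team element and appends it after the existing children of the root. */
    static Element addTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.appendChild(doc.createTextNode(name));
        doc.getDocumentElement().appendChild(team);
        return team;
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element team = (Element) doc.getElementsByTagName("team").item(0);
        // name, value and type are available on every node; elements have a null value
        System.out.println(team.getNodeName() + " / " + team.getNodeValue() + " / " + team.getNodeType());
        // the attributes property, as a map of attribute nodes
        NamedNodeMap attrs = team.getAttributes();
        System.out.println(attrs.item(0).getNodeName() + "=" + attrs.item(0).getNodeValue());
        addTeam(doc, "Mantova", "no");
        System.out.println(doc.getElementsByTagName("team").getLength());
    }
}
```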

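A Simple Java DOM

    // Needs: java.io.PrintStream, org.w3c.dom.Document, org.w3c.dom.Node,
    // and a DOMParser class (presumably the one from Apache Xerces).
    // print(Node, PrintStream) is a helper that is not shown on the slide.
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // skip the children of the root until the first recipe element
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }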

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events such as document started / ended, element started / ended, character content, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
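The event-based model can be sketched with the SAX parser included in the JDK. The handler below just logs the events it receives; the class name SaxEventsDemo is an assumption of this example. Since a parser may deliver character content in several chunks, the text is buffered and logged at the next element boundary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventsDemo {
    /** Parses the text and returns the sequence of SAX events as readable strings. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // Logs the buffered character content, if any, and clears the buffer
            private void flushText() {
                String s = text.toString().trim();
                if (!s.isEmpty()) {
                    log.add("text " + s);
                }
                text.setLength(0);
            }
            @Override public void startDocument() { log.add("start document"); }
            @Override public void endDocument() { log.add("end document"); }
            @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
                flushText();
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                flushText();
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>")
                .forEach(System.out::println);
    }
}
```

No tree is ever built here: once an event has been handled, the parser keeps nothing about it, which is where the memory savings over DOM come from.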

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone else

like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute value

NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFS

48

Other DTD

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

50

What are Distinguish

different XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes Example

a large firm could have a number of namespaces for different purposes

ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not

53

Setting Default

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-

58
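According to the XML specification, the language declared with xml:lang also applies to the content of the element, so nested elements inherit it unless they override it. A minimal sketch of how a tool such as a spell-checker might look the language up, assuming the JAXP DOM API; XmlLangDemo, languageOf and languageOfFirst are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlLangDemo {
    /** Walks from the element up to the root and returns the nearest xml:lang value. */
    static String languageOf(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute("xml:lang")) {
                return e.getAttribute("xml:lang");
            }
        }
        return ""; // no language information available
    }

    /** Parses the document and returns the language of the first element with that tag name. */
    static String languageOfFirst(String xml, String tagName) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return languageOf((Element) root.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Illustrative document: English by default, with one Italian fragment.
        String doc = "<doc xml:lang='en'><p>Hello</p><quote xml:lang='it'>Ciao</quote></doc>";
        System.out.println(languageOfFirst(doc, "p"));     // en (inherited from doc)
        System.out.println(languageOfFirst(doc, "quote")); // it (declared locally)
    }
}
```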

Encoding for

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

59

XML & CSS

60

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS approach

61

Cascading Style

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://w3c.org/Style/CSS
Goals
  describing how to present the elements of a document
  spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

62

CSS An Example

63

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining the presentation style for the XML document tags
  tagname { property1: value1; … }

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM & SAX

68

Manipulating XML

Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Most often we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

69

Document Object

http://www.w3.org/DOM
  W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

70

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features

71

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

72

Properties &

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java

73
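The general node properties and methods listed above can be tried out with the DOM implementation bundled with the JDK. A minimal sketch, assuming the JAXP API (javax.xml.parsers and org.w3c.dom); DomWalkDemo, parse and countElements are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalkDemo {
    /** Builds the in-memory DOM tree for the given XML text. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    /** Counts the element nodes in the subtree rooted at the given node. */
    static int countElements(Node node) {
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current='yes'>Bologna</team></player>");
        Element root = doc.getDocumentElement();
        System.out.println(root.getNodeName());                  // player
        System.out.println(root.getFirstChild().getNodeName());  // name
        Node team = root.getLastChild();
        // attributes: a map of all the attribute nodes of the element
        System.out.println(team.getAttributes().getNamedItem("current").getNodeValue()); // yes
        System.out.println(countElements(doc));                  // 4
        // appendChild adds the new node after the existing children
        root.appendChild(doc.createElement("team"));
        System.out.println(countElements(doc));                  // 5
    }
}
```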

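A Simple Java DOM

// DOMParser is the Xerces parser (org.apache.xerces.parsers.DOMParser);
// Document and Node come from org.w3c.dom, PrintStream from java.io.
// print(Node, PrintStream) is a helper that is not shown on the slide.
public static void main(String[] args) {
    try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);
        Document doc = p.getDocument();
        // Scan the children of the root until the first recipe element is found
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
            n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
            print(n, out);
        out.println("</collection>");
    } catch (Exception e) {
        e.printStackTrace();
    }
}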

74

Main Problem of

The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

75

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
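The event-based model can be sketched with the SAX parser shipped with the JDK. The handler below only records the events it receives, in the order in which the parser generates them; SaxEventsDemo and events are illustrative names, while DefaultHandler and SAXParserFactory belong to the standard org.xml.sax and javax.xml.parsers packages.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    /** Streams the document through a SAX parser and returns the events observed. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several calls.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: the handler sees each piece of the document once, and whatever it does not store is gone, which is what keeps memory consumption low.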

76


What XML is not

XML is not
  a programming language
  a network-transport protocol
  a document presentation language
  a database (manager)
It can be used (and actually is) in all of those contexts, but it remains a markup language

5

Why Markup

Markup
  encoding embodied in the document, specifying document properties as well as properties of the information contained
    for instance, formatting instructions
    more generally, structural and semantic information
  knowledge vs. data
Marks / Markups
  tags used to qualify / label text chunks
  e.g. HTML tags
XML example
  <student>
    <studentname>
      <name>Carlo</name>
      <surname>Nervo</surname>
    </studentname>
    <studentnumber>0000145678</studentnumber>
  </student>

6

XML X for

Basic idea of XML
  a simple meta-language for humans and automata
  to build electronic documents
  allowing users to define ad hoc markup languages
Then
  XML is quite free in general
  it can be "extended"
    actually, specialised
    to define more specific ad hoc markup languages
No predefined XML markup, as happens instead in HTML
  tags need to be defined

7

Hey too many

Application domains are more and more
  numerous
  complex
  specific
Special / specialised languages as the engineer's tools
  to represent, denote & express behaviours and computations
Engineers working with computational / ICT systems will be called to use a number of different artificial languages, but also
  to know and understand computational models and paradigms
  to select languages and paradigms
  to define and build new languages

8

XML Applications

XML per se is "small" & simple
  languages defined via XML are instead many and complex
XML Applications
  XML-defined markup languages
  defined through a precise syntax
    DTD or XML Schema
  they may be either standard or custom
Most standard XML applications are from W3C
  such as XSLT, XML Schema, XHTML

9

XML for Portable

Cross-platform, long-term data format
  passing XML data through space and time
  along with Unicode and a text-based standard format
Text, text, text
  both data and markup
  all in the XML file
XML document structure is simple & clear
  easy to parse
  well-documented
That is why XML is already everywhere

10

How XML Looks like

<?xml version="1.0" encoding="utf-8"?>
<docroot>
  <head>
    <title>This is my document</title>
  </head>
  <body>
    <p>A list of things I like</p>
    <list>
      <item>weekends</item>
      <item>good beer</item>
      <item>midnight snacks</item>
      <item>ice cream
        <list>
          <item>chocolate</item>
          <item>cookie dough</item>
          <item>white russian</item>
        </list>
      </item>
      <item>shade trees</item>
    </list>
  </body>
</docroot>

11

How XML Looks like

12

How to Work with

XML is text
  so any text editor is perfectly fine
A number of XML editors are around
  but general text editors with some programming / Web-oriented capabilities are typically good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works

13

What is an XML

It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text

<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber>0000145678</studentnumber>
  <course>2036</course>
</student>

14


How does XML

Who handles an XML document
  after it has been produced?
  how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets

15

Where is XML

Everywhere already

16

Some History of XML

A lot is still to be written…
SGML is where it comes from
  HTML was the first successful application of SGML
    but had obvious limitations
  SGML itself was too complex
    more than 150 pages
    never implemented fully
    too complex for the Internet
SGML "Lite" (1996, Bosak, Bray et al.)
XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

17

XML Fundamentals

18

A Simple XML

<player>
  Carlo Nervo
</player>

19

XML Document &

This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

<player>
  Carlo Nervo
</player>

20

XML Elements &

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

<player>
  Carlo Nervo
</player>

21

Tag Syntax

Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by the end tag </Player>
  NOTE: thus, pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

22

XML Trees: A Simple Example

player
  name: Carlo
  surname: Nervo
  team: Bologna
  team: Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

23

An XML Document is

An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each node / element can have one or more child elements
  each element other than the root has exactly one parent
  child elements of the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
  nesting needs to be perfect, overlapping is not allowed

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

24

Narrative-Organised

<biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.

  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

  …
</biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and…

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

26

Using Elements or

Attributes are for meta-data about the element, and content is the information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any…

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

27

XML Names

XML names are used, and are the same, for the names of elements, attributes, and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore

28

Parsed Character

An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless the escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

29

Entity References

&entityreference;
an entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

30

Pre-defined XML

Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
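When XML is produced by a program, the five pre-defined entities are what makes arbitrary text safe to embed as character data or as an attribute value. A minimal sketch in plain Java; EscapeDemo and escape are illustrative names, not a library API.

```java
class EscapeDemo {
    /** Replaces the five characters that have a pre-defined entity with their references. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("<superheroes>" + escape("Batman & Robin") + "</superheroes>");
        // prints <superheroes>Batman &amp; Robin</superheroes>
        System.out.println(escape("if (a < b) print(\"ok\")"));
        // prints if (a &lt; b) print(&quot;ok&quot;)
    }
}
```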

31

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser: do not parse this
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter
After parsing, there is no way to tell whether a text came from a CDATA section or not
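The last point can be checked by comparing the text content obtained from a CDATA section with the one obtained from escaped character data. This is a sketch assuming the JAXP DOM API; CdataDemo and textOf are illustrative names. Note that a DOM tree may still keep CDATA sections as separate nodes unless the parser is asked to coalesce them; the text delivered to the application is the same either way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class CdataDemo {
    /** Returns the text content of the root element of the given document. */
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String withCdata = "<code><![CDATA[if (a < b && c) run();]]></code>";
        String escaped = "<code>if (a &lt; b &amp;&amp; c) run();</code>";
        System.out.println(textOf(withCdata)); // if (a < b && c) run();
        System.out.println(textOf(escaped));   // if (a < b && c) run();
        System.out.println(textOf(withCdata).equals(textOf(escaped))); // true
    }
}
```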

32

Comments

Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them, at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions

33

XML Processing

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root

34

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be absolutely the first thing in the document
  not even comments are allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode
Standalone means that it has no external DTD
  optional, default no

35

Checking Well-

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
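Any XML parser can act as a well-formedness checker, since it must reject a document that breaks these rules. A minimal sketch, assuming the JAXP DOM parser bundled with the JDK; WellFormedDemo and isWellFormed are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedDemo {
    /** Returns true if the text parses as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to the console.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser found a well-formedness error
        } catch (Exception e) {
            throw new IllegalStateException("parser not available: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<em><b>Right</b> XML</em>"));        // true
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>"));       // false: overlapping
        System.out.println(isWellFormed("<player>Carlo</Player>"));           // false: case mismatch
        System.out.println(isWellFormed("<team current=yes>Bologna</team>")); // false: unquoted value
        System.out.println(isWellFormed("<a/><b/>"));                         // false: two roots
    }
}
```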

36

DTD

37

Flexibility or

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle errors is optional

38

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD does not say
  we stick with what we can specify

39

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still

40

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example and comment on it

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

41

DTD Declaration

The DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
The name after DOCTYPE must match the root element of the document

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

42

DTD Declarations

So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

44

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements

45

Attribute

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
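The element and attribute declarations above can be exercised with a validating parser. The sketch below assumes the JAXP DOM parser bundled with the JDK, which reports validity problems to the registered error handler as recoverable errors; DtdValidationDemo and validityErrors are illustrative names, and the DOCTYPE name is written as player so that it matches the root element, as validity requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdValidationDemo {
    // Internal DTD subset modelled on the player example.
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Validates the given root element against the DTD and returns the validity errors. */
    static List<String> validityErrors(String rootElement) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + rootElement)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + e.getMessage(), e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current='yes'>Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        String noCurrent = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team>Bologna</team></player>";
        System.out.println(validityErrors(valid).isEmpty());     // true
        System.out.println(validityErrors(noTeam).isEmpty());    // false: team+ needs at least one team
        System.out.println(validityErrors(noCurrent).isEmpty()); // false: current is #REQUIRED
    }
}
```

The two invalid documents are still well-formed, so parsing completes; how to react to the reported errors is left to the application.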

46

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

47

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more liberal than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS

48

Other DTD

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

49

Namespaces

50

What are

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

51

Syntax for

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

52



Why Markup Markup

encoding embodied in the document specifying document properties as well as properties of information contained

for instance formatting instructionsmore generally structural semantic information

knowledge vs dataMarks Markups

tag used to qualify label text chunkseg HTML tags

XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt

6

XML X for Basic idea of XML

a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages

ThenXML is quite free in generalit can be ldquoextended

actually specialisedto define more specific ad hoc markup languages

No predefined XML markups as it happens instead in HTML

they need to be defined7

Hey too many Application domains are more and more

numerouscomplexspecific

Special specialised languages as the engineers tools

to represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

8

XML ApplicationsXML per se is ldquosmallrdquo amp simple

languages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable Data

Cross-platform, long-term data format
  passing XML data through space and time
  along with Unicode and text-based standard formats
Text, text, text
  both data and markup
  all in the XML file
XML document structure: simple & clear
  easy to parse
  well-documented
That is why XML is already everywhere

10

How XML Looks Like

  <?xml version="1.0" encoding="utf-8"?>
  <docroot>
    <head>
      <title>This is my document</title>
    </head>
    <body>
      <p>A list of things I like</p>
      <list>
        <item>weekends</item>
        <item>good beer</item>
        <item>midnight snacks</item>
        <item>ice cream
          <list>
            <item>chocolate</item>
            <item>cookie dough</item>
            <item>white russian</item>
          </list>
        </item>
        <item>shade trees</item>
      </list>
    </body>
  </docroot>

11

How XML Looks like

12

How to Work with XML

XML is text
  so any text editor is perfectly fine
A number of XML editors around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so...
We need to understand
  what an XML document is
  how XML works

13

What is an XML Document?

It can be
  a text file
  a record in a database
  a run-time construction in memory
  ...
In any case, it can be handled and transmitted by any system capable of dealing with text

  <student>
    <studentname>
      <name>Carlo</name>
      <surname>Nervo</surname>
    </studentname>
    <studentnumber>0000145678</studentnumber>
    <course>2036</course>
  </student>

14


How does XML Work?

Who handles XML documents
  after they have been produced?
  how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets

15

Where is XML?

Everywhere, already

16

Some History of XML

Lot to be written, still...
SGML is where it comes from
  HTML was the first successful application of SGML
    but had obvious limitations
  too complex
    more than 150 pages
    never implemented fully
  too complex for the Internet
SGML "Lite" (1996, Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

17

XML Fundamentals

18

A Simple XML Document

  <player>
    Carlo Nervo
  </player>

19

XML Document & Files

This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

  <player>
    Carlo Nervo
  </player>

20

XML Elements & Tags

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

  <player>
    Carlo Nervo
  </player>

21

Tag Syntax

Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus, pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, ...

22

XML Trees: A Simple Example

  player
    name
      Carlo
    surname
      Nervo
    team
      Bologna
    team
      Mantova

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

23

An XML Document is a Tree

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each node / element can have one or more child elements
  each element has exactly one parent, except the root, which has none
  child elements of the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong</em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed
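The tree view can be made concrete with a few lines of Java. What follows is only a sketch, assuming the DOM parser bundled with the JDK (javax.xml.parsers); the class name XmlTreeSketch and its methods are made up for illustration and do not come from the slides. It parses the player document and prints one line per element, indented by depth.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class XmlTreeSketch {

    // Parses an XML string into a DOM tree; failures become unchecked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Appends one line per element, indented by depth; an element whose only
    // child is a text node also gets that text after a colon.
    static void outline(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // text, comments, etc. are not listed on their own
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName());
        Node first = node.getFirstChild();
        boolean textLeaf = first != null && first.getNextSibling() == null
                && first.getNodeType() == Node.TEXT_NODE;
        if (textLeaf) {
            out.append(": ").append(first.getNodeValue().trim());
        }
        out.append('\n');
        for (Node child = first; child != null; child = child.getNextSibling()) {
            outline(child, depth + 1, out);
        }
    }

    static String outline(String xml) {
        StringBuilder out = new StringBuilder();
        outline(parse(xml).getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team>"
                + "<team current=\"no\">Mantova</team></player>";
        System.out.print(outline(xml));
    }
}
```

The single root shows up as the only unindented line, and siblings share the same indentation.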

24

Narrative-Organised XML Documents

  <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.

    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

    ...
  </biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated with an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and ...

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
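As an illustration of attributes acting as qualifiers of nodes, the sketch below looks up the team whose current attribute has a given value. It assumes the JDK's built-in DOM parser; AttributeSketch and teamWithCurrent are hypothetical names introduced here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class AttributeSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Returns the text of the first <team> whose "current" attribute equals
    // the given value, or null when there is none.
    static String teamWithCurrent(String xml, String value) {
        NodeList teams = parse(xml).getElementsByTagName("team");
        for (int i = 0; i < teams.getLength(); i++) {
            Element team = (Element) teams.item(i);
            // getAttribute returns "" when the attribute is absent
            if (value.equals(team.getAttribute("current"))) {
                return team.getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Double quotes, single quotes and spaces around '=' are all accepted.
        String xml = "<player><team current=\"yes\">Bologna</team>"
                + "<team current = 'no'>Mantova</team></player>";
        System.out.println(teamWithCurrent(xml, "yes"));
        System.out.println(teamWithCurrent(xml, "no"));
    }
}
```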

26

Using Elements or Attributes?

Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any ...

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>

27

XML Names

XML names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved for namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore

28

Parsed Character Data

An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

29

Entity References

&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

30

Pre-defined XML Entities

  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote
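When text is generated by a program, the five pre-defined entities are typically applied by a small escaping routine. The following is a minimal sketch; EscapeSketch and escape are hypothetical names, and a real application would normally let an XML library do the escaping.

```java
// Hypothetical helper class, not part of the slides.
class EscapeSketch {

    // Replaces each character that has a pre-defined XML entity
    // with the corresponding entity reference.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("<superheroes>" + escape("Batman & Robin") + "</superheroes>");
    }
}
```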

31

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
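The last point can be checked directly: the same character data, written once with entity references and once inside a CDATA section, yields the same text after parsing. A sketch, assuming the JDK's DOM parser; CdataSketch and textOf are illustrative names only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class CdataSketch {

    // Returns the text content of the root element of the given document.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<code>if (a &lt; b &amp;&amp; c) run();</code>";
        String cdata = "<code><![CDATA[if (a < b && c) run();]]></code>";
        System.out.println(textOf(escaped));
        System.out.println(textOf(escaped).equals(textOf(cdata)));
    }
}
```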

32

Comments

Easy
  <!-- Comment -->
A comment cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents, use processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target ... ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php ... ?>
  <?xml-stylesheet ... ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root

34

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be the very first thing in the document
    not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, ...)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8)
Standalone says whether the document does without an external DTD
  optional, default no

35

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  ...
Tools on the Web
  just look around
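Besides online tools, any non-validating parser can serve as a well-formedness checker, since it must reject documents that break these rules. A sketch using the JDK's DOM parser; WellFormedSketch and isWellFormed are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the slides.
class WellFormedSketch {

    // Returns true when the parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness rule was violated
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<em><b>Right</b> XML</em>"));
        System.out.println(isWellFormed("<em><b>Wrong</em> XML</b>"));
    }
}
```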

36

DTD

37

Flexibility or

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional

38

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

39

DTD is...

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still

40

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example and comment

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

Note: the name in the DOCTYPE declaration must match the root element (player) for the document to be valid

41

DTD Declaration

DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://...">

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

42

DTD Declarations

So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contains one name, one surname and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

44

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements

45

Attribute Declarations

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
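To see these declarations at work, a validating parser can be asked to check documents against the internal DTD. The sketch below assumes the JDK's DOM parser with DTD validation switched on; DtdValidationSketch, DTD and isValid are hypothetical names. The DOCTYPE is named player here so that it matches the root element, which validity requires.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the slides.
class DtdValidationSketch {

    // Internal DTD subset, to be prepended to a <player> document.
    static final String DTD =
            "<!DOCTYPE player [\n"
            + "  <!ELEMENT player (name, surname, team+)>\n"
            + "  <!ELEMENT name (#PCDATA)>\n"
            + "  <!ELEMENT surname (#PCDATA)>\n"
            + "  <!ELEMENT team (#PCDATA)>\n"
            + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
            + "]>\n";

    // Returns true when the document is well-formed and valid against its DTD.
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            final boolean[] valid = { true };
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    valid[0] = false; // validity errors are reported here
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return valid[0];
        } catch (SAXException e) {
            return false; // not even well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ok = DTD + "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = DTD + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(isValid(ok));
        System.out.println(isValid(noTeam));
    }
}
```

A player without any team, a team without the current attribute, or a current value other than yes or no should all be reported as invalid, even though each of those documents is well-formed.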

46

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

47

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS

48

Other DTD Declarations

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

49

Namespaces

50

What are Namespaces?

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

51

Syntax for Namespaces

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

52

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but usually svg, rdf and other customary prefixes are not changed

53

Setting Default Namespaces

xmlns attribute
  alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="..." height="...">...</svg>
  all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
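A namespace-aware parser makes the point tangible: what identifies an element is the pair (namespace URI, local name), not the prefix. The sketch below assumes the JDK's DOM parser with namespace awareness enabled; NamespaceSketch and its methods are hypothetical names. It compares a prefixed and a default-namespace SVG set element with a MathML one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class NamespaceSketch {

    static Element root(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Returns "{namespace-URI}local-name" for the first child element of the root.
    static String firstChildName(String xml) {
        for (Node n = root(xml).getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return "{" + n.getNamespaceURI() + "}" + n.getLocalName();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String prefixed = "<svg:svg xmlns:svg=\"http://www.w3.org/2000/svg\"><svg:set/></svg:svg>";
        String defaulted = "<svg xmlns=\"http://www.w3.org/2000/svg\"><set/></svg>";
        String mathml = "<m:math xmlns:m=\"http://www.w3.org/1998/Math/MathML\"><m:set/></m:math>";
        System.out.println(firstChildName(prefixed));
        System.out.println(firstChildName(defaulted));
        System.out.println(firstChildName(mathml));
    }
}
```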

54

Internationalisation

55

What does Text Mean?

"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
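The difference between a character set and its encodings shows up as soon as the same string is turned into bytes. A small sketch using only java.nio.charset; EncodingSketch and byteCount are hypothetical names, and the sample word is written with a \u escape so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the slides.
class EncodingSketch {

    // Number of bytes the text occupies in the given encoding.
    static int byteCount(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String word = "Universit\u00e0"; // 10 characters, the last one is a-grave
        System.out.println(byteCount(word, StandardCharsets.UTF_8));      // 11
        // UTF_16BE is used instead of UTF_16 to leave out the byte-order mark
        System.out.println(byteCount(word, StandardCharsets.UTF_16BE));   // 20
        System.out.println(byteCount(word, StandardCharsets.ISO_8859_1)); // 10
    }
}
```

Same code points, three different byte sequences: a reader that guesses the wrong encoding will misread the accented character, which is why the declaration matters.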

56

The XML Encoding Declaration

Part of the XML declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

57

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, ...
  if not there too, such as for user-defined tags
    prefix x-

58

Encoding for Portability

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

59

XML & CSS

60

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

62

CSS: An Example

63

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; ... }

64

XML + CSS Example: The XML Doc

65

Example: How Mozilla Visualises It

66

Example: How Mozilla Visualises It

67

DOM & SAX

68

Manipulating XML

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
  access / delete / modify
parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

69

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

70

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex...
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

71

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes they should / could have

72

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java

73
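In the Java binding, the properties above become getter methods (getNodeName, getNodeValue, getNodeType, getAttributes), and appendChild keeps its name. The following sketch, assuming the JDK's org.w3c.dom implementation, adds a team to a player document; DomUpdateSketch and addTeam are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class DomUpdateSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Parses the document and appends <team current="...">name</team> to the root.
    static Document addTeam(String xml, String name, String current) {
        Document doc = parse(xml);
        Element team = doc.createElement("team");   // new element node
        team.setAttribute("current", current);      // qualifies the node
        team.appendChild(doc.createTextNode(name)); // text node as its content
        doc.getDocumentElement().appendChild(team); // goes after existing children
        return doc;
    }

    public static void main(String[] args) {
        Document doc = addTeam(
                "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>",
                "Mantova", "no");
        Node last = doc.getDocumentElement().getLastChild();
        System.out.println(last.getNodeName());  // name of the node
        System.out.println(last.getNodeType() == Node.ELEMENT_NODE); // type of the node
        System.out.println(last.getAttributes().getNamedItem("current").getNodeValue());
        System.out.println(last.getTextContent());
    }
}
```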

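A Simple Java DOM Example

  // Assumptions: DOMParser is the Xerces class org.apache.xerces.parsers.DOMParser;
  // Document and Node come from org.w3c.dom, PrintStream from java.io;
  // print(Node, PrintStream) is a helper that is not shown on the slide.
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // look for the first "recipe" child of the root element
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }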

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

75

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
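The event-based model can be sketched with the SAX parser shipped with the JDK: a handler receives callbacks while the text flows through the parser, and no tree is ever built. SaxEventsSketch and events are hypothetical names; the handler simply logs the events it receives.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the slides.
class SaxEventsSketch {

    // Parses the document and returns the sequence of SAX events as strings.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                log.add("startDocument");
            }

            @Override
            public void endDocument() {
                log.add("endDocument");
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                log.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // a parser is allowed to deliver one text run in several calls
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

Since the handler only keeps what it chooses to keep, memory use does not grow with the size of the document, which is the trade-off against DOM described above.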

76

Page 7: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

XML X for Basic idea of XML

a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages

ThenXML is quite free in generalit can be ldquoextended

actually specialisedto define more specific ad hoc markup languages

No predefined XML markups as it happens instead in HTML

they need to be defined7

Hey too many Application domains are more and more

numerouscomplexspecific

Special specialised languages as the engineers tools

to represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

8

XML ApplicationsXML per se is ldquosmallrdquo amp simple

languages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like

12

How to Work with

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML It can be

A text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

14

How does XML Who handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets

15

Where is XML

Everywhere already

16

Some History of XML Lot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGML

but had obvious limitationstoo complex

more than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

18

A Simple XML

ltplayergt Carlo Nervoltplayergt

19

XML Document amp

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervo

tags are markupthe most common form of markup but there are other kinds

content is character dataincluding the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed

24

Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML AttributesElements can be labelled by attributes

attributes are specified in the start tagand in the only tag of empty elements

any number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=value

alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML document

but they are qualifiers for the nodes and

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the two

Element-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore

28

Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsed

unless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or another comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to a computational agent: processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions serve this very purpose
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root element
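Whether comments and processing instructions survive parsing depends on the parser and API in use. The sketch below, using two modules from Python's standard library, shows one API that keeps them as nodes (xml.dom.minidom) and one that by default silently drops them (xml.etree.ElementTree). The stylesheet name style.css is a placeholder.

```python
import xml.etree.ElementTree as ET
from xml.dom import Node
from xml.dom.minidom import parseString

SOURCE = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    "<!-- a comment before the root element -->"
    "<player>Carlo Nervo</player>"
)

# minidom keeps both the processing instruction and the comment as nodes.
document = parseString(SOURCE)
kinds = [child.nodeType for child in document.childNodes]
instruction = document.childNodes[0]
print(instruction.target, "|", instruction.data)

# ElementTree's default tree builder only returns elements.
elements_seen = [element.tag for element in ET.fromstring(SOURCE).iter()]
print(elements_seen)
```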

34

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but, if present, it must be the very first thing in the document
  not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 with a byte order mark)
Standalone says whether the document depends on an external DTD (standalone="yes" means it does not)
  optional, default no

35

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
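Any non-validating parser can serve as a well-formedness checker, since it must reject documents that break the rules above. A minimal sketch with Python's xml.etree.ElementTree; the helper name is_well_formed and the sample documents are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


SAMPLES = [
    "<player><name>Carlo</name></player>",   # fine
    "<em><b>Wrong </em> XML</b>",            # overlapping elements
    "<a/><b/>",                              # two root elements
    "<team current=yes>Bologna</team>",      # unquoted attribute value
    '<team current="yes" current="no"/>',    # duplicated attribute
    "<player></Player>",                     # case mismatch in tags
    "<p>a < b</p>",                          # unescaped < in character data
]
for sample in SAMPLES:
    print(is_well_formed(sample), sample)
```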

36

DTD

37

Flexibility or

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
  like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity, unlike well-formedness, is not mandatory
  how to handle validity errors is optional

38

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD does not say
  we stick with what we can specify

39

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them and how to write your own is likely to be useful, or maybe necessary, for a while still

40

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the following example and comment on it

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

41

DTD Declaration

In the example, the DTD is declared as internal, between the square brackets of the DOCTYPE declaration
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

42

DTD Declarations

So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs

43

Element Declarations

A player element contains one name, one surname and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>

44

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
Suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
Parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
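Since a content model is essentially a regular expression over child element names, the operators above can be tried out by translating a model into a Python regex. This is only a teaching sketch, not how a validating parser is implemented: the function names are invented, and #PCDATA, ANY, EMPTY and namespaces are deliberately ignored.

```python
import re
import xml.etree.ElementTree as ET


def content_model_to_regex(model: str) -> re.Pattern:
    """Translate e.g. "(name, surname, team+)" into a compiled regex."""
    # Every element name becomes a group matching "name;", so that the
    # DTD suffixes * + ? and the choice operator | keep their meaning.
    pattern = re.sub(
        r"[A-Za-z_][\w.\-]*",
        lambda match: "(?:" + re.escape(match.group(0)) + ";)",
        model,
    )
    # Sequence is plain concatenation in a regex: drop commas and spaces.
    return re.compile(re.sub(r"[\s,]+", "", pattern))


def children_match(element: ET.Element, model: str) -> bool:
    """Check the child elements of `element` against a content model."""
    sequence = "".join(child.tag + ";" for child in element)
    return content_model_to_regex(model).fullmatch(sequence) is not None


MODEL = "(name, surname, team+)"
valid = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    "<team>Bologna</team><team>Mantova</team></player>"
)
no_team = ET.fromstring("<player><name>Carlo</name><surname>Nervo</surname></player>")
print(children_match(valid, MODEL), children_match(no_team, MODEL))
```

Ending each name with ";" (a character that cannot occur in an XML name) keeps a declared name such as team from matching the beginning of a longer one.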

45

Attribute Declarations

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

  <!ATTLIST team current (yes | no) #REQUIRED>

46

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
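Attribute defaults declared in the internal DTD subset are applied even by non-validating parsers. The sketch below relies on that behaviour in the expat parser behind Python's xml.etree.ElementTree; the sport and city attributes are invented for the example and are not part of the slides' DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical declarations: a literal default, a #FIXED value, an #IMPLIED one.
DOCUMENT = """<!DOCTYPE player [
  <!ATTLIST team current (yes | no) "no">
  <!ATTLIST team sport CDATA #FIXED "football">
  <!ATTLIST team city CDATA #IMPLIED>
]>
<player><team current="yes">Bologna</team><team>Mantova</team></player>"""

teams = ET.fromstring(DOCUMENT).findall("team")
for team in teams:
    # "city" never shows up: #IMPLIED provides no value when it is omitted.
    print(team.text, team.attrib)
```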

47

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS

48

Other DTD

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

49

Namespaces

50

What are Namespaces?

Distinguish
  different XML applications may use the same names
  at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
  to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

51

Syntax for Namespaces

Qualified names
  prefix:local_part
  also known as QNames or raw names
Examples of qualified names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

52

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not
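A namespace-aware parser resolves each prefix to its URI, and it is the URI that identifies the name. The sketch below uses Python's xml.etree.ElementTree, which reports names as {URI}local_part. The office elements are invented for the example, and the prefix it used in the lookup is deliberately different from the one in the document.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<company xmlns:local="http://www.company.it/xml"
         xmlns:world="http://www.company.com/xml">
  <local:office>Cesena</local:office>
  <world:office>New York</world:office>
</company>"""

root = ET.fromstring(DOCUMENT)
# Same local part, different namespaces: the two names do not clash.
tags = [child.tag for child in root]
print(tags)

# The prefix is just a local shorthand: any prefix bound to the same URI works.
italian_office = root.find("it:office", {"it": "http://www.company.it/xml"})
print(italian_office.text)
```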

53

Setting Default Namespaces

xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
All the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
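To see that a default namespace and an explicit prefix are two spellings of the same thing, the sketch below parses both forms with Python's xml.etree.ElementTree and compares the expanded names. The prefix s and the rect child are chosen for the example; the URI is the SVG namespace.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"

with_default = ET.fromstring(f'<svg xmlns="{SVG}"><rect/></svg>')
with_prefix = ET.fromstring(f'<s:svg xmlns:s="{SVG}"><s:rect/></s:svg>')

default_names = [element.tag for element in with_default.iter()]
prefixed_names = [element.tag for element in with_prefix.iter()]
print(default_names)
print(default_names == prefixed_names)
```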

54

Internationalisation

55

What does Text

“Text” can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
  character set
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared

56

The XML Encoding Declaration

Part of the XML declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
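The difference between a code point and its encodings, and the role of the encoding declaration, can be observed with a single accented character. A sketch in Python: the school element is invented, and the two documents differ only in their bytes and in the encoding they declare.

```python
import xml.etree.ElementTree as ET

character = "à"
code_point = ord(character)               # one code point in the character set
utf8 = character.encode("utf-8")          # two bytes in UTF-8
utf16 = character.encode("utf-16-be")     # two different bytes in UTF-16
latin1 = character.encode("iso-8859-1")   # a single byte in Latin-1
print(code_point, utf8, utf16, latin1)

# Same text, two encodings: the declaration tells the parser how to read the bytes.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><school>Università</school>'
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="utf-8"?><school>Università</school>'
).encode("utf-8")
from_latin1 = ET.fromstring(latin1_doc).text
from_utf8 = ET.fromstring(utf8_doc).text
print(from_latin1 == from_utf8)
```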

57

Multi-Lingual

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there either, such as for user-defined tags
    prefix x-

58
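A tool such as the spell-checker mentioned above needs the language in force for each element. Per the XML specification, xml:lang applies to the element carrying it and to its content until overridden, which the sketch below models by passing the inherited value down the tree. The function name and sample document are illustrative; xml.etree.ElementTree exposes the attribute under the reserved xml namespace URI.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(element, inherited=None):
    """Yield (tag, language) pairs, inheriting xml:lang from ancestors."""
    language = element.get(XML_LANG, inherited)
    yield element.tag, language
    for child in element:
        yield from languages(child, language)


DOCUMENT = """<doc xml:lang="en">
  <p>Good morning</p>
  <p xml:lang="it">Buongiorno <em>a tutti</em></p>
</doc>"""

pairs = list(languages(ET.fromstring(DOCUMENT)))
print(pairs)
```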

Encoding for Portability

Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

59

XML & CSS

60

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
  spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

62

CSS An Example

63

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS with XML
  <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM & SAX

68

Manipulating XML

Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Most often, we need to manipulate
  access, delete, modify
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used, hated, deprecated, widespread are DOM and SAX

69

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

70

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

71

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes each of them should / could have

72

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java

73
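The general properties and methods listed above look much the same in every DOM binding. As a sketch, here they are in Python's xml.dom.minidom, a lightweight DOM implementation from the standard library chosen here only for brevity; the player document is the running example of these slides.

```python
from xml.dom.minidom import parseString

document = parseString(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    '<team current="yes">Bologna</team></player>'
)
root = document.documentElement
first_team = root.getElementsByTagName("team")[0]

# General properties: every node has a name, a value and a type.
summary = (root.nodeName, root.nodeValue, root.nodeType == root.ELEMENT_NODE)
# "attributes" gives access to all the attributes of a node.
current = first_team.attributes["current"].value

# appendChild adds the new node after the existing children.
new_team = document.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(document.createTextNode("Mantova"))
root.appendChild(new_team)

serialised = root.toxml()
print(serialised)
```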

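A Simple Java DOM Example

// Assumption: DOMParser is the Xerces parser class; the slide does not show the imports.
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// The method below belongs inside a class, which is not shown on the slide.
public static void main(String[] args) {
  try {
    // Parse the file named on the command line into an in-memory DOM tree
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // Scan the children of the root element for the first "recipe" node
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    // Wrap that node, if any, into a new collection document
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}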

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

75

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM is
  in terms of standardisation
  as well as of tools
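The event-based model can be sketched with the SAX binding in Python's standard library (xml.sax). The handler class EventLogger is invented for the example: it simply records the events it receives and never builds a tree. Character data may be delivered in several chunks, so it is collected separately.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record SAX events in the order in which the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.text_parts = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        # May be called more than once for a single run of text.
        self.text_parts.append(content)


handler = EventLogger()
xml.sax.parseString(
    b"<player><name>Carlo</name><team>Bologna</team></player>", handler
)
print(handler.events)
print("".join(handler.text_parts))
```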

76


Hey too many

Application domains are more and more
  numerous
  complex
  specific
Special / specialised languages as the engineer's tools
  to represent, denote & express behaviours and computations
Engineers working with computational / ICT systems will be called to use a number of different artificial languages, but also
  to know and understand computational models and paradigms
  to select languages and paradigms
  to define and build new languages

8

XML Applications

XML per se is “small” & simple
  languages defined via XML are instead so many and complex
XML Applications
  XML-defined markup languages
  defined through a precise syntax
    DTD or XML Schema
  they may be either standard or custom
Most standard XML applications are W3C
  such as
    XSLT
    XML Schema
    XHTML

9

XML for Portable Data

Cross-platform, long-term data format
  passing XML data through space and time
  along with Unicode and text-based standard format
Text, text, text
  both data and markup
  all in the XML file
XML document structure simple & clear
  easy to parse
  well-documented
That is why XML is already everywhere

10

How XML Looks like

<?xml version="1.0" encoding="utf-8"?>
<docroot>
  <head>
    <title>This is my document</title>
  </head>
  <body>
    <p>A list of things I like</p>
    <list>
      <item>weekends</item>
      <item>good beer</item>
      <item>midnight snacks</item>
      <item>ice cream
        <list>
          <item>chocolate</item>
          <item>cookie dough</item>
          <item>white russian</item>
        </list>
      </item>
      <item>shade trees</item>
    </list>
  </body>
</docroot>

11

How XML Looks like

12

How to Work with XML

XML is text
  so any text editor is perfectly fine
A number of XML editors around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works

13

What is an XML Document

It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text

<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber>0000145678</studentnumber>
  <course>2036</course>
</student>

14

How does XML Work

Who handles XML documents
  after they have been produced
  how, why
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets

15

Where is XML

Everywhere already

16

Some History of XML

Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but had obvious limitations
  too complex
    more than 150 pages
    never implemented fully
  too complex for the Internet
SGML “Lite” (1996, Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

17

XML Fundamentals

18

A Simple XML Document

<player>
  Carlo Nervo
</player>

19

XML Document &

This is a complete XML document
It can be stored, recorded, built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

<player>
  Carlo Nervo
</player>

20

XML Elements & Tags

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

<player>
  Carlo Nervo
</player>

21

Tag Syntax

Very similar to HTML tags
  at least superficially
  <tag> for start tags, </tag> for end tags
  <tag /> for empty tags
    tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

22

XML Trees: A Simple Example

player
  name
    Carlo
  surname
    Nervo
  team
    Bologna
  team
    Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

23

An XML Document is a Tree

An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
  each element except the root has exactly one parent
  child elements of the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
  nesting needs to be perfect, overlapping not allowed
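The tree structure of the player document can be made visible by walking it recursively. A sketch with Python's xml.etree.ElementTree; the function name outline is made up, and only elements are listed (text content and attributes are left out for brevity).

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return one indented line per element, each parent before its children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


player = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    '<team current="yes">Bologna</team><team current="no">Mantova</team></player>'
)
tree_lines = outline(player)
print("\n".join(tree_lines))
```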

24

Narrative-Organised

<biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
</biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
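One reason mixed content is awkward to process is that text ends up scattered between child elements. In Python's xml.etree.ElementTree, for instance, the text following a child is stored on that child as its tail. The sketch below uses a shortened, invented version of the biography.

```python
import xml.etree.ElementTree as ET

biography = ET.fromstring(
    "<biography><name>Carlo Nervo</name> played for "
    "<football_team>Mantova</football_team> and then "
    "<football_team>Bologna</football_team>.</biography>"
)

# Each child carries its own text plus the narrative that follows it.
pieces = [(child.tag, child.text, child.tail) for child in biography]
for piece in pieces:
    print(piece)

# The running text has to be stitched back together from all the fragments.
plain_text = "".join(biography.itertext())
print(plain_text)
```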

25

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can in principle be associated with an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

26

Using Elements or Attributes

Attributes are for meta-data about the element, and content is the information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore

28

Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsed

unless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

CommentsEasy

lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone else

like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute value

NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFS

48

Other DTD

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

50

What are Distinguish

different XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes Example

a large firm could have a number of namespaces for different purposes

ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not

53

Setting Default

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Declaration

Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
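A small sketch of why the encoding declaration matters, using Python's xml.etree.ElementTree on raw bytes. The city element is invented for the example. The same Latin-1 bytes parse correctly when the declaration names the encoding, and are rejected when the parser has to fall back on its UTF-8 default.

```python
import xml.etree.ElementTree as ET

# The same text, serialised as ISO-8859-1 bytes, with and without a declaration.
declared = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
            '<city>Forlì</city>').encode("iso-8859-1")
undeclared = "<city>Forlì</city>".encode("iso-8859-1")

# With the declaration the parser decodes the bytes as Latin-1.
city = ET.fromstring(declared).text

# Without it the bytes are read as UTF-8, where they are not a valid sequence.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(city, undeclared_ok)
```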

57

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-

58
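How a tool such as a spell-checker might look up the language of a subpart can be sketched as follows. The document and the helper language_of are hypothetical; the sketch assumes the usual XML rule that an xml:lang value also applies to the content of the element that carries it, so the nearest declaration on the element or on an ancestor wins.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """<doc xml:lang="en">
  <p>Good morning</p>
  <p xml:lang="it">Buongiorno</p>
  <p xml:lang="x-klingon">nuqneH</p>
</doc>"""

root = ET.fromstring(DOC)
# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}

def language_of(element):
    """Return the nearest xml:lang found on the element or its ancestors."""
    node = element
    while node is not None:
        if XML_LANG in node.attrib:
            return node.attrib[XML_LANG]
        node = parent_of.get(node)
    return None

languages = [language_of(p) for p in root.findall("p")]
print(languages)
```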

Encoding for Portability

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

59

XML & CSS

60

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML however can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

62

CSS: An Example

63

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
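To make the association concrete, the sketch below pairs a small style sheet with a document that points to it, and reads the processing instruction back with Python's xml.dom.minidom. The file name player.css, the CSS rules and the document are all assumptions made up for the example.

```python
from xml.dom import minidom

# Hypothetical style sheet: one rule per element type of the XML document.
CSS = """player        { display: block; margin: 1em; }
name, surname { font-weight: bold; }
team          { display: block; font-style: italic; }
"""

# The document refers to the style sheet through a processing instruction
# placed before the root element.
DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="player.css"?>
<player><name>Carlo</name> <surname>Nervo</surname><team>Bologna</team></player>"""

dom = minidom.parseString(DOC)
pi = next(node for node in dom.childNodes
          if node.nodeType == node.PROCESSING_INSTRUCTION_NODE)
print(pi.target)  # the target names the consumer of the instruction
print(pi.data)    # the rest is plain text, by convention attribute-like
```

The parser itself does not interpret type or href; it is the browser that reads them and fetches the style sheet.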

64

XML + CSS Example: The XML Doc

65

Example: How Mozilla Visualises It

66

Example: How Mozilla Visualises It

67

DOM & SAX

68

Manipulating XML

Representing information in an XML Document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Mostly, we need to manipulate, access, delete, modify
  parts of an XML document
  which either may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used, hated, deprecated, widespread are DOM and SAX

69

Document Object Model

http://www.w3.org/DOM
  standard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

70

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

71

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should or could have

72

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes: returns all the attributes of the node
  appendChild(newChild): appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java

73
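The general node properties and methods listed above exist in most DOM bindings. As a sketch in one of them, Python's xml.dom.minidom, the example below reads a node's name, value and type, inspects attributes, and uses appendChild to extend the tree. The player document is the one used elsewhere in these slides; the variable names are illustrative only.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

dom = minidom.parseString(
    '<player><name>Carlo</name><team current="yes">Bologna</team></player>')
player = dom.documentElement
team = player.getElementsByTagName("team")[0]

# Every node has a name, a value and a type.
summary = (player.nodeName, player.nodeValue, player.nodeType == Node.ELEMENT_NODE)

# attributes returns the attributes of the node as a map of attribute nodes.
current = team.attributes["current"].value

# appendChild adds the new node after the existing children.
new_team = dom.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(dom.createTextNode("Mantova"))
player.appendChild(new_team)

print(summary, current)
print(player.toxml())
```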

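A Simple Java DOM Example

// Assumes the Xerces parser (org.apache.xerces.parsers.DOMParser),
// org.w3c.dom.Document, org.w3c.dom.Node and java.io.PrintStream are imported;
// print(Node, PrintStream) is a helper that is not shown on the slide.
public static void main(String[] args) {
  try {
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // look for the first child of the root element named "recipe"
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}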

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

75

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as it goes: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
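The event-based model can be sketched with Python's xml.sax module, which follows the SAX design. The EventLogger class is a made-up handler that simply records the events in arrival order; no tree is ever built, so memory use does not grow with the size of the document.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler: records each SAX event as it arrives."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.text = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        # Character data may arrive in several chunks; just collect it.
        self.text.append(content)

handler = EventLogger()
xml.sax.parseString(
    b"<player><name>Carlo</name><surname>Nervo</surname></player>", handler)
print(handler.events)
print("".join(handler.text))
```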

76


XML Applications

XML per se is "small" & simple
  languages defined via XML are instead so many and complex
XML Applications
  XML-defined markup languages
    defined through a precise syntax
    DTD or XML Schema
  they may be either standard or custom
Most standard XML applications are W3C
  such as
    XSLT
    XML Schema
    XHTML

9

XML for Portable Data

Cross-platform, long-term data format
  passing XML data through space and time
  along with Unicode and text-based standard format
Text, text, text
  both data and markup
  all in the XML file
XML document structure simple & clear
  easy to parse
  well-documented
That is why XML is already everywhere

10

How XML Looks Like

<?xml version="1.0" encoding="utf-8"?>
<docroot>
  <head> <title>This is my document</title> </head>
  <body>
    <p>A list of things I like</p>
    <list>
      <item>weekends</item>
      <item>good beer</item>
      <item>midnight snacks</item>
      <item>ice cream
        <list>
          <item>chocolate</item>
          <item>cookie dough</item>
          <item>white russian</item>
        </list>
      </item>
      <item>shade trees</item>
    </list>
  </body>
</docroot>

11

How XML Looks like

12

How to Work with XML

XML is text
  so any text editor is perfectly fine
A number of XML editors around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works

13

What is an XML Document

It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case it can be handled and transmitted by any system capable of dealing with text

<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber> 0000145678 </studentnumber>
  <course>2036</course>
</student>
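The idea that the same document can be a run-time construction in memory or a piece of text can be sketched with Python's xml.etree.ElementTree: build the student record as objects, serialise it to text, and parse the text back. This is only one possible API; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Build the document in memory, element by element.
student = ET.Element("student")
studentname = ET.SubElement(student, "studentname")
ET.SubElement(studentname, "name").text = "Carlo"
ET.SubElement(studentname, "surname").text = "Nervo"
ET.SubElement(student, "studentnumber").text = "0000145678"
ET.SubElement(student, "course").text = "2036"

# Serialise to plain text: this string could go to a file, a database
# record or a network connection.
as_text = ET.tostring(student, encoding="unicode")
print(as_text)

# Any system able to deal with text can turn it back into a structure.
reparsed = ET.fromstring(as_text)
surname = reparsed.find("studentname/surname").text
```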

14


How does XML Work

Who handles XML documents
  after they have been produced
  how, why
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets

15

Where is XML

Everywhere already

16

Some History of XML

Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
    but had obvious limitations
  SGML itself was too complex
    more than 150 pages
    never implemented fully
  too complex for the Internet
SGML "Lite" (1996, Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then, a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

17

XML Fundamentals

18

A Simple XML Document

<player> Carlo Nervo</player>

19

XML Document & Files

This is a complete XML document
It can be stored, recorded or built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

<player> Carlo Nervo</player>

20

XML Elements & Tags

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

<player> Carlo Nervo</player>

21

Tag Syntax

Very similar to HTML tags
  at least superficially
  <tag> for start tags, </tag> for end tags
  <tag /> for empty tags
    tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

22

XML Trees: A Simple Example

player
  name: Carlo
  surname: Nervo
  team: Bologna
  team: Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

23

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML Document is a Tree

An XML Document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
  each element except the root has exactly one parent
  child elements from the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed
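The tree view of the player document can be made visible with a short recursive walk. The sketch uses Python's xml.etree.ElementTree; the helper outline is hypothetical and simply returns one indented line per element, parents before children.

```python
import xml.etree.ElementTree as ET

DOC = ('<player><name>Carlo</name><surname>Nervo</surname>'
       '<team current="yes">Bologna</team>'
       '<team current="no">Mantova</team></player>')

def outline(element, depth=0):
    """Return one indented line per element, visiting parents before children."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (" = " + text if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

The single root, the four sibling children and the leaf content all show up directly in the indentation.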

24

Narrative-Organised XML Documents

<biography><name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.

After playing many years in minor teams, such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

…

</biography>

XML Documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
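A brief sketch of the attribute rules, using Python's xml.etree.ElementTree: the alternative quoting and spacing forms yield the same name-value pair, and a repeated attribute name is rejected as not well-formed. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Double quotes, single quotes and spaces around "=" are equivalent forms.
double_quoted = ET.fromstring('<team current="yes">Bologna</team>')
single_quoted = ET.fromstring("<team current = 'yes'>Bologna</team>")
print(double_quoted.attrib, single_quoted.attrib)

# Only one attribute with a given name is allowed per element.
try:
    ET.fromstring('<team current="yes" current="no">Bologna</team>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)
```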

26

Using Elements or Attributes

Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, whereas child elements of the same type may be repeated

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

27

XML Names

XML Names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin, or even non-latin like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore

28

Parsed Character Data

An XML Parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; char
E.g. the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

29

Entity References

&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
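As a sketch of a user-defined entity, the document below declares an entity in an internal DTD subset and refers to it from the content; Python's xml.etree.ElementTree expands such internal entities while parsing. The entity name course and its value are invented for the example.

```python
import xml.etree.ElementTree as ET

# The entity is declared outside the element tree, in the DOCTYPE,
# and used inside it through the reference &course;
DOC = """<!DOCTYPE page [
  <!ENTITY course "XML Concepts">
]>
<page><footer>Slides for &course;</footer></page>"""

footer = ET.fromstring(DOC).find("footer").text
print(footer)

# A reference to an entity that was never declared is an error.
try:
    ET.fromstring("<page>&unknown;</page>")
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)
```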

30

Pre-defined XML Entities

Markup    Entity   Description
&lt;      <        less-than
&gt;      >        greater-than
&amp;     &        ampersand
&quot;    "        double quote
&apos;    '        single quote
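The pre-defined entities can be exercised with Python's standard library: xml.sax.saxutils.escape produces the references for <, > and &, and any XML parser turns all five references back into characters. The code fragment used as input is arbitrary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "if (a < b && b > c)"

# escape() replaces &, < and > with the corresponding entity references.
escaped = escape(raw)
print(escaped)

# Parsing gives back the original characters as parsed character data.
parsed = ET.fromstring("<code>" + escaped + "</code>").text

# All five pre-defined references are understood without any declaration.
all_five = ET.fromstring("<t>&lt;&gt;&amp;&quot;&apos;</t>").text
print(parsed, all_five)
```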

31

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good for instance to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own closing delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
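That a CDATA section leaves no trace after parsing can be checked directly. In the sketch below the same text is written once inside a CDATA section and once with entity references; with Python's xml.etree.ElementTree both produce identical character data.

```python
import xml.etree.ElementTree as ET

# Same content, two spellings: raw inside CDATA, escaped outside.
with_cdata = ET.fromstring(
    "<code><![CDATA[if (a < b && b > c) run();]]></code>")
with_entities = ET.fromstring(
    "<code>if (a &lt; b &amp;&amp; b &gt; c) run();</code>")

print(with_cdata.text)
print(with_cdata.text == with_entities.text)
```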

32

Comments

Easy
  <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to a computational agent: processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root
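The freedom parsers have with comments can be observed by feeding the same document to two APIs from Python's standard library. In this sketch, xml.etree.ElementTree with its default settings leaves comments out of the tree, while xml.dom.minidom keeps both comments and processing instructions as nodes. The render target is a made-up name.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom
from xml.dom.minidom import Node

DOC = ('<?xml version="1.0"?>'
       '<!-- before the root -->'
       '<?render mode="draft"?>'
       '<player><!-- inside --><name>Carlo</name></player>')

# ElementTree (default settings): only the element survives.
et_children = [child.tag for child in ET.fromstring(DOC)]

# minidom: comments and processing instructions are nodes of the document.
dom = minidom.parseString(DOC)
top_level = [node.nodeType for node in dom.childNodes]
inside = [node.nodeType for node in dom.documentElement.childNodes]

print(et_children)
print(top_level, inside)
```

An application that relies on information carried in comments would therefore break with the first parser, which is exactly what processing instructions are meant to avoid.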

34

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if there, it should be the first thing in the document, absolutely
    not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode
Standalone set to yes means that the document does not depend on an external DTD
  optional, default no

35

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
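Any non-validating parser doubles as a well-formedness checker. The sketch below wraps Python's xml.etree.ElementTree in a hypothetical helper is_well_formed and runs it over one small sample per rule; the sample documents are invented for the purpose.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

SAMPLES = {
    "properly nested": "<em><b>Right</b> XML</em>",
    "overlapping elements": "<em><b>Wrong </em> XML</b>",
    "two root elements": "<a/><b/>",
    "unquoted attribute value": "<team current=yes/>",
    "duplicate attribute": '<team current="yes" current="no"/>',
    "case mismatch in tags": "<player></Player>",
    "unescaped ampersand": "<p>Batman & Robin</p>",
    "comment inside a tag": "<p <!-- no --> >text</p>",
}

results = {label: is_well_formed(doc) for label, doc in SAMPLES.items()}
for label, ok in results.items():
    print(label, "->", "well-formed" if ok else "NOT well-formed")
```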

36

DTD

37

Flexibility or

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like, a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle validity errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

39

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still

40

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example below and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

41

DTD Declaration

DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

42

DTD Declarations

So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contains one name, one surname and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

44

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
Suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
Parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty element
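The sequence, choice and suffix operators behave much like regular-expression operators applied to the list of child elements. The sketch below is not a validator: it assumes the content model (name, surname, team+) has been translated by hand into a Python regular expression over the child tag names, and the helper children_match is hypothetical. It also checks the enumerated current attribute in the same spirit.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (name, surname, team+): each child tag is followed
# by a space, so "team+" becomes "(team )+".
PLAYER_MODEL = re.compile(r"name surname (team )+")

def children_match(element, model):
    """Check the sequence of child tag names against a hand-written model."""
    sequence = "".join(child.tag + " " for child in element)
    return model.fullmatch(sequence) is not None

valid = ET.fromstring('<player><name>Carlo</name><surname>Nervo</surname>'
                      '<team current="yes">Bologna</team>'
                      '<team current="no">Mantova</team></player>')
no_team = ET.fromstring('<player><name>Carlo</name>'
                        '<surname>Nervo</surname></player>')
wrong_order = ET.fromstring('<player><surname>Nervo</surname><name>Carlo</name>'
                            '<team current="yes">Bologna</team></player>')

checks = [children_match(doc, PLAYER_MODEL)
          for doc in (valid, no_team, wrong_order)]

# Enumerated attribute (yes | no) declared as required on every team.
attributes_ok = all(team.get("current") in {"yes", "no"}
                    for team in valid.findall("team"))
print(checks, attributes_ok)
```

A real validating parser derives these checks from the DTD itself; the hand translation is only meant to show what the operators express.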

45

Attribute Declarations

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

46

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

47

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID of some element in the document

48

Other DTD Declarations

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

49



XML for Portable

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like

12

How to Work with

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML It can be

A text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

14

How does XML Who handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets

15

Where is XML

Everywhere already

16

Some History of XML Lot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGML

but had obvious limitationstoo complex

more than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

18

A Simple XML

ltplayergt Carlo Nervoltplayergt

19

XML Document &

This is a complete XML document
It can be stored, recorded, or built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

<player>
  Carlo Nervo
</player>

XML Elements &

The document contains a single element
  of type player
Such an element is delimited by the player tags
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo and Nervo

<player>
  Carlo Nervo
</player>

Tag Syntax

Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

                 player
  name     surname     team       team
  Carlo    Nervo       Bologna    Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML Document is a Tree

An XML document has a tree-like structure
  one and only one root
    the root element, or document element
  each element can have one or more child elements
  each element except the root has exactly one parent
  child elements of the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
  nesting needs to be perfect: overlapping is not allowed

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
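
The nesting rule can be checked mechanically by handing the text to a parser. The following is a minimal sketch, assuming the JAXP parser that ships with the JDK; the names WellFormedCheck and isWellFormed are illustrative and do not come from the slides.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {

    /** Returns true if the text parses as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a well-formedness error
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Properly nested elements.
        System.out.println(isWellFormed("<em><b>Right</b> XML</em>"));
        // Overlapping elements, as in the slide's counter-example.
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>"));
    }
}
```

The sketch only reports whether parsing succeeded; a real tool would also surface the parser's error message and position.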

Narrative-Organised XML Documents

<biography>
<name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.

After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

…

</biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated with an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before or after the equals sign (=)
  only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
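
To see attributes acting as qualifiers of nodes, one can read them back through a parser. This is a minimal sketch, assuming the DOM API bundled with the JDK; CurrentTeam and currentTeam are hypothetical names, and the second team deliberately uses single quotes to show the alternative form.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CurrentTeam {

    // The player document from the slide, with one attribute in single quotes.
    static final String PLAYER =
            "<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current=\"yes\">Bologna</team>"
            + "<team current='no'>Mantova</team></player>";

    /** Returns the text of the first team whose current attribute is "yes", or null. */
    static String currentTeam(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList teams = doc.getElementsByTagName("team");
            for (int i = 0; i < teams.getLength(); i++) {
                Element team = (Element) teams.item(i);
                // The attribute qualifies the element; it is not a child node.
                if ("yes".equals(team.getAttribute("current"))) {
                    return team.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(currentTeam(PLAYER));
    }
}
```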

Using Elements or Attributes?

Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure; elements can be nested as needed
  attributes are unique within an element any

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

XML Names

XML names are used, and follow the same rules, for the names of elements, attributes, and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen, and period (_ - .)
  a colon (:), which is reserved for namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore
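
The naming rules listed here can be approximated in a few lines. The sketch below follows the rules as stated on the slide, not the exact Name production of the XML specification, and it rejects the colon because it is reserved for namespaces; XmlNameSketch and looksLikeXmlName are illustrative names.

```java
final class XmlNameSketch {

    /** Approximates the slide's rules for XML names (not the full XML grammar). */
    static boolean looksLikeXmlName(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int[] cps = s.codePoints().toArray();
        // First character: a letter (ideographs count as letters) or underscore.
        if (!(Character.isLetter(cps[0]) || cps[0] == '_')) {
            return false;
        }
        // Remaining characters: letters, digits, underscore, hyphen, period.
        for (int i = 1; i < cps.length; i++) {
            int c = cps[i];
            boolean allowed = Character.isLetterOrDigit(c)
                    || c == '_' || c == '-' || c == '.';
            if (!allowed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeXmlName("first_name")); // accepted
        System.out.println(looksLikeXmlName("2nd-team"));   // rejected: starts with a digit
        System.out.println(looksLikeXmlName("my team"));    // rejected: contains white space
    }
}
```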

Parsed Character Data

An XML parser interprets the character sequences it is fed, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse start again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

Entity References

&entityreference;
An entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon: for instance, through DTDs

Pre-defined XML Entities

Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
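
When text is written into an XML document by a program, the five pre-defined entities are typically applied by a small escaping routine. The following is a minimal sketch; XmlEscape and escape are hypothetical names, and a production system would normally rely on its XML library to do this.

```java
final class XmlEscape {

    /** Replaces the five characters that have pre-defined XML entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: <superheroes>Batman &amp; Robin</superheroes>
        System.out.println("<superheroes>" + escape("Batman & Robin") + "</superheroes>");
    }
}
```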

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser: do not parse this
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
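
A quick way to convince oneself that escaped text and a CDATA section carry the same character data is to parse both and compare the resulting text. This is a sketch under the assumption that the JDK's DOM parser is used; CdataVsEscaped and rootText are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

final class CdataVsEscaped {

    // The same code chunk, once with entity references and once in a CDATA section.
    static final String ESCAPED =
            "<code>if (a &lt; b &amp;&amp; b &lt; c) run();</code>";
    static final String CDATA =
            "<code><![CDATA[if (a < b && b < c) run();]]></code>";

    /** Parses the document and returns the character data of its root element. */
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText(ESCAPED));
        System.out.println(rootText(CDATA));
        System.out.println(rootText(ESCAPED).equals(rootText(CDATA)));
    }
}
```

Both documents yield the same string through getTextContent, which is what most applications look at.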

Comments

Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents, use processing instructions

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root
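
Processing instructions that sit outside the root element are still delivered to the application. The sketch below, assuming the JDK's DOM parser, lists the targets of the top-level processing instructions; the class name, the method name, and the audit target are made up for illustration. The XML declaration, although it looks similar, does not show up in the list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

final class ListProcessingInstructions {

    // One instruction before the root and one (hypothetical) after it.
    static final String SAMPLE =
            "<?xml version=\"1.0\"?>"
            + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
            + "<player>Carlo Nervo</player>"
            + "<?audit checked?>";

    /** Returns the targets of the processing instructions found at document level. */
    static List<String> targets(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> found = new ArrayList<>();
            NodeList top = doc.getChildNodes();
            for (int i = 0; i < top.getLength(); i++) {
                Node n = top.item(i);
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    found.add(((ProcessingInstruction) n).getTarget());
                }
            }
            return found;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(targets(SAMPLE));
    }
}
```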

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be the very first thing in the document
    not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the encoding of the text (UTF-8, a Unicode encoding, in the example)
  optional; the default is Unicode (UTF-8 or UTF-16)
Standalone says whether the document does without an external DTD
  optional, default no

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around

DTD

Flexibility or

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle validity errors is optional

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example below and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declaration

The DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declarations

So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements

Attribute Declarations

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
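
The element and attribute declarations of the player DTD can be exercised with a validating parser. The following is a minimal sketch, assuming the JAXP parser bundled with the JDK; DtdValidation and isValid are illustrative names. The DOCTYPE name is player here because validity requires it to match the root element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class DtdValidation {

    // Internal DTD subset with the declarations discussed in the slides.
    static final String DTD =
            "<!DOCTYPE player [\n"
            + "  <!ELEMENT player (name, surname, team+)>\n"
            + "  <!ELEMENT name (#PCDATA)>\n"
            + "  <!ELEMENT surname (#PCDATA)>\n"
            + "  <!ELEMENT team (#PCDATA)>\n"
            + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
            + "]>\n";

    /** Returns true only if the document is well-formed and valid against its DTD. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                // Validity errors are recoverable by default, so rethrow them.
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ok = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(isValid(DTD + ok));
        System.out.println(isValid(DTD + noTeam)); // violates team+
    }
}
```

Note that the second document is still well-formed; it is only rejected because validation is switched on.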

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more permissive than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS

Other DTD Declarations

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces

What are

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for

Qualified names
  prefix:local_part
Examples of qualified names
  also called QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro, and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but by convention svg, rdf, and other well-known prefixes are usually left unchanged

Setting Default Namespaces

xmlns attribute
  alone, with no :prefix suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
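
That the namespace URI, and not the prefix, identifies a name can be observed with a namespace-aware parser. This is a minimal sketch, assuming the JDK's DOM parser; NamespaceDemo and describeRoot are illustrative names, and the s prefix is arbitrary.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class NamespaceDemo {

    static final String SVG_NS = "http://www.w3.org/2000/svg";

    /** Returns "{namespace URI}local name" for the root element of the document. */
    static String describeRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Default namespace, then the same namespace bound to an arbitrary prefix.
        String byDefault = "<svg xmlns=\"" + SVG_NS + "\"/>";
        String byPrefix = "<s:svg xmlns:s=\"" + SVG_NS + "\"/>";
        System.out.println(describeRoot(byDefault));
        System.out.println(describeRoot(byPrefix));
    }
}
```

Both calls describe the same qualified name, even though one document uses the default namespace and the other a prefix.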

Internationalisation

What does Text

"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so its encoding should be declared
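
The difference between a character set and its encodings shows up as soon as the same string is turned into bytes. A small sketch using only the Java standard library follows; EncodingSizes and byteCount are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSizes {

    /** Number of bytes needed to store the text in the given encoding. */
    static int byteCount(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        // "Università": ten characters, the last one outside ASCII.
        String text = "Universit\u00e0";
        // UTF-8 uses one byte per ASCII character and two for the accented one.
        System.out.println(byteCount(text, StandardCharsets.UTF_8));      // 11
        // UTF-16BE uses two bytes per character here (no byte-order mark).
        System.out.println(byteCount(text, StandardCharsets.UTF_16BE));   // 20
        // Latin-1 uses exactly one byte per character.
        System.out.println(byteCount(text, StandardCharsets.ISO_8859_1)); // 10
    }
}
```

The same ten code points give three different byte sequences, which is why a parser needs to be told which encoding was used.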

The XML Encoding Declaration

Part of the XML declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-defined character sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is then a text declaration, no longer an XML declaration

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard two-letter codes for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there either, such as for user-defined tags
    prefix x-
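
A tool such as the spell-checker mentioned here has to find the language that applies to a given element. Since an xml:lang value also covers the content of the element that carries it, one simple approach is to look at the element and then at its ancestors. The sketch below assumes the JDK's DOM parser; XmlLangLookup, languageOf, and languageOfFirst are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XmlLangLookup {

    static final String SAMPLE =
            "<doc xml:lang=\"en\"><title>Players</title>"
            + "<quote xml:lang=\"it\"><line>Forza Bologna</line></quote></doc>";

    /** Returns the nearest xml:lang value on the element or its ancestors, or "". */
    static String languageOf(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute("xml:lang")) {
                return e.getAttribute("xml:lang");
            }
        }
        return ""; // no language declared anywhere up to the root
    }

    /** Convenience wrapper: language of the first element with the given tag name. */
    static String languageOfFirst(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return languageOf((Element) doc.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(languageOfFirst(SAMPLE, "title")); // inherited from doc
        System.out.println(languageOfFirst(SAMPLE, "line"));  // inherited from quote
    }
}
```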

Encoding for

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents
W3C standard
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS with XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML Documents

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used, hated, deprecated, widespread are DOM and SAX

Document Object Model

http://www.w3.org/DOM
  W3C standard, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals, and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should or could have

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
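
The general node operations named here, such as appendChild and the attributes property, look as follows in the Java binding of DOM. This is a minimal sketch using the parser bundled with the JDK; DomUpdate, parse, and addTeam are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomUpdate {

    /** Parses a string into a DOM document held entirely in memory. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends a new team element after the other children of the root. */
    static void addTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.setTextContent(name);
        doc.getDocumentElement().appendChild(team);
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>");
        addTeam(doc, "Mantova", "no");
        Node last = doc.getDocumentElement().getLastChild();
        // Every node has a name; attributes are reached through getAttributes().
        System.out.println(last.getNodeName() + " "
                + last.getAttributes().getNamedItem("current").getNodeValue() + " "
                + last.getTextContent());
    }
}
```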

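A Simple Java DOM Example

  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;   // Xerces DOM parser
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                          // parse the file named on the command line
      Document doc = p.getDocument();
      // find the first child of the root element named "recipe"
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                           // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }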

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as the document starts or ends, elements start or end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well supported as DOM is
  in terms of standardisation
  as well as of tools
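
The event-based model can be made concrete by logging the callbacks a SAX parser issues while the document flows through it. The sketch below assumes the SAX parser bundled with the JDK; SaxEvents and events are illustrative names. A parser may deliver character content in several chunks, so the text entries are shown only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEvents {

    /** Parses the document and returns the sequence of SAX events as strings. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text " + text);
                }
            }
        };
        try {
            // No tree is built: the handler sees each event once, in document order.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

Unlike the DOM example, nothing of the document is kept in memory unless the handler chooses to keep it.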


How XML Looks like

12

How to Work with

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML It can be

A text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

14

How does XML Who handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets

15

Where is XML

Everywhere already

16

Some History of XML Lot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGML

but had obvious limitationstoo complex

more than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

18

A Simple XML

ltplayergt Carlo Nervoltplayergt

19

XML Document amp

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervo

tags are markupthe most common form of markup but there are other kinds

content is character dataincluding the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed

24

Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML AttributesElements can be labelled by attributes

attributes are specified in the start tagand in the only tag of empty elements

any number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=value

alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML document

but they are qualifiers for the nodes and

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the two

Element-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore

28

Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsed

unless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

CommentsEasy

lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations

So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes are applicable at any level
ANY for free-form content
EMPTY for empty elements

Attribute Declarations

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
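The effect of attribute defaults can be observed from a program. The sketch below is an assumption-laden variation on the slide's DTD: `current` gets the literal default "no", and two attributes that do not appear in the slides (`sport`, `since`) are added only to show #FIXED and #IMPLIED. It relies on the JDK's built-in parser reading the internal DTD subset, which it does even without validation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class AttributeDefaultSketch {

    // "sport" and "since" are hypothetical attributes, added for illustration only.
    static final String DOC =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (team+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\"\n"
        + "                 sport   CDATA      #FIXED \"football\"\n"
        + "                 since   CDATA      #IMPLIED>\n"
        + "]>\n"
        + "<player><team current=\"yes\">Bologna</team><team>Mantova</team></player>";

    /** Parses DOC and returns the index-th team element. */
    static Element team(int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(DOC)));
            return (Element) doc.getElementsByTagName("team").item(index);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element mantova = team(1); // written as <team>Mantova</team>, with no attributes
        System.out.println("current = " + mantova.getAttribute("current"));    // literal default
        System.out.println("sport   = " + mantova.getAttribute("sport"));      // #FIXED value
        System.out.println("since present? " + mantova.hasAttribute("since")); // #IMPLIED, absent
    }
}
```

The second team element carries no attributes in the text, yet the parser reports the defaulted and fixed values as if they had been written; the #IMPLIED attribute simply stays absent.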

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS

Other DTD Declarations

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces

What are Namespaces

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespaces

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but svg, rdf and other well-known prefixes are usually kept by convention

Setting Default Namespaces

xmlns attribute
  alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
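To see how prefixes and default namespaces resolve, the sketch below parses a small made-up document with a namespace-aware JAXP parser and lists every element as {namespace URI}local name. The document, the `m` prefix and the class name are assumptions for illustration; the SVG and MathML namespace URIs are the official W3C ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NamespaceSketch {

    // "set" appears three times: with the svg prefix, with the m prefix, and under a default namespace.
    static final String DOC =
          "<doc xmlns:svg=\"http://www.w3.org/2000/svg\""
        + " xmlns:m=\"http://www.w3.org/1998/Math/MathML\">"
        + "<svg:set/><m:set/>"
        + "<svg xmlns=\"http://www.w3.org/2000/svg\"><set/></svg>"
        + "</doc>";

    /** Returns "{namespace}localName" for every element, in document order. */
    static List<String> expandedNames() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(DOC)));
            NodeList all = doc.getElementsByTagName("*");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Node n = all.item(i);
                String uri = n.getNamespaceURI() == null ? "" : n.getNamespaceURI();
                names.add("{" + uri + "}" + n.getLocalName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        expandedNames().forEach(System.out::println);
    }
}
```

The prefixed svg:set and the unprefixed set inside the default-namespace svg element end up with the same expanded name, while m:set stays distinct; the prefix itself plays no role in the comparison.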

Internationalisation

What does Text

"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
    ASCII being the most (un)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
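The difference between a character set (characters to code points) and an encoding (code points to bytes) can be made concrete with a few lines of Java. This is only an illustrative sketch: the class name and the sample word are chosen for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingSketch {

    // The Italian word for "university"; the accented last letter is written as an escape.
    static final String TEXT = "Universit\u00E0";

    /** Character set view: one integer (code point) per character. */
    static int[] codePoints(String text) {
        return text.codePoints().toArray();
    }

    /** Encoding view: how many bytes the same code points take in a given encoding. */
    static int byteLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        System.out.println("code points: " + Arrays.toString(codePoints(TEXT)));
        System.out.println("UTF-8 bytes:      " + byteLength(TEXT, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE bytes:   " + byteLength(TEXT, StandardCharsets.UTF_16BE));
        System.out.println("ISO-8859-1 bytes: " + byteLength(TEXT, StandardCharsets.ISO_8859_1));
    }
}
```

The ten code points are the same in every case; only the byte sequences differ, which is why a parser needs to be told (or to detect) which encoding a document uses.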

The XML Encoding Declaration

Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-

Encoding for

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML Documents

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should or could have

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
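The general node properties and methods listed above map directly onto the org.w3c.dom interfaces bundled with the JDK. The following sketch is one possible illustration (class and method names are invented here): it parses a small player document, appends a new team element, and reads back name, type and attributes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomNodeSketch {

    /** Loads the whole document in memory as a DOM tree. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends <team current="yes|no">name</team> after the other children of the root. */
    static void addTeam(Document doc, String name, boolean current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current ? "yes" : "no");
        team.appendChild(doc.createTextNode(name));
        doc.getDocumentElement().appendChild(team);
    }

    /** Every node has a name and a type, whatever its kind. */
    static String describe(Node n) {
        return n.getNodeName() + " (type " + n.getNodeType() + ")";
    }

    public static void main(String[] args) {
        Document doc = parse("<player><name>Carlo</name></player>");
        addTeam(doc, "Bologna", true);
        Node last = doc.getDocumentElement().getLastChild();
        System.out.println(describe(last));
        System.out.println("attributes: " + last.getAttributes().getLength());
        System.out.println(describe(last.getFirstChild())); // the text node inside team
    }
}
```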

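A Simple Java DOM Example

    // DOMParser is presumably the Apache Xerces class org.apache.xerces.parsers.DOMParser;
    // print(Node, PrintStream) is a helper that is not shown on the slide.
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // look for the first recipe element among the children of the root
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }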

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as it goes: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
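The event-based model can be sketched with the SAX parser bundled with the JDK. The handler below simply records each event as the document flows through the parser; the class name and the log format are made up for this example, and nothing beyond the current event is kept in memory by the parser itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventSketch {

    /** Parses the document and returns the sequence of SAX events, in order. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // Character content may arrive in several chunks; white-space-only chunks are skipped.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>")
            .forEach(System.out::println);
    }
}
```

Compared with DOM, the application has to keep whatever state it needs by itself, since no tree is built.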


How XML Looks like

How to Work with XML

XML is text
  so any text editor is perfectly fine
A number of XML editors are around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works

What is an XML Document

It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text

    <student>
      <studentname>
        <name>Carlo</name>
        <surname>Nervo</surname>
      </studentname>
      <studentnumber>0000145678</studentnumber>
      <course>2036</course>
    </student>

How does XML Work

Who handles XML documents
  after they have been produced? how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets

Where is XML

Everywhere already

Some History of XML

Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
    but had obvious limitations
  SGML itself was too complex
    more than 150 pages
    never implemented fully
    too complex for the Internet
SGML "Lite" (1996, Bosak, Bray, et al.)
  XML 1.0 (February 1998)
Then, a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

XML Fundamentals

A Simple XML Document

    <player>
      Carlo Nervo
    </player>

XML Document &

This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

    <player>
      Carlo Nervo
    </player>

XML Elements & Tags

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

    <player>
      Carlo Nervo
    </player>

Tag Syntax

Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus, pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

    player
      name: Carlo
      surname: Nervo
      team: Bologna
      team: Mantova

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

An XML Document is a Tree

An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
    each element, except the root, has exactly one parent
    child elements from the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Narrative-Organised XML Documents

    <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.

    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

    …

    </biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and …

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Using Elements or Attributes

Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any …

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes" value="Bologna" />
      <team current="no" value="Mantova" />
    </player>

XML Names

XML names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin or even non-latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore
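One quick way to try these naming rules out, assuming the DOM implementation bundled with the JDK, is to ask a Document to create an element with a candidate name: the DOM specification says `createElement` raises a DOMException (INVALID_CHARACTER_ERR) when the name is not a legal XML name. The class and method names below are made up for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

final class XmlNameSketch {

    /** True if the DOM implementation accepts the string as an element name. */
    static boolean isAcceptedAsElementName(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.createElement(name); // throws DOMException for illegal XML names
            return true;
        } catch (DOMException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"team", "_team-1.a", "1team", "my team", "team!"}) {
            System.out.println(candidate + " -> " + isAcceptedAsElementName(candidate));
        }
    }
}
```

Names starting with a digit, containing white space, or containing other punctuation are rejected, in line with the rules listed in the slide.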

Parsed Character Data

An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
    Batman & Robin

Entity References

&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML Entities

    Markup    Entity    Description
    &lt;      <         less-than
    &gt;      >         greater-than
    &amp;     &         ampersand
    &quot;    "         double quote
    &apos;    '         single quote
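What the parser hands back for entity references can be checked with a few lines of Java. The sketch below uses the JDK's DOM parser; the class name and the user-defined `team` entity are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntitySketch {

    /** Parses the document and returns the character data of the root element. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Pre-defined entities
        System.out.println(textOf("<superheroes>Batman &amp; Robin</superheroes>"));
        System.out.println(textOf("<m>1 &lt; 2 &gt; 0 &quot;q&quot; &apos;a&apos;</m>"));
        // A user-defined entity (hypothetical), declared in an internal DTD subset
        System.out.println(textOf("<!DOCTYPE p [<!ENTITY team \"Bologna FC\">]><p>&team;</p>"));
    }
}
```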

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser: do not parse this
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own closing delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
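The equivalence between a CDATA section and escaped character data can be verified as follows. This is a minimal sketch with an invented class name and a made-up code fragment as content; it only compares the character data that the JDK's DOM parser returns in the two cases.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataSketch {

    // The same content, once inside a CDATA section and once with escaped markup characters.
    static final String WITH_CDATA = "<code><![CDATA[if (a < b && b < c) run();]]></code>";
    static final String ESCAPED = "<code>if (a &lt; b &amp;&amp; b &lt; c) run();</code>";

    /** Parses the document and returns the character data of the root element. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(WITH_CDATA));
        System.out.println(textOf(ESCAPED));
        System.out.println("same text: " + textOf(WITH_CDATA).equals(textOf(ESCAPED)));
    }
}
```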

Comments

Easy
    <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents, use processing instructions

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
    <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root
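Unlike comments, processing instructions are reported by the parser with their target and data, so an application can pick out the ones addressed to it. The sketch below lists the top-level processing instructions of a small made-up document using the JDK's DOM parser; the class name, the `app` target and the style sheet file name are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

final class ProcessingInstructionSketch {

    // One processing instruction before the root and one after it.
    static final String DOC =
          "<?xml version=\"1.0\"?>"
        + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
        + "<player>Carlo Nervo</player>"
        + "<?app done?>";

    /** Returns "target -> data" for each processing instruction outside the root element. */
    static List<String> instructions() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(DOC)));
            List<String> found = new ArrayList<>();
            NodeList top = doc.getChildNodes();
            for (int i = 0; i < top.getLength(); i++) {
                Node n = top.item(i);
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    found.add(pi.getTarget() + " -> " + pi.getData());
                }
            }
            return found;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        instructions().forEach(System.out::println);
    }
}
```

The XML declaration on the first line does not show up in the list, even though it looks like a processing instruction.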

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it should be the first thing in the document, absolutely
    not even comments allowed before
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the character encoding of the text (UTF-8, a Unicode encoding, in the example)
  optional, default Unicode (UTF-8, or UTF-16 with a byte-order mark)
Standalone means that it has no external DTD
  optional, default no

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
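Any XML parser can serve as a well-formedness checker, since it must refuse documents that break these rules. As a rough sketch (class and method names are invented), the JDK's non-validating DOM parser can be wrapped like this:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormednessSketch {

    /** True if the parser accepts the text as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to the console.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness rule was violated
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<player>Carlo Nervo</player>",        // fine
            "<em><b>Wrong </em> XML</b>",          // overlapping elements
            "<player></Player>",                   // start and end tags do not match
            "<a/><b/>",                            // two root elements
            "<team current=yes/>",                 // unquoted attribute value
            "<team current='yes' current='no'/>",  // duplicated attribute
            "<p>a & b</p>"                         // unescaped ampersand
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```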

DTD

Flexibility or

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, whereas well-formedness is
  how to handle validity errors is optional

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone else

like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute value

NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFS

48

Other DTD

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

50

What are Distinguish

different XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes Example

a large firm could have a number of namespaces for different purposes

ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not

53

Setting Default

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tags

prefix x-58

Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue

it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portability

across platforms across applications across time

59

XML amp CSS

60

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Cascading Style Sheets (CSS)

a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basics

if not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tags

nometag attributo1 valore1 hellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sorts

through ad hoc APIThe most used hated deprecated widespread are

69

Document Object httpwwww3orgDOM

standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

Main Problem of DOM

The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML (SAX)

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text document
  flowing through the SAX parser
  and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well-supported as DOM is
  in terms of standardisation
  as well as of tools
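
The event-based model can be sketched with the SAX parser that ships with the JDK. The class SaxEventsDemo and the log format are assumptions made for illustration; the handler simply records each event in the order the parser fires it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: not part of the slides.
class SaxEventsDemo {
    // Streams the document through a SAX parser and logs every event received.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser is allowed to deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built here: once an event has been handled, the parser keeps nothing about it, which is where the memory savings over DOM come from.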


How to Work with XML

XML is text
  so any text editor is perfectly fine
A number of XML editors are around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works

What is an XML Document?

It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text

  <student>
    <studentname>
      <name>Carlo</name>
      <surname>Nervo</surname>
    </studentname>
    <studentnumber>0000145678</studentnumber>
    <course>2036</course>
  </student>
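
As an illustration of "a run-time construction in memory", the sketch below builds part of the student document as a DOM tree and only then turns it into text. It uses the standard JAXP APIs; the class name InMemoryStudent and the method buildStudentXml are invented for this example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: not part of the slides.
class InMemoryStudent {
    // Builds <student><studentname>...</studentname></student> in memory and serialises it.
    static String buildStudentXml(String name, String surname) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element student = doc.createElement("student");
            doc.appendChild(student);
            Element studentName = doc.createElement("studentname");
            student.appendChild(studentName);
            Element first = doc.createElement("name");
            first.setTextContent(name);
            studentName.appendChild(first);
            Element last = doc.createElement("surname");
            last.setTextContent(surname);
            studentName.appendChild(last);

            // The identity transformer copies the DOM tree to a character stream.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildStudentXml("Carlo", "Nervo"));
    }
}
```

The same tree could just as well be written to a file, stored in a database column, or sent over a socket: the document is the structure, not the container it happens to live in.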

How does XML Work?

Who handles XML documents
  after they have been produced?
  how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets

Where is XML?

Everywhere already

Some History of XML

Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but SGML had obvious limitations
    too complex
      more than 150 pages
      never implemented fully
    too complex for the Internet
SGML "Lite" (1996, Bosak, Bray, et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

XML Fundamentals

A Simple XML Document

  <player>
    Carlo Nervo
  </player>

XML Document & Files

This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

  <player>
    Carlo Nervo
  </player>

XML Elements & Tags

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

  <player>
    Carlo Nervo
  </player>

Tag Syntax

Very similar to HTML tags
  at least superficially
  <tag> for start tags, </tag> for end tags
  <tag /> for empty tags
    tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

  player
    name: Carlo
    surname: Nervo
    team: Bologna
    team: Mantova

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

An XML Document is a Tree

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
    each element, except the root, has exactly one parent
    child elements of the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed
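
The parent / child / sibling vocabulary maps directly onto DOM navigation calls. The following sketch (class TreeWalk and its methods are hypothetical names) walks the player document with getFirstChild, getNextSibling and getParentNode.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of the slides.
class TreeWalk {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Names of the element children of the root, in document order (siblings of each other).
    static List<String> childrenOfRoot(String xml) {
        List<String> names = new ArrayList<>();
        Node root = parse(xml).getDocumentElement();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) names.add(n.getNodeName());
        }
        return names;
    }

    // Name of the parent of the first element with the given tag.
    static String parentOfFirst(String xml, String tag) {
        return parse(xml).getElementsByTagName(tag).item(0).getParentNode().getNodeName();
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><surname>Nervo</surname>"
                   + "<team current=\"yes\">Bologna</team><team current=\"no\">Mantova</team></player>";
        System.out.println(childrenOfRoot(xml));
        System.out.println(parentOfFirst(xml, "team"));
    }
}
```

In the DOM, the root element itself does have a parent node, the Document node, but that node is not an element.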

Narrative-Organised XML Documents

  <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.

    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

    …
  </biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
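
Why mixed content is awkward for automated processing can be seen by listing the children of such an element: text nodes and element nodes alternate, so a program cannot simply iterate over "the fields". A minimal sketch, with the hypothetical class MixedContent and a shortened biography:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of the slides.
class MixedContent {
    // Describes each child of the root as "text" or "element:<name>".
    static List<String> childKinds(String xml) {
        List<String> kinds = new ArrayList<>();
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    kinds.add("element:" + n.getNodeName());
                } else if (n.getNodeType() == Node.TEXT_NODE) {
                    kinds.add("text");
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return kinds;
    }

    public static void main(String[] args) {
        String xml = "<biography><name>Carlo Nervo</name> moved from "
                   + "<football_team>Mantova</football_team> to "
                   + "<football_team>Bologna</football_team>.</biography>";
        System.out.println(childKinds(xml));
    }
}
```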

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and…

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
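
Since attributes qualify nodes without adding elements to the tree, a program typically uses them to select among sibling elements. The sketch below (class AttributeDemo is a made-up name) picks the team whose current attribute is "yes"; note that the second team uses single quotes, one of the alternative forms mentioned above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of the slides.
class AttributeDemo {
    // Returns the text of the first <team> whose "current" attribute equals "yes", or null.
    static String currentTeam(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList teams = doc.getElementsByTagName("team");
            for (int i = 0; i < teams.getLength(); i++) {
                Element team = (Element) teams.item(i);
                if ("yes".equals(team.getAttribute("current"))) {
                    return team.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><surname>Nervo</surname>"
                   + "<team current=\"no\">Mantova</team><team current = 'yes'>Bologna</team></player>";
        System.out.println(currentTeam(xml));
    }
}
```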

Using Elements or Attributes?

"Attributes are for meta-data about the element, and content is information of the element"
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any…

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>

XML Names

XML names are used, and are the same, for the names of elements, attributes, and some other constructs
  to increase efficiency and reduce complexity
An XML name can include
  any letter
    Latin, or even non-Latin, like ideographs
  any digit
  underscore, hyphen, and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore
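
One way to try these naming rules without writing the grammar by hand is to ask a DOM implementation to create an element: Document.createElement is specified to raise a DOMException when the name contains an illegal character. The sketch assumes the JDK's default DOM implementation, which performs this check; the class XmlNameCheck is a hypothetical name.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical demo class: not part of the slides.
class XmlNameCheck {
    // True if the DOM implementation accepts the candidate as an element name.
    static boolean isXmlName(String candidate) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.createElement(candidate); // throws DOMException for an illegal name
            return true;
        } catch (DOMException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] candidates = { "player", "_team-1.b", "1player", "my player", "team!" };
        for (String c : candidates) {
            System.out.println(c + " : " + isXmlName(c));
        }
    }
}
```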

Parsed Character Data

An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse start again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
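
The superheroes example can be checked directly: after parsing, the application sees the plain character, not the escape sequence. A small sketch with the JDK parser (EscapeDemo and textOf are made-up names; textOf returns null when the input is not well-formed):

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: not part of the slides.
class EscapeDemo {
    // Parsed character data of the root element, or null if the input is not well-formed.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // no console output on fatal errors
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf("<superheroes>Batman &amp; Robin</superheroes>"));
        System.out.println(textOf("<code>a &lt; b</code>"));
        System.out.println(textOf("<superheroes>Batman & Robin</superheroes>")); // unescaped &
    }
}
```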

Entity References

&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
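
As a preview of user-defined entities, the sketch below declares an internal entity in the document's DTD and lets the parser expand the reference. The entity name club and its value are invented for the example, as is the class EntityDemo; the behaviour relies on the JDK parser's default of expanding entity references.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of the slides.
class EntityDemo {
    // Returns the text of the root element after entity references have been expanded.
    static String expandedText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE player [ <!ENTITY club \"Bologna FC\"> ]>"
                   + "<player>&club; forever</player>";
        System.out.println(expandedText(xml));
    }
}
```

The entity is defined once, outside the element tree, and can then be referenced any number of times in the content.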

Pre-defined XML Entities

  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good for instance to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
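
A CDATA section can be tried with the JDK's DOM parser. The character data delivered to the application is the same as if every < and & had been escaped; a DOM parser may still keep a distinct CDATA node unless asked to coalesce it into ordinary text, which the sketch also shows. The class CdataDemo and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of the slides.
class CdataDemo {
    static Document parse(String xml, boolean coalescing) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(coalescing); // true: CDATA sections become plain text nodes
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Character data of the root element, markup-like characters included.
    static String text(String xml) {
        return parse(xml, false).getDocumentElement().getTextContent();
    }

    // DOM node type of the first child of the root element.
    static short firstChildType(String xml, boolean coalescing) {
        return parse(xml, coalescing).getDocumentElement().getFirstChild().getNodeType();
    }

    public static void main(String[] args) {
        String xml = "<script><![CDATA[if (a < b && c) run();]]></script>";
        System.out.println(text(xml));
        System.out.println(firstChildType(xml, false) == org.w3c.dom.Node.CDATA_SECTION_NODE);
        System.out.println(firstChildType(xml, true) == org.w3c.dom.Node.TEXT_NODE);
    }
}
```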

Comments

Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to a computational agent: processing instructions

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very purpose
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root
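
In a DOM tree, processing instructions survive parsing as nodes of their own, each with a target and a data string. The sketch lists those found at document level, that is, before or after the root; the class PiDemo and the file name player.css are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of the slides.
class PiDemo {
    // Lists "target -> data" for each processing instruction outside the root element.
    static List<String> topLevelInstructions(String xml) {
        List<String> found = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    found.add(pi.getTarget() + " -> " + pi.getData());
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                   + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
                   + "<player>Carlo Nervo</player>";
        System.out.println(topLevelInstructions(xml));
    }
}
```

The leading <?xml … ?> line is not reported: it is the XML declaration, not a processing instruction.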

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be the very first thing in the document
    not even comments are allowed before it
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the encoding of the text (UTF-8, a Unicode encoding, in the example)
  optional, default Unicode (UTF-8 or UTF-16)
Standalone means that it has no external DTD
  optional, default no

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
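
Besides Web tools, any XML parser can serve as a well-formedness checker: a violation of the rules above is a fatal error, and the parser refuses to go on. A minimal sketch with the JDK's non-validating parser (the class WellFormedCheck is a made-up name):

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: not part of the slides.
class WellFormedCheck {
    // True if the parser accepts the text as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the console quiet.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<player>Carlo Nervo</player>",        // fine
            "<em><b>Wrong </em> XML</b>",          // overlapping elements
            "<name>Carlo</name><name>Nervo</name>", // two root elements
            "<team current=yes>Bologna</team>",    // unquoted attribute value
            "<team current=\"yes\" current=\"no\"/>" // duplicated attribute
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```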

DTD

Flexibility or

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like "a football player should have at least one team"
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle errors is optional

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example below and comment

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

DTD Declaration

DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external, shared resource
    <!DOCTYPE player SYSTEM "http://…">

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

DTD Declarations

So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document, and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
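
The effect of these declarations can be observed with a validating parser. The sketch below uses the JDK's JAXP parser with validation switched on and collects the validity errors it reports; the class DtdValidation is a made-up name, and the DOCTYPE is assumed to name player, the root element, as the validity rules require.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: not part of the slides.
class DtdValidation {
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Returns the validity problems reported for DTD + body (empty list: valid).
    static List<String> validate(String body) {
        final List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (SAXException e) {
            problems.add("not well-formed: " + e.getMessage());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate("<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>"));
        // No team at all: violates team+
        System.out.println(validate("<player><name>Carlo</name><surname>Nervo</surname></player>"));
    }
}
```

Validity errors are recoverable: the parser reports them and carries on, which is why they are collected through the error callback rather than caught as exceptions.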

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty element

Attribute Declarations

A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
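
The literal default can be seen at work with a variant of the team declaration in which current defaults to "no" instead of being required. This variant DTD is an assumption made for the example, as is the class AttributeDefaultDemo; the parser supplies the default value for any team that omits the attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of the slides.
class AttributeDefaultDemo {
    static final String XML =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (team+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\">\n" // literal default value
        + "]>\n"
        + "<player><team>Mantova</team><team current=\"yes\">Bologna</team></player>";

    // Value of the "current" attribute of the index-th team, as seen after parsing.
    static String currentOf(int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            Element team = (Element) doc.getElementsByTagName("team").item(index);
            return team.getAttribute("current");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Mantova: current=" + currentOf(0)); // attribute omitted in the document
        System.out.println("Bologna: current=" + currentOf(1)); // attribute given explicitly
    }
}
```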

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS

Other DTD Declarations

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces

What are Namespaces for?

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespaces

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes with URIs

Example
  a large firm could have a number of namespaces for different purposes

  <company xmlns:local="http://www.company.it/xml"
           xmlns:euro="http://www.company.eu/xml"
           xmlns:world="http://www.company.com/xml">

  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
URIs are standardised, not prefixes
  but usually svg, rdf, and other prefixes are not…

Setting Default Namespaces

xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need for the svg prefix made explicit
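
Both prefixed and default namespaces can be inspected with a namespace-aware DOM parser. In the sketch, the company URIs are reused from the firm example purely for illustration, the element names office and branch are invented, and NamespaceDemo is a hypothetical class name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of the slides.
class NamespaceDemo {
    static final String XML =
          "<company xmlns=\"http://www.company.com/xml\""
        + " xmlns:local=\"http://www.company.it/xml\">"
        + "<local:office/><branch/></company>";

    static Node first(String qualifiedName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagName(qualifiedName).item(0);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Namespace URI of the first element with the given qualified name.
    static String namespaceOf(String qualifiedName) {
        return first(qualifiedName).getNamespaceURI();
    }

    // Local part of the first element with the given qualified name.
    static String localPartOf(String qualifiedName) {
        return first(qualifiedName).getLocalName();
    }

    public static void main(String[] args) {
        System.out.println(namespaceOf("local:office")); // bound through the local prefix
        System.out.println(localPartOf("local:office"));
        System.out.println(namespaceOf("branch"));       // falls under the default namespace
    }
}
```

What identifies the namespace is the URI: renaming the local prefix in the document would change nothing for a namespace-aware application.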

Internationalisation

What does Text Mean?

"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
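
The difference between a code point and its encodings can be made concrete with a single accented letter. The sketch takes à (U+00E0) and counts the bytes it occupies under three encodings available in the Java standard library; EncodingDemo is a made-up class name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: not part of the slides.
class EncodingDemo {
    // Number of bytes the text occupies once encoded with the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String aGrave = "\u00E0"; // the letter à, written as an escape to stay source-encoding neutral
        System.out.println("code point: " + aGrave.codePointAt(0));
        System.out.println("ISO-8859-1 bytes: " + byteLength(aGrave, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 bytes: " + byteLength(aGrave, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE bytes: " + byteLength(aGrave, StandardCharsets.UTF_16BE));
    }
}
```

The code point is the same in all three cases; only the byte sequence changes, which is why a parser needs to be told, or needs to detect, which encoding a document uses.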

The XML Encoding Declaration

Part of the XML declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-

Encoding for

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
  defining presentation style for the XML document tags
    tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML

Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
  access, delete, modify
  parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model (DOM)

http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 14: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

What is an XML It can be

A text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

14

How does XML Who handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets

15

Where is XML

Everywhere already

16

Some History of XML Lot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGML

but had obvious limitationstoo complex

more than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

18

A Simple XML

ltplayergt Carlo Nervoltplayergt

19

XML Document amp

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervo

tags are markupthe most common form of markup but there are other kinds

content is character dataincluding the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed

24

Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML AttributesElements can be labelled by attributes

attributes are specified in the start tagand in the only tag of empty elements

any number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=value

alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML document

but they are qualifiers for the nodes and

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the two

Element-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore

28

Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsed

unless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

CommentsEasy

lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
There are many things a DTD does not say
    we stick with what we can specify

39

DTD is…
SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them and how to write your own is likely to be useful, or maybe necessary, for a while still

40

A Simple DTD

We do not go too deep into DTD syntax
    we just look at the following example and comment on it

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

41

DTD Declaration

In the example, the DTD is declared as internal
    the name after DOCTYPE must match the type of the root element (player)
    but the DTD could be declared separately
        <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
        <!DOCTYPE player SYSTEM "http://…">


42

DTD Declarations

So you may
    define your own DTD, and
        either include it in your XML document
        or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
        like a working group you belong to, or a standardisation body of any sort
        by referring to that externally-defined syntax from your XML docs

43

Element Declarations

A player element contains one name, one surname and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)


44

Some Syntax
, is for sequence
    to define ordered lists
| is for choice
    to provide for alternatives
suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
parentheses for grouping
    at any level of nesting
    operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements

45

Attribute Declarations

A team element has a current attribute
    which is mandatory (#REQUIRED)
        #IMPLIED would say optional instead
    and can be either yes or no
        enumeration as an attribute type

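
What the element and attribute declarations buy in practice can be sketched with a validating parser. The example below assumes the JDK's standard DOM parser with validation switched on, and a DOCTYPE named after the root element player; class and method names are invented for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named only for this example.
class DtdValidation {

    // Internal DTD subset with the declarations discussed in the slides.
    static final String DTD =
              "<!DOCTYPE player [\n"
            + "  <!ELEMENT player (name, surname, team+)>\n"
            + "  <!ELEMENT name (#PCDATA)>\n"
            + "  <!ELEMENT surname (#PCDATA)>\n"
            + "  <!ELEMENT team (#PCDATA)>\n"
            + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
            + "]>\n";

    // Returns true when the given document body is valid against DTD.
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Validity errors are recoverable: turn them into failures.
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>"));
        System.out.println(isValid("<player><name>Carlo</name><surname>Nervo</surname></player>"));
    }
}
```

Validity errors reach the application through the error callback, which is why the handler has to decide what to do with them: here they are simply treated as failures.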

46

Attribute Defaults

#IMPLIED
    the attribute is optional
#REQUIRED
    the attribute is mandatory
#FIXED
    whether it is explicitly specified or not, the attribute has a given value
literal
    the default value is the literal quoted string
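
The effect of attribute defaults can be observed after parsing. The sketch below uses the JDK's standard DOM parser; the sport and coach attributes, like the class and method names, are assumptions introduced only to show one attribute per kind of default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, named only for this example.
class AttributeDefaultsDemo {

    // current has a literal default, sport is #FIXED, coach is #IMPLIED
    // (sport and coach are invented for this illustration).
    static final String DTD =
              "<!DOCTYPE team [\n"
            + "  <!ELEMENT team (#PCDATA)>\n"
            + "  <!ATTLIST team current (yes | no) \"no\">\n"
            + "  <!ATTLIST team sport CDATA #FIXED \"football\">\n"
            + "  <!ATTLIST team coach CDATA #IMPLIED>\n"
            + "]>\n";

    // Returns the value of the named attribute of the root element as seen
    // by the application after parsing, or null when it is absent.
    static String attributeOfRoot(String body, String name) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DTD + body)))
                    .getDocumentElement();
            return root.hasAttribute(name) ? root.getAttribute(name) : null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String body = "<team>Mantova</team>";
        System.out.println("current = " + attributeOfRoot(body, "current"));
        System.out.println("sport   = " + attributeOfRoot(body, "sport"));
        System.out.println("coach   = " + attributeOfRoot(body, "coach"));
    }
}
```

The parser supplies the default and fixed values even though the document does not write them, while an omitted #IMPLIED attribute simply stays absent.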

47

Attribute Types
CDATA
    any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
    more permissive than an XML name: any name character is accepted as the first character
    the plural form accepts more than one token, separated by white space
ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
ID
    an XML name, unique in the document, working as an identifier
IDREF, IDREFS
    reference(s) to the ID of some element in the document

48

Other DTD Declarations

ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
    who cares, actually
We stop here
    more only for those who need it

49

Namespaces

50

What are Namespaces?
Distinguish
    different XML applications may use the same names
    at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
Group
    names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names

51

Syntax for Namespaces
Qualified names
    or QNames, or raw names
    prefix:local_part
Examples of qualified names
    rdf:description, xlink:type, xsl:template
Used for both element and attribute names

52

Associating Prefixes
Example
    a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
    but the usual prefixes, such as svg and rdf, are normally kept as they are

53

Setting Default Namespaces
xmlns attribute
    alone, with no :prefix suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
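
How namespaces disambiguate the set element of SVG from the one of MathML can be sketched with a namespace-aware parser. The example assumes the JDK's standard DOM parser; the doc wrapper element, the m prefix and the class and method names are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, named only for this example.
class NamespaceDemo {

    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    // Counts the elements with local name "set" in the given namespace
    // ("*" stands for any namespace).
    static int countSets(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, "set").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String mixed = "<doc xmlns:svg=\"" + SVG + "\" xmlns:m=\"" + MATHML + "\">"
                + "<svg:set/><m:set/><m:set/></doc>";
        System.out.println("SVG sets:    " + countSets(mixed, SVG));
        System.out.println("MathML sets: " + countSets(mixed, MATHML));

        // With a default namespace no prefix is needed at all.
        String plain = "<svg xmlns=\"" + SVG + "\"><set/></svg>";
        System.out.println("SVG sets:    " + countSets(plain, SVG));
    }
}
```

The lookup is done by namespace URI and local name, never by prefix, which is why the prefix itself can change freely.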

54

Internationalisation

55

What does Text Mean?
"Text" can be encoded according to many different alphabets
    character set: a mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
    so its encoding should be declared
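
The difference between a code point and its encodings can be checked directly. The sketch below only uses java.nio.charset from the JDK; the sample word and the class and method names are chosen for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named only for this example.
class EncodingDemo {

    // Number of bytes needed to store the text in the given encoding.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "Università": the last character is written as a Unicode escape so
        // that the example does not depend on the encoding of this source file.
        String word = "Universit\u00E0";
        System.out.println("characters: " + word.length());
        System.out.println("code point of the last one: " + word.codePointAt(9));
        System.out.println("UTF-8 bytes:      " + encodedLength(word, StandardCharsets.UTF_8));
        System.out.println("UTF-16 bytes:     " + encodedLength(word, StandardCharsets.UTF_16BE));
        System.out.println("ISO-8859-1 bytes: " + encodedLength(word, StandardCharsets.ISO_8859_1));
    }
}
```

The same ten characters take a different number of bytes in each encoding, which is why a reader of the bytes needs to know which encoding was used.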

56

The XML Encoding Declaration
Part of the XML declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
    Unicode and ISO are the most used families
Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
        it is then a text declaration, no longer an XML declaration

57

Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
    for multi-lingual docs
xml:lang attribute
    can be associated to any element
    determines the language of the element
Values are to be found in ISO 639
    standard two letters for each language known
    if not there, IANA
        prefix i-
        such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
        prefix x-

58

Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
    it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
    which are often not easy to catch
XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time

59

XML & CSS

60

XML on Browsers

Different experiences with different browsers
    when trying to visualise an XML document
XML, however, can be transformed
    to become easier to handle by standard browsers
Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
In the following, we explore the XML + CSS approach

61

Cascading Style Sheets
Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
    http://www.w3.org/Style/CSS/
Goals
    describing how to present elements of a document
    spanning over a range of different media
    separating style description from content and structure
In this course, we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning

62

CSS: An Example

63

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
Processing instruction
    to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }

64

XML + CSS Example: The XML Doc

65

Example: How Mozilla Visualises it

66

Example: How Mozilla Visualises it

67

DOM & SAX

68

Manipulating XML
Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access, delete, modify)
    parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
    through ad hoc APIs
The most used, hated, deprecated, widespread are
    DOM and SAX

69

Document Object Model
http://www.w3.org/DOM
    W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope

70

DOM & Levels
DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
    adds namespace support and minor new features

71

DOM Nodes

An XML document is a tree
The tree contains nodes
    one of them is the root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should or could have

72

Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java

73

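A Simple Java DOM Example

    // DOMParser is assumed to be the Xerces parser class;
    // print(Node, PrintStream) is a helper that is not shown on the slide.
    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
        try {
            // Load the whole document named on the command line into memory
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // Walk the children of the root until the first recipe element
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            // Print that recipe, if any, wrapped in a new collection element
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }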
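
The same navigation pattern, together with appendChild, can be tried without any external library. The following is a sketch under the assumption that the JDK's built-in DOM parser is used instead of the parser class of the slide; the class and method names are made up, and the player document is the one used throughout the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, named only for this example.
class DomDemo {

    // Loads the whole document in memory as a DOM tree.
    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Text of the first child of the root with the given name, or null.
    static String firstChildText(String xml, String name) {
        Node n = parse(xml).getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals(name)) {
            n = n.getNextSibling();
        }
        return n == null ? null : n.getTextContent();
    }

    // Appends a new team element to the root and returns how many team
    // elements the in-memory tree holds afterwards.
    static int teamsAfterAdding(String xml, String teamName) {
        Document doc = parse(xml);
        Element team = doc.createElement("team");
        team.setAttribute("current", "no");
        team.appendChild(doc.createTextNode(teamName));
        doc.getDocumentElement().appendChild(team);
        return doc.getElementsByTagName("team").getLength();
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        System.out.println(firstChildText(xml, "surname"));
        System.out.println(teamsAfterAdding(xml, "Mantova"));
    }
}
```

The update only changes the tree in memory: writing it back to a file would be a separate step.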

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist

75

Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events: document started / ended, elements started / ended, character content, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
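
The event-based model can be sketched with the SAX parser that ships with the JDK. The handler below just records the events in the order they arrive; the class name, the method name and the format of the log entries are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named only for this example.
class SaxDemo {

    // Returns the sequence of events generated while the document
    // flows through the SAX parser.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                flushText();
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                flushText();
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // Character content may arrive in several chunks: collect it.
                text.append(ch, start, length);
            }
            private void flushText() {
                String s = text.toString().trim();
                if (!s.isEmpty()) log.add("text:" + s);
                text.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: whatever the application wants to remember, it has to store itself while the events go by.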

76


How does XML Work?
Who handles XML documents
    after they have been produced?
    how? why?
XML parsers
    working out the structure of the XML document
    verifying well-formedness and basic respect of XML syntax
XML validating parsers
    when applicable
        there is either a DTD or a Schema
    checking validity
Examples
    web browsers, word processors, database servers, drawing programs, spreadsheets

15

Where is XML?

Everywhere already

16

Some History of XML
Lot to be written still…
SGML is where it comes from
    HTML was the first successful application of SGML
        but had obvious limitations
    too complex
        more than 150 pages
        never implemented fully
        too complex for the Internet
SGML "Lite" (1996, Bosak, Bray et al.)
    XML 1.0 (February 1998)
Then a flow
    namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

17

XML Fundamentals

18

A Simple XML Document

<player> Carlo Nervo</player>

19

XML Document & Files

This is a complete XML document
It can be stored, recorded, built in the form of a number of different files, or even in other forms
    Carlonervo.xml, player.txt
    a record in a database
    a memory area built by a CGI and then transmitted
    sent by a Web server with MIME type application/xml or text/xml

<player> Carlo Nervo</player>

20

XML Elements & Tags

The document contains a single element
    of type player
Such an element is delimited by the tag player
    between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
    tags are markup
        the most common form of markup, but there are other kinds
    content is character data
        including the white space between Carlo & Nervo

<player> Carlo Nervo</player>

21

Tag Syntax

Very similar to HTML tags
    at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
    tags with no content, like <br /> or <hr />
XML is case sensitive
    so <player> cannot be closed by end tag </Player>
    NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
        think of HTML, JavaScript & XHTML, …

22

XML Trees: A Simple Example

player
    name: Carlo
    surname: Nervo
    team: Bologna
    team: Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

23


An XML Document is a Tree

An XML document has a tree-like structure
    one and only one root
        root element, or document element
    each element can have one or more child elements
    each element, except the root, has exactly one parent
    child elements of the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
    <em><b>Wrong</em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed

24

Narrative-Organised XML Documents

<biography>
<name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.

After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

…

</biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
    elements with mixed content
    not easy for automated processing and exchange

25

XML Attributes
Elements can be labelled by attributes
    attributes are specified in the start tag
    and in the only tag of empty elements
    any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
    only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document

but they are qualifiers for the nodes and


26

Using Elements or Attributes?

Attributes are for meta-data about the element, and content is information of the element
    maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure, elements can be nested as needed
    attributes are unique within an element, any number of elements with the same name can appear

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

27

XML Names
XML names are used, and are the same, for the names of elements, attributes and some other constructs
    to increase efficiency and abate complexity
An XML name can include
    any letter
        Latin or even non-Latin, like ideographs
    any digit
    underscore, hyphen and period (_ - .)
    a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs or underscore

28

Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
    unless the escape character & is encountered
    character data to parse starts again after the ; character
E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
    Batman & Robin

29

Entity References

&entity; reference
    an entity is something defined outside the normal flow of the XML document
        out of the XML tree
        used for constants, common values, external values, etc.
    it is used through an entity reference
Users of any sort may define their own entities
    we'll see how soon, for instance through DTDs

30

Pre-defined XML Entities

Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote

31

CDATA Sections

Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good, for instance, to include segments of XML code to show
CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
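
That escaped text, entity references and CDATA sections all end up as the same character data can be checked with a few lines. The sketch assumes the JDK's standard DOM parser and reads the text content of the root element; the club entity, the sample strings and the class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, named only for this example.
class CharacterDataDemo {

    // Returns the character data of the root element after parsing.
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Pre-defined entity reference
        System.out.println(textOf("<superheroes>Batman &amp; Robin</superheroes>"));
        // The same code chunk, first escaped by hand, then in a CDATA section
        System.out.println(textOf("<code>a &lt; b &amp;&amp; b &lt; c</code>"));
        System.out.println(textOf("<code><![CDATA[a < b && b < c]]></code>"));
        // User-defined entity declared in an internal DTD
        System.out.println(textOf("<!DOCTYPE p [<!ENTITY club \"Bologna\">]>"
                + "<p>Carlo plays for &club;</p>"));
    }
}
```

The two code elements give exactly the same text, so the choice between escaping and CDATA is only a matter of convenience for whoever writes the document.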

32

Comments
Easy
    <!-- Comment -->
    it cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or another comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
    to pass information to computational agents, use processing instructions

33


Page 16: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Where is XML?
Everywhere, already

Some History of XML
A lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but had obvious limitations
    too complex
      more than 150 pages
      never implemented fully
      too complex for the Internet
SGML "Lite" (1996, Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

XML Fundamentals

A Simple XML Document
<player> Carlo Nervo</player>

XML Document &
This is a complete XML document
It can be stored, recorded, built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

<player> Carlo Nervo</player>

XML Elements &
The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

<player> Carlo Nervo</player>

Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

player
  name
    Carlo
  surname
    Nervo
  team
    Bologna
  team
    Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML Document is a Tree
An XML Document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have zero or more child elements
    each element other than the root has exactly one parent
    child elements from the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed

Narrative-Organised XML

<biography>
<name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.

After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

…

</biography>

XML Documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated with an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and …

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Using Elements or Attributes?
Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any …

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore

Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; char
E.g. the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

Entity References
&entity; reference
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML Entities

Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
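The table of pre-defined entities can be turned into a tiny escaping routine. What follows is only a minimal sketch: the class name XmlEscapeSketch and the method escape are made up for illustration, and real applications would normally leave escaping to their XML library.

```java
final class XmlEscapeSketch {

    // Replaces each character that has a pre-defined XML entity
    // with the corresponding entity reference.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: Batman &amp; Robin
        System.out.println(escape("Batman & Robin"));
    }
}
```

The ampersand has to be handled character by character (or first, if done by successive replacements), otherwise the & of an already inserted entity would be escaped again.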

CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser: do not parse this
  good, for instance, to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
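As a quick check of the point above, the following sketch parses the same code fragment twice, once escaped with entity references and once wrapped in a CDATA section, and compares the resulting character data. It assumes the standard JAXP DOM parser shipped with the JDK; the class name CdataSketch, the method textOf and the code element are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataSketch {

    // Parses the XML string and returns the character data of the root element.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<code>if (a &lt; b &amp;&amp; b &lt; c) swap();</code>";
        String cdata   = "<code><![CDATA[if (a < b && b < c) swap();]]></code>";
        // Both forms carry the same character data once parsed.
        System.out.println(textOf(escaped).equals(textOf(cdata)));
    }
}
```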

Comments
Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions

XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root

The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present it must be the very first thing in the document
    not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 when a byte-order mark is present)
Standalone set to yes means that the document does not depend on an external DTD
  optional, default no

Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
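Any non-validating parser can act as a well-formedness checker: if parsing fails with a fatal error, the document is not well-formed. The sketch below assumes the JAXP parser bundled with the JDK; WellFormedSketch and isWellFormed are illustrative names, not part of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedSketch {

    // Returns true if the parser accepts the text, false on a parse error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                       // violation of a well-formedness rule
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<player> Carlo Nervo</player>"));    // accepted
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>"));       // overlapping elements
        System.out.println(isWellFormed("<player>Carlo Nervo</Player>"));     // case mismatch
        System.out.println(isWellFormed("<team current=yes>Bologna</team>")); // unquoted attribute
    }
}
```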

DTD

Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle validity errors is optional

Validation
A valid XML Document includes (or refers to) a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still

A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

The name in the DOCTYPE declaration has to match the root element (player) for the document to be valid

DTD Declaration
In the example, the DTD is declared as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http…">

DTD Declarations
So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations
A player element contains one name, one surname and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

<!ELEMENT player (name, surname, team+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT team (#PCDATA)>

Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements

Attribute Declarations
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

<!ATTLIST team current (yes | no) #REQUIRED>

Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
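The element and attribute declarations discussed so far can be tried out with a validating parser. The sketch below is only an illustration under stated assumptions: it uses the JAXP parser bundled with the JDK, the names DtdValidationSketch, DTD and validityErrors are invented, and the DOCTYPE is named player so that it matches the root element, as validation requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DtdValidationSketch {

    // Internal DTD subset for the football player example.
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Parses with DTD validation switched on and collects the validity errors.
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // not well-formed, or I/O problem
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = DTD + "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = DTD + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(validityErrors(valid).isEmpty());  // no validity errors expected
        System.out.println(validityErrors(noTeam).isEmpty()); // team+ requires at least one team
    }
}
```

Validity problems are reported as recoverable errors, which is why they are collected through the handler rather than caught as exceptions; only well-formedness violations stop the parse.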

Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS

Other DTD Declarations
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces

What are Namespaces for
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespaces
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not …

Setting Default Namespaces
xmlns attribute
  alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
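To see how a parser resolves default and prefixed namespaces, the sketch below parses a small document mixing SVG and MathML and lists every element as {namespace URI}local name. It assumes the JAXP DOM parser of the JDK with namespace awareness switched on; NamespaceSketch, SAMPLE and expandedNames are names chosen for this example, and the prefix m is arbitrary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NamespaceSketch {

    // svg and the first set use the default namespace, m:set the MathML one.
    static final String SAMPLE =
          "<svg xmlns=\"http://www.w3.org/2000/svg\""
        + " xmlns:m=\"http://www.w3.org/1998/Math/MathML\">"
        + "<set/><m:set/></svg>";

    // Lists all elements, in document order, as {namespaceURI}localName.
    static List<String> expandedNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            collect(doc.getDocumentElement(), names);
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static void collect(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add("{" + node.getNamespaceURI() + "}" + node.getLocalName());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, names);
        }
    }

    public static void main(String[] args) {
        for (String name : expandedNames(SAMPLE)) {
            System.out.println(name);
        }
    }
}
```

The two set elements share the same local name but end up with different namespace URIs, which is exactly the disambiguation namespaces are meant to provide.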

Internationalisation

What does Text
"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
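The difference between a character set and its encodings shows up as soon as the same string is turned into bytes. A minimal sketch, where EncodingSketch and byteCount are illustrative names and the sample word is written with a Unicode escape so that the source file's own encoding does not matter:

```java
import java.nio.charset.Charset;

final class EncodingSketch {

    // Number of bytes needed to store the text in the given encoding.
    static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String word = "Universit\u00e0"; // 10 characters, the last one is a-grave
        System.out.println(byteCount(word, "ISO-8859-1")); // 10: one byte per character
        System.out.println(byteCount(word, "UTF-8"));      // 11: a-grave takes two bytes
        System.out.println(byteCount(word, "UTF-16BE"));   // 20: two bytes per character
    }
}
```

The same ten code points produce three different byte sequences, which is why a parser needs to be told (or to detect) which encoding a document uses.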

The XML Encoding Declaration
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-

Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML however can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML
Representing information in an XML Document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access / delete / modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have

Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
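The general node properties and methods listed above map directly onto the org.w3c.dom interfaces of the Java platform. The following is a sketch under assumptions: it relies on the JAXP parser bundled with the JDK, the helper names (DomNodesSketch, parse, teams, addTeam) are invented, and the extra team "Example FC" is a placeholder rather than real data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomNodesSketch {

    static final String SAMPLE =
          "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team><team current=\"no\">Mantova</team></player>";

    // Loads the whole document in memory as a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads each team's text content and its current attribute.
    static List<String> teams(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList teams = doc.getElementsByTagName("team");
        for (int i = 0; i < teams.getLength(); i++) {
            Node team = teams.item(i);
            String current = team.getAttributes().getNamedItem("current").getNodeValue();
            result.add(team.getTextContent() + " (current=" + current + ")");
        }
        return result;
    }

    // Appends a new team element after the other children of the root.
    static void addTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.appendChild(doc.createTextNode(name));
        doc.getDocumentElement().appendChild(team);
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Node root = doc.getDocumentElement();
        // Every node has a name and a type; ELEMENT_NODE is the constant 1.
        System.out.println(root.getNodeName() + " type=" + root.getNodeType());
        addTeam(doc, "Example FC", "no");
        System.out.println(teams(doc));
    }
}
```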

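A Simple Java DOM Example

  // Assumes the Apache Xerces parser, which provides DOMParser
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      // load the whole document named on the command line
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // scan the children of the root for the first recipe element
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      // wrap that recipe, if any, in a new collection document
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }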

Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
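The event-based model can be illustrated by a handler that simply logs the events it receives while the document flows through the parser. This is a minimal sketch assuming the SAX parser bundled with the JDK; SaxEventsSketch and events are illustrative names. Since a parser is allowed to deliver character content in several chunks, the handler buffers text until the next element boundary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventsSketch {

    // Returns the sequence of SAX events generated while reading the document.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            // Logs buffered character content, skipping white space only.
            private void flushText() {
                String t = text.toString().trim();
                if (!t.isEmpty()) log.add("text " + t);
                text.setLength(0);
            }
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                flushText();
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                flushText();
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built here: once an event has been handled, the parser keeps nothing about it, which is where the memory savings over DOM come from.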

Page 17: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Some History of XML Lot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGML

but had obvious limitationstoo complex

more than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

18

A Simple XML

ltplayergt Carlo Nervoltplayergt

19

XML Document amp

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervo

tags are markupthe most common form of markup but there are other kinds

content is character dataincluding the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed

24

Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML AttributesElements can be labelled by attributes

attributes are specified in the start tagand in the only tag of empty elements

any number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=value

alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML document

but they are qualifiers for the nodes and

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the two

Element-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore

28

Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsed

unless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

CommentsEasy

lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example below and comment

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

DTD Declaration

In the example, the DTD is declared as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">


DTD Declarations

So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)


Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements

Attribute

A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

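To see the element and attribute declarations at work, a validating parser can be asked to check a few variants of the player document. The following is a sketch under stated assumptions: it uses the JDK's DocumentBuilderFactory with validation switched on, the DOCTYPE name is taken to be player so that it matches the root element, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative helper (hypothetical name): DTD validation of the football player example.
class DtdValidation {

    // Internal DTD subset of the example; the DOCTYPE name equals the root element name.
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Parses DTD + body with validation on and returns the errors the parser reports. */
    static List<String> validityErrors(String body) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // enable DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // validity errors are recoverable: collect them
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (SAXException e) {
            errors.add("not well-formed: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String start = "<player><name>Carlo</name><surname>Nervo</surname>";
        System.out.println(validityErrors(start + "<team current=\"yes\">Bologna</team></player>"));
        System.out.println(validityErrors(start + "</player>"));                     // team+ violated
        System.out.println(validityErrors(start + "<team>Bologna</team></player>")); // #REQUIRED violated
    }
}
```

The first document should yield an empty list; the exact wording of the messages for the other two depends on the parser implementation.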

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
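A literal default can be observed from a program: when the attribute is missing, the parser supplies the declared value. The sketch below assumes the JDK DOM parser and a variant of the team declaration, invented here for illustration, in which current defaults to "no" instead of being #REQUIRED.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative helper (hypothetical name): literal attribute defaults from an internal DTD.
class AttributeDefaults {

    // Assumed variant of the example DTD: "no" is the literal default of current.
    static final String DTD =
          "<!DOCTYPE team [\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\">\n"
        + "]>\n";

    /** Returns the value of the current attribute as seen through the DOM. */
    static String currentOf(String teamElement) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(DTD + teamElement)));
            return doc.getDocumentElement().getAttribute("current");
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(currentOf("<team current=\"yes\">Bologna</team>")); // explicit value
        System.out.println(currentOf("<team>Mantova</team>"));                 // default from the DTD
    }
}
```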

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more permissive than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS

Other DTD

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces

What are

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro, and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but well-known prefixes such as svg and rdf are usually left unchanged by convention

Setting Default

xmlns attribute
  alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
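What matters to a namespace-aware parser is the URI and the local part, not the prefix. A small sketch, assuming the JDK DOM parser with namespace awareness enabled (class, method names and the g prefix are made up for illustration), shows that a default namespace and an explicit prefix lead to the same expanded name.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative helper (hypothetical name): namespace URI, local part and prefix via DOM.
class NamespaceDemo {

    /** Parses the text with namespace support and returns the root element. */
    static Element rootOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in the JDK factory
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Expanded name written as {namespace URI}local part. */
    static String expandedName(Node node) {
        return "{" + node.getNamespaceURI() + "}" + node.getLocalName();
    }

    public static void main(String[] args) {
        Element byDefault = rootOf("<svg xmlns=\"http://www.w3.org/2000/svg\"><rect/></svg>");
        Element byPrefix = rootOf("<g:svg xmlns:g=\"http://www.w3.org/2000/svg\"><g:rect/></g:svg>");
        System.out.println(expandedName(byDefault) + " prefix=" + byDefault.getPrefix());
        System.out.println(expandedName(byPrefix) + " prefix=" + byPrefix.getPrefix());
        // The child inherits the default namespace declared on its parent.
        System.out.println(expandedName(byDefault.getFirstChild()));
    }
}
```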

Internationalisation

What does Text

“Text” can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
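The difference between a code point and its encodings can be checked directly. The following sketch uses only java.nio.charset from the JDK; the class name and the sample characters are chosen here for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): one code point, several byte encodings.
class EncodingDemo {

    /** Number of bytes needed to encode the text in the given character encoding. */
    static int byteLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8"; // LATIN SMALL LETTER E WITH GRAVE, code point U+00E8
        String euro = "\u20AC";   // EURO SIGN, code point U+20AC

        System.out.println(Integer.toHexString(eGrave.codePointAt(0)));    // e8
        System.out.println(byteLength(eGrave, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(byteLength(eGrave, StandardCharsets.UTF_8));      // 2
        System.out.println(byteLength(eGrave, StandardCharsets.UTF_16BE));   // 2
        System.out.println(byteLength(euro, StandardCharsets.UTF_8));        // 3
        System.out.println(byteLength(euro, StandardCharsets.UTF_16BE));     // 2
    }
}
```

The same character thus occupies a different number of bytes depending on the encoding, which is why a reader of the file has to know which encoding was used.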

The XML Encoding

Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

Multi-Lingual

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-

Encoding for

Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
We often need to manipulate (access, delete, modify)
  parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used, hated, deprecated, widespread are
  DOM and SAX

Document Object Model

http://www.w3.org/DOM
  W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

Properties &

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
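In Java, those general properties and methods appear as getters on org.w3c.dom.Node. The sketch below, assuming the JDK DOM parser, reads name, type and attributes of nodes in the player document and then uses appendChild; the class name, the helper methods and the added team name "Example FC" are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative helper (hypothetical name): generic DOM node properties and appendChild.
class DomNodeDemo {

    static final String PLAYER =
          "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team><team current=\"no\">Mantova</team></player>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends a team element with current="no" to the root; returns the new number of teams. */
    static int addFormerTeam(Document doc, String teamName) {
        Element team = doc.createElement("team");
        team.setAttribute("current", "no");
        team.appendChild(doc.createTextNode(teamName));
        doc.getDocumentElement().appendChild(team); // goes after the existing children
        return doc.getElementsByTagName("team").getLength();
    }

    public static void main(String[] args) {
        Document doc = parse(PLAYER);
        Element root = doc.getDocumentElement();
        System.out.println(root.getNodeName() + " is an element: "
                + (root.getNodeType() == Node.ELEMENT_NODE));

        Node firstTeam = doc.getElementsByTagName("team").item(0);
        // attributes gives access to all attributes of the node
        System.out.println(firstTeam.getAttributes().getNamedItem("current").getNodeValue());

        System.out.println(addFormerTeam(doc, "Example FC")); // made-up team name
        System.out.println(root.getLastChild().getTextContent());
    }
}
```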

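A Simple Java DOM

  // DOMParser is assumed here to be the Apache Xerces DOM parser
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                        // load the whole document in memory
      Document doc = p.getDocument();
      // look for the first recipe child of the root element
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                         // print is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }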

Main Problem of

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
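The event-based model can be made concrete by logging the callbacks a SAX parser issues while the text flows through it. This is only a sketch, assuming the SAX parser shipped with the JDK (javax.xml.parsers.SAXParserFactory); the class name and the log format are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative helper (hypothetical name): records SAX events in the order they occur.
class SaxEvents {

    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName, String qName,
                                               Attributes attributes) {
                log.add("startElement " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("endElement " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // a parser may deliver one run of text in several chunks
                log.add("characters " + new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: the handler sees each event once and keeps only what it chooses to store, which is where the memory saving over DOM comes from.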

XML Fundamentals

A Simple XML

  <player>
    Carlo Nervo
  </player>

XML Document &

This is a complete XML document
It can be stored, recorded, built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml


XML Elements &

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element’s content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo


Tag Syntax

Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

  player
    name: Carlo
    surname: Nervo
    team: Bologna
    team: Mantova

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

An XML Document is

An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
    each element except the root has exactly one parent
    child elements from the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed

Narrative-Organised

  <biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.

  After playing many years in minor teams, such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

  …

  </biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
  and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and


Using Elements or

Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>

XML Names

XML Names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin or even non-latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore

Parsed Character

An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g. the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

Entity References

&entityreference;
an entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML

  Markup    Entity    Description
  &lt;      <         less-than
  &gt;      >         greater-than
  &amp;     &         ampersand
  &quot;    "         double quote
  &apos;    '         single quote

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser: do not parse this
  good, for instance, to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
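Both escaping mechanisms lead to the same character data once the document has been parsed. A minimal sketch, assuming the JDK DOM parser and using hypothetical class and method names, compares an entity reference with a CDATA section.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative helper (hypothetical name): character data after parsing.
class CharacterDataDemo {

    /** Returns the text content of the root element of the given document. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // pre-defined entity reference
        System.out.println(textOf("<superheroes>Batman &amp; Robin</superheroes>"));
        // same text inside a CDATA section, no escaping needed
        System.out.println(textOf("<superheroes><![CDATA[Batman & Robin]]></superheroes>"));
        // code chunk with < and & kept verbatim
        System.out.println(textOf("<code><![CDATA[if (a < b && b < c)]]></code>"));
    }
}
```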

Comments

Easy
  <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions

XML Processing

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be the very first thing in the document
    not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 with a byte order mark)
standalone="yes" means that the document does not depend on an external DTD
  optional, default no


Page 19: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

A Simple XML

ltplayergt Carlo Nervoltplayergt

19

XML Documents & Files

This is a complete XML document
It can be stored, recorded, built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

<player> Carlo Nervo</player>

XML Elements & Tags

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

<player> Carlo Nervo</player>

Tag Syntax

Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

player
  name: Carlo
  surname: Nervo
  team: Bologna
  team: Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML Document is a Tree

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML Document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have zero or more child elements
    each element except the root has exactly one parent
    child elements from the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed

Narrative-Organised XML Documents

<biography><name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.

After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

…

</biography>

XML Documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated with an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and …

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
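As a minimal sketch of how attributes qualify nodes without adding to the tree, the following Java fragment looks up the team whose current attribute is "yes". It assumes the JDK's built-in DOM parser; the class and method names (CurrentTeam, find) are hypothetical and not part of the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: finds the player's current team via the "current" attribute.
final class CurrentTeam {
    /** Returns the text of the first team element with current="yes", or null if none. */
    static String find(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList teams = doc.getElementsByTagName("team");
            for (int i = 0; i < teams.getLength(); i++) {
                Element team = (Element) teams.item(i);
                // The attribute is read from the element itself; it is not a child node.
                if ("yes".equals(team.getAttribute("current"))) {
                    return team.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }
}
```

Single quotes and spaces around the equals sign are accepted by the parser just like the double-quoted form.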

Using Elements or Attributes?

Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any …

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

XML Names

XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin, or even non-Latin like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore
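One way to try these naming rules out is to let a DOM implementation judge a candidate name. The sketch below assumes the JDK's built-in DOM implementation, whose createElement rejects strings that are not XML names with a DOMException; NameCheck and isAcceptedName are hypothetical names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical helper: asks the DOM implementation whether a string is usable as an element name.
final class NameCheck {
    static boolean isAcceptedName(String candidate) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            doc.createElement(candidate); // raises INVALID_CHARACTER_ERR for illegal names
            return true;
        } catch (DOMException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("could not create an empty document", e);
        }
    }
}
```

Under these assumptions, football_team and _team are accepted, while 1team (leading digit) and "foot ball" (white space) are rejected.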

Parsed Character Data

An XML Parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse start again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
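The superheroes example can be checked directly. This is a small sketch assuming the JDK's built-in DOM parser; the names ParsedText and of are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the parsed character data of the root element.
final class ParsedText {
    static String of(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Entity references such as &amp; and &lt; have already been replaced by the parser.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }
}
```

The application sees Batman & Robin; the escaped form only exists in the serialized document.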

Entity References

&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML Entities

Markup   Entity   Description
&lt;     <        less-than
&gt;     >        greater-than
&amp;    &        ampersand
&quot;   "        double quote
&apos;   '        single quote

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own closing delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
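A minimal sketch of the last point, assuming the JDK's built-in DOM parser with coalescing switched on (so that CDATA sections are merged into ordinary text nodes); CdataDemo and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: shows that CDATA and escaped text yield the same parsed content.
final class CdataDemo {
    private static Node firstChildOfRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(true); // convert CDATA sections into plain text nodes
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getFirstChild();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    /** Text content of the root element's first child. */
    static String text(String xml) {
        return firstChildOfRoot(xml).getNodeValue();
    }

    /** True if the root element's first child is an ordinary text node. */
    static boolean isPlainText(String xml) {
        return firstChildOfRoot(xml).getNodeType() == Node.TEXT_NODE;
    }
}
```

With coalescing left off, the DOM would instead keep a separate CDATA section node, so whether the origin is visible depends on how the parser is configured.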

Comments

Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents, there are processing instructions

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very purpose
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root
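A sketch of how an application might pick up a processing instruction that comes before the root element, assuming the JDK's built-in DOM parser. StylesheetPi and data are hypothetical names, and player.css in the usage below is a made-up file name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical helper: reads the xml-stylesheet processing instruction, if any.
final class StylesheetPi {
    /** Returns the data part of the first top-level xml-stylesheet instruction, or null. */
    static String data(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Processing instructions outside the root are children of the document node.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    if ("xml-stylesheet".equals(pi.getTarget())) {
                        return pi.getData();
                    }
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }
}
```

The parser hands over the target and the raw data string; interpreting the data (here, pseudo-attributes) is left to the application named by the target.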

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but, if present, it must be the very first thing in the document
    not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode
Standalone means that it has no external DTD
  optional, default no

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
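Any conforming parser can serve as a well-formedness checker. The following is a minimal sketch assuming the JDK's built-in DOM parser; WellFormed and check are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string is a well-formed XML document.
final class WellFormed {
    static boolean check(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness violation is a fatal parse error
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be set up or input not readable", e);
        }
    }
}
```

Each rule listed above can be probed by feeding check a small document that breaks it.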

DTD


Flexibility or …

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like "a football player should have at least one team"
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory as well-formedness is
  how to handle errors is optional

Validation

A valid XML Document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD does not say
  we stick with what we can specify

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own, is likely to be useful, or maybe necessary, for a while still

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declaration

DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declarations

So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
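The declarations can be exercised with a validating parser. This is a minimal sketch assuming the JDK's built-in DOM parser; PlayerValidator and isValid are hypothetical names. A validating parser requires the name after DOCTYPE to match the root element, so the sketch declares the document type as player.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: validates a player element against the DTD of the slides.
final class PlayerValidator {
    static final String DTD =
            "<!DOCTYPE player [\n"
          + "  <!ELEMENT player (name, surname, team+)>\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "  <!ELEMENT surname (#PCDATA)>\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
          + "]>\n";

    /** Returns true if the given player element is valid with respect to DTD. */
    static boolean isValid(String playerElement) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            final boolean[] valid = {true};
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    valid[0] = false; // validity violations are reported as recoverable errors
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + playerElement)));
            return valid[0];
        } catch (Exception e) {
            throw new IllegalStateException("document is not even well-formed, or parser failed", e);
        }
    }
}
```

A player without any team, with the children in a different order, or with a current value outside yes / no is well-formed but not valid.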

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty element

Attribute Declarations

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal, quoted string
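A small sketch of how literal and #FIXED defaults show up after parsing, assuming the JDK's built-in DOM parser, which fills in defaults declared in the internal DTD subset. AttributeDefaults and attribute are hypothetical names, and the sport attribute is invented for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: reads an attribute of a team element after DTD defaults are applied.
final class AttributeDefaults {
    // "current" defaults to the literal "no"; "sport" (an invented attribute) is fixed.
    static final String DTD =
            "<!DOCTYPE team [\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) \"no\"\n"
          + "                 sport CDATA #FIXED \"football\">\n"
          + "]>\n";

    static String attribute(String teamElement, String attributeName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DTD + teamElement)));
            return doc.getDocumentElement().getAttribute(attributeName);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }
}
```

The application cannot tell from the attribute value alone whether it was written in the document or supplied by the DTD.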

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more than an XML name: anything accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS

Other DTD Declarations

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces


What are Namespaces for?

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespaces

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3c.org/TR/REC-rdf-syntax">
URIs are standardised, prefixes are not
  but svg, rdf and other well-known prefixes are usually not changed

Setting Default Namespaces

xmlns attribute
  alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
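A sketch of how a namespace-aware parser tells the two set elements apart, whether the namespace is bound to a prefix or set as the default. It assumes the JDK's built-in DOM parser; ExpandedNames and its methods are hypothetical names, and the m prefix in the usage is an arbitrary choice.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: prints element names as {namespace URI}local_part.
final class ExpandedNames {
    private static Element root(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    private static String expanded(Element e) {
        return "{" + e.getNamespaceURI() + "}" + e.getLocalName();
    }

    /** Expanded name of the root element. */
    static String ofRoot(String xml) {
        return expanded(root(xml));
    }

    /** Expanded name of the root's first child, assumed to be an element. */
    static String ofFirstChild(String xml) {
        return expanded((Element) root(xml).getFirstChild());
    }
}
```

What identifies an element is the pair of namespace URI and local part; the prefix, or its absence, is only a shorthand.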

Internationalisation


What does "Text" Mean?

"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
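The difference between a code point and its encodings can be seen with a few lines of Java. This is only an illustrative sketch; EncodingSizes and its methods are hypothetical names, and the sample word is written with a Unicode escape so that the source file itself stays plain ASCII.

```java
import java.nio.charset.Charset;

// Hypothetical helper: compares how many bytes the same text takes in different encodings.
final class EncodingSizes {
    /** Number of bytes needed to store the text in the named encoding. */
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    /** Unicode code point of the first character of the text. */
    static int firstCodePoint(String text) {
        return text.codePointAt(0);
    }
}
```

For instance, the ten characters of "Universit\u00e0" take 10 bytes in ISO-8859-1, 11 in UTF-8 (the accented letter needs two) and 20 in UTF-16BE, while the code point of the accented letter is 0xE0 in every case.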

The XML Encoding Declaration

Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
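A sketch of why the declaration matters, assuming the JDK's built-in DOM parser reading raw bytes. DeclaredEncoding and decode are hypothetical names; the method serialises a one-element document in one encoding while declaring another (or the same) one.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks whether text survives a trip through bytes and the parser.
final class DeclaredEncoding {
    /**
     * Writes <word>text</word> using the "actual" encoding, declares "declared" in the
     * XML declaration, and returns what the parser reads back, or null if parsing fails.
     */
    static String decode(String text, String declared, String actual) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declared + "\"?>"
                + "<word>" + text + "</word>";
        byte[] bytes = xml.getBytes(Charset.forName(actual));
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors quietly
            return builder.parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            return null; // the bytes do not make sense in the declared encoding
        }
    }
}
```

When the declared and actual encodings agree, the text comes back unchanged; Latin-1 bytes declared as UTF-8 are rejected as soon as the parser meets the accented letter.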

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
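A sketch of how a tool such as a spell-checker might find the language of an element, assuming the JDK's built-in DOM parser. In the XML specification an xml:lang value also applies to the element's content unless overridden, so the lookup walks up towards the root. LanguageLookup and ofFirst are hypothetical names, as are the motto and note elements in the usage.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: resolves the language in force for an element.
final class LanguageLookup {
    /** Language of the first element with the given tag name, or null if none is declared. */
    static String ofFirst(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node node = doc.getElementsByTagName(tagName).item(0);
            // Walk from the element up to the root; the nearest xml:lang wins.
            for (Node n = node; n instanceof Element; n = n.getParentNode()) {
                Element e = (Element) n;
                if (e.hasAttribute("xml:lang")) {
                    return e.getAttribute("xml:lang");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }
}
```

An element without its own xml:lang takes the language of its closest ancestor that declares one.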

Encoding for Portability

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing directive
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { attribute1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML Documents

Representing information in an XML Document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
  access, delete, modify
  parts of an XML document
    which either may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
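A sketch of a recursive walk over the node tree that counts nodes by kind, assuming the JDK's built-in DOM parser; NodeCensus and summary are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: counts element, text and comment nodes in a document.
final class NodeCensus {
    /** Counts the nodes of the given type in the subtree rooted at node. */
    private static int count(Node node, short type) {
        int total = node.getNodeType() == type ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            total += count(child, type);
        }
        return total;
    }

    static String summary(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return "elements=" + count(doc, Node.ELEMENT_NODE)
                    + " text=" + count(doc, Node.TEXT_NODE)
                    + " comments=" + count(doc, Node.COMMENT_NODE);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }
}
```

Attribute nodes are not reached by this walk: they hang off their element but are not among its children.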

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
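In the Java binding, the attributes property and the appendChild method look roughly as follows. This is a sketch assuming the JDK's built-in DOM implementation; TeamAppender and appendTeam are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: appends a former team to a player and describes the result.
final class TeamAppender {
    static String appendTeam(String xml, String teamName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            // New nodes are created by the document, then attached to the tree.
            Element team = doc.createElement("team");
            team.setAttribute("current", "no");
            team.setTextContent(teamName);
            root.appendChild(team); // goes after the existing children

            // Generic node properties: name, attributes, children.
            Node last = root.getLastChild();
            NamedNodeMap attributes = last.getAttributes();
            Node first = attributes.item(0);
            return last.getNodeName()
                    + "[" + first.getNodeName() + "=" + first.getNodeValue() + "]:"
                    + last.getTextContent()
                    + " (" + root.getChildNodes().getLength() + " children)";
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }
}
```

The change only affects the in-memory tree; writing it back to a file is a separate serialisation step.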

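A Simple Java DOM Example

  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      // Parse the file named on the command line into a DOM tree (Xerces parser)
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // Scan the children of the root for the first node named "recipe"
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      // Emit a new document wrapping that node in a collection element
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }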

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as something is met: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
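A sketch of the event-based model, assuming the SAX parser bundled with the JDK. SaxEvents and trace are hypothetical names; the handler simply records each event as it arrives instead of building a tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the SAX events generated while a document flows through the parser.
final class SaxEvents {
    static List<String> trace(String xml) {
        final List<String> events = new ArrayList<>();
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { events.add("startDocument"); }
            @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so it is accumulated rather than logged per call.
                text.append(ch, start, length);
            }
            @Override public void endDocument() {
                events.add("text:" + text);
                events.add("endDocument");
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
        return events;
    }
}
```

Nothing is kept in memory except what the handler chooses to store, which is the point of SAX compared with DOM.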

Page 20: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

XML Document amp

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervo

tags are markupthe most common form of markup but there are other kinds

content is character dataincluding the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed

24

Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML AttributesElements can be labelled by attributes

attributes are specified in the start tagand in the only tag of empty elements

any number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=value

alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML document

but they are qualifiers for the nodes and

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the two

Element-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore

28

Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsed

unless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

CommentsEasy

lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone else

like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more permissive than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared in the DTD
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID of some element in the document

Other DTD

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually?
We stop here
  more only for those who need it

Namespaces


What are

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for

Qualified names
  or QNames, or raw names
  prefix:local_part
Examples of qualified names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but conventional prefixes such as svg and rdf are usually kept as they are

Setting Default

xmlns attribute
  alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
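How a namespace-aware parser tells the two set elements apart can be sketched as follows, assuming the JDK's JAXP parser. The class name NamespaceSketch and the prefix m are arbitrary choices for this illustration; the two namespace URIs are the standard SVG and MathML ones.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NamespaceSketch {
    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    // SVG is the default namespace; the (arbitrary) prefix m is bound to MathML.
    static final String SAMPLE =
        "<svg xmlns=\"" + SVG + "\" xmlns:m=\"" + MATHML + "\"><set/><m:set/></svg>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the index-th child element of the root element.
    static Element childElement(String xml, int index) {
        int seen = 0;
        for (Node n = parse(xml).getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && seen++ == index) {
                return (Element) n;
            }
        }
        throw new IllegalArgumentException("no child element at index " + index);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 2; i++) {
            Element e = childElement(SAMPLE, i);
            System.out.println(e.getNodeName() + " -> local name " + e.getLocalName()
                + ", prefix " + e.getPrefix() + ", namespace " + e.getNamespaceURI());
        }
    }
}
```

Both children have the same local name, set; only the namespace URI distinguishes them, and the unprefixed one inherits the default namespace declared on svg.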

Internationalisation


What does Text

"Text" can be encoded according to so many different alphabets
  character set
    mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared

The XML Encoding

Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
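The role of the encoding declaration can be illustrated by storing the same text with two different encodings and letting the parser decode the bytes. A minimal sketch, assuming the JDK's JAXP parser; the class name EncodingSketch and the sample element are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class EncodingSketch {
    // Builds the bytes of a small document whose declaration names the encoding used.
    static byte[] encode(String declaredEncoding, Charset charset) {
        String doc = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
            + "<name>Universit\u00E0</name>"; // \u00E0 is "a with grave accent"
        return doc.getBytes(charset);
    }

    // Parses raw bytes; the parser reads the encoding declaration to decode them.
    static String rootText(byte[] documentBytes) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentBytes))
                .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = encode("ISO-8859-1", StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode("utf-8", StandardCharsets.UTF_8);
        // Different byte sequences on disk, same characters after parsing.
        System.out.println("same text: " + rootText(latin1).equals(rootText(utf8)));
    }
}
```

The accented letter occupies one byte in Latin-1 and two in UTF-8, yet the application sees the same string in both cases because each document declares how it was encoded.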

Multi-Lingual

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
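An xml:lang value applies to the element carrying it and to its content until an inner element overrides it, so a tool such as a spell-checker has to look upwards in the tree. A minimal sketch of that lookup, assuming the JDK's DOM implementation; the class name XmlLangSketch and the sample document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlLangSketch {
    static final String SAMPLE =
        "<doc xml:lang=\"en\"><p>Good morning</p><p xml:lang=\"it\">Buongiorno</p></doc>";

    // Walks up from the node to the nearest xml:lang declaration; "" if there is none.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element && ((Element) n).hasAttribute("xml:lang")) {
                return ((Element) n).getAttribute("xml:lang");
            }
        }
        return "";
    }

    // Convenience wrapper: language in force for the index-th <p> element.
    static String languageOfParagraph(String xml, int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return languageOf(doc.getElementsByTagName("p").item(index));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("first paragraph:  " + languageOfParagraph(SAMPLE, 0));
        System.out.println("second paragraph: " + languageOfParagraph(SAMPLE, 1));
    }
}
```

The first paragraph has no xml:lang of its own and takes the value from doc; the second overrides it locally.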

Encoding for

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C Standard
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
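The association between document and style sheet is ordinary markup, so an application can read it like any other node. A minimal sketch, assuming the JDK's DOM implementation; the class name StylesheetPiSketch and the file name player.css are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class StylesheetPiSketch {
    static final String SAMPLE =
          "<?xml version=\"1.0\"?>\n"
        + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>\n"
        + "<player><name>Carlo</name></player>";

    // Returns the data of the first xml-stylesheet processing instruction, or null.
    static String stylesheetData(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        // Processing instructions before the root element are children of the document node.
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                if ("xml-stylesheet".equals(pi.getTarget())) {
                    return pi.getData();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(stylesheetData(SAMPLE));
    }
}
```

Note that the XML declaration on the first line does not show up among the nodes: it looks like a processing instruction but is not one.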

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises It

DOM & SAX

Manipulating XML

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
We often need to manipulate / access / delete / modify
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX

Document Object Model

http://www.w3.org/DOM/
  W3C standard, as usual
The Document Object Model is "a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals, and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly with huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
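In the Java binding, the general properties become getter methods on org.w3c.dom.Node. The following is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name DomNodeSketch and the helper addFormerTeam are invented for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomNodeSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Appends <team current="no">teamName</team> after the other children of the root.
    static Document addFormerTeam(Document doc, String teamName) {
        Element team = doc.createElement("team");
        team.setAttribute("current", "no");
        team.appendChild(doc.createTextNode(teamName));
        doc.getDocumentElement().appendChild(team);
        return doc;
    }

    public static void main(String[] args) {
        Document doc = parse("<player><team current=\"yes\">Bologna</team></player>");
        Node team = doc.getDocumentElement().getFirstChild();

        // Name, value and type are available on every kind of node.
        System.out.println(team.getNodeName() + ": type " + team.getNodeType()
            + ", value " + team.getNodeValue());

        // The attributes property is a map of attribute nodes.
        NamedNodeMap attributes = team.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node a = attributes.item(i);
            System.out.println("  @" + a.getNodeName() + " = " + a.getNodeValue());
        }

        addFormerTeam(doc, "Mantova");
        System.out.println("teams now: " + doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

For an element the node value is null: the text Bologna lives in a separate text node that is the child of team.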

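A Simple Java DOM

import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;  // Xerces DOM parser
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public static void main(String[] args) {
  try {
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // look for the first child of the root element named "recipe"
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}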

Main Problem of DOM

The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM
  in terms of standardisation
  as well as of tools
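The event-based model can be made visible by logging each callback as the parser walks through the text. A minimal sketch, assuming the SAX classes bundled with the JDK; the class name SaxSketch and the log format are arbitrary choices for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxSketch {
    // Parses the text and returns the sequence of SAX events as readable strings.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text " + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: the handler sees each piece once, in document order, and anything it does not record is gone, which is what keeps memory use low.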


XML Elements &

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo

<player>
  Carlo Nervo
</player>

Tag Syntax

Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

player
  name
    Carlo
  surname
    Nervo
  team
    Bologna
  team
    Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML Document is

An XML Document has a tree-like structure
  one and only one root
    root element or document element
  each element can have one or more child elements
  each element, except the root, has exactly one parent
  child elements from the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Narrative-Organised

<biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
</biography>

XML Documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for its nodes

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Using Elements or

Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, while child elements can be repeated

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

XML Names

XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin, or even non-latin like ideographs
  any digit
  underscore, hyphen, and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore

Parsed Character Data

An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; char
E.g. the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

Entity References

&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML

Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser: do not parse this
  good for instance to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
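Entity references and CDATA sections are two ways of writing the same character data, which can be checked by comparing the text an application receives. A minimal sketch, assuming the JDK's JAXP parser; the class name TextEscapingSketch and the club entity are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class TextEscapingSketch {
    // Returns the character data of the root element after parsing.
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Pre-defined entity reference: &amp; stands for a literal ampersand.
        System.out.println(rootText("<superheroes>Batman &amp; Robin</superheroes>"));

        // The same code chunk, once escaped character by character, once in a CDATA section.
        String escaped = "<code>a &lt; b &amp;&amp; b &lt; c</code>";
        String cdata = "<code><![CDATA[a < b && b < c]]></code>";
        System.out.println(rootText(escaped).equals(rootText(cdata)));

        // A user-defined entity (hypothetical name "club"), declared in an internal DTD.
        String withEntity = "<!DOCTYPE note [ <!ENTITY club \"Bologna FC\"> ]><note>&club;</note>";
        System.out.println(rootText(withEntity));
    }
}
```

The CDATA version is easier to read and write when a chunk contains many special characters, but the resulting text is the same as with escaping.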

Comments

Easy
  <!-- Comment -->
  it cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to a computational agent: processing instructions

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very purpose
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if there, it should absolutely be the first thing in the document
    not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the encoding of the text (UTF-8, a Unicode encoding, in the example)
  optional, default UTF-8 (or UTF-16 with a byte-order mark)
standalone="yes" means that the document does not depend on an external DTD
  optional, default no

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
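Any XML parser doubles as a well-formedness checker, since it must refuse a document that breaks these rules. A minimal sketch, assuming the JDK's JAXP parser; the class name WellFormedSketch and the sample strings are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedSketch {
    // True if the text parses without a fatal (well-formedness) error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores the rest, without console output.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<b><em>Right</em> XML</b>",         // properly nested
            "<em><b>Wrong </em> XML</b>",        // overlapping elements
            "<player>Carlo</Player>",            // case mismatch between tags
            "<team current=yes>Bologna</team>",  // unquoted attribute value
            "<name>Carlo</name><name>Nervo</name>" // two root elements
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```

Only the first sample passes; each of the others violates exactly one of the rules listed above.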

DTD


Flexibility or

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature, within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle validity errors is optional

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example below and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declaration

DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>


Page 22: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed

24

Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML AttributesElements can be labelled by attributes

attributes are specified in the start tagand in the only tag of empty elements

any number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=value

alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML document

but they are qualifiers for the nodes and

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the two

Element-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML Names

XML names are used, and are the same, for the names of elements, attributes, and some other constructs
  to increase efficiency and reduce complexity
An XML name can include
  any letter
    Latin, or even non-Latin, like ideographs
  any digit
  underscore, hyphen, and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore

Parsed Character Data

An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

Entity References

&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML Entities

Markup    Entity   Description
&lt;      <        less-than
&gt;      >        greater-than
&amp;     &        ampersand
&quot;    "        double quote
&apos;    '        single quote
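The five predefined entities can be exercised with a short Java sketch. It parses the superheroes element with the JDK's DOM parser to show the reference being resolved, and adds a hypothetical escape helper (not part of any library) that produces the references in the opposite direction.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityDemo {
    /** Parses a document and returns the parsed character data of its root element. */
    static String parsedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical helper: replaces the five special characters with their entity references. */
    static String escape(String text) {
        // '&' must be handled first, otherwise the '&' of the other references would be escaped again
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.println(parsedText("<superheroes>Batman &amp; Robin</superheroes>"));
        System.out.println(escape("a < b && c > d"));
    }
}
```

Escaping a string and parsing it back should return the original text, which is a quick way to check that no special character was missed.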

CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good for instance to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own closing delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
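A small Java sketch, assuming the JDK's DOM parser, shows that the same character data results whether the text is written in a CDATA section or with entity references. CdataDemo and text are illustrative names only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataDemo {
    /** Returns the character data of the root element, CDATA sections included. */
    static String text(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String viaCdata = text("<script><![CDATA[if (a < b && b < c) go();]]></script>");
        String viaEntities = text("<script>if (a &lt; b &amp;&amp; b &lt; c) go();</script>");
        System.out.println(viaCdata);
        System.out.println(viaCdata.equals(viaEntities));
    }
}
```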

Comments

Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions serve this very purpose
  <?target … ?>
The target may be the application that has to handle the instruction, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but, if there, it must be the first thing in the document, absolutely
    not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default UTF-8 (or UTF-16 when a byte order mark says so)
Standalone="yes" means that the document does not depend on an external DTD
  optional, default no

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
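Besides the tools on the Web, any XML parser can act as a well-formedness checker: a non-validating parse either succeeds or stops at the first fatal error. The sketch below assumes the JDK's built-in JAXP parser; WellFormedCheck and isWellFormed are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true if the text parses as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness rule was violated
        } catch (Exception e) {
            throw new IllegalStateException(e); // I/O or parser configuration problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<player><name>Carlo</name></player>"));
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>"));
    }
}
```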

DTD

Flexibility or

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory as well-formedness is
  how to handle validity errors is optional

Validation

A valid XML document includes (or refers to) a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

The name in the DOCTYPE declaration is player, since for a document to be valid it must match the root element

DTD Declaration

DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declarations

So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements

Attribute Declarations

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
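The element and attribute declarations of the player DTD can be tried out with a validating parser. The following is a sketch only, assuming the JDK's built-in JAXP parser with DTD validation switched on. DtdValidationDemo and validate are hypothetical names, and the DOCTYPE name used here is player, because a validating parser requires it to match the root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationDemo {
    // Internal DTD subset, prepended to every document body to validate.
    static final String DTD =
        "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Returns the validity errors found in the given body (an empty list means valid). */
    static List<String> validate(String body) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // validity errors are recoverable: collect them and go on
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                // well-formedness errors are fatal: stop parsing
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        for (String message : validate(noTeam)) {
            System.out.println(message);
        }
    }
}
```

The split between error and fatalError mirrors the difference between validity and well-formedness: an invalid document can still be read to the end, while a malformed one cannot.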

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
  name(s) matching the ID of some element in the document

Other DTD Declarations

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces

What are Namespaces for

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespaces

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro, and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but usually svg, rdf, and other conventional prefixes are not changed

Setting Default Namespaces

xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
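What a namespace-aware parser does with prefixes and default namespaces can be sketched in Java as follows. It assumes the JDK's DOM parser with namespace awareness switched on. NamespaceDemo and its methods are illustrative names, and the {URI}local notation is just a convenient way to print a namespace and a local part together.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NamespaceDemo {
    private static Element root(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Prints an element name as {namespace URI}local part; the prefix plays no role. */
    static String expandedName(Element e) {
        return "{" + e.getNamespaceURI() + "}" + e.getLocalName();
    }

    static String rootName(String xml) {
        return expandedName(root(xml));
    }

    /** Expanded name of the first child element of the root, or null if there is none. */
    static String firstChildName(String xml) {
        Node n = root(xml).getFirstChild();
        while (n != null && n.getNodeType() != Node.ELEMENT_NODE) {
            n = n.getNextSibling();
        }
        return n == null ? null : expandedName((Element) n);
    }

    public static void main(String[] args) {
        String svg = "<svg xmlns='http://www.w3.org/2000/svg'><set/></svg>";
        System.out.println(rootName(svg));
        System.out.println(firstChildName(svg));
        System.out.println(rootName("<m:set xmlns:m='http://www.w3.org/1998/Math/MathML'/>"));
    }
}
```

The two set elements end up with different expanded names, which is exactly the disambiguation namespaces are meant to provide; two different prefixes bound to the same URI would instead give the same name.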

Internationalisation

What does Text Mean?

"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
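The difference between a character set and its encodings can be seen by counting bytes. The Java sketch below encodes the same ten characters in three ways. EncodingDemo and byteLength are made-up names, and UTF-16BE is chosen so that no byte order mark is added to the count.

```java
import java.nio.charset.Charset;

class EncodingDemo {
    /** Number of bytes needed to store the text in the named encoding. */
    static int byteLength(String text, String encodingName) {
        return text.getBytes(Charset.forName(encodingName)).length;
    }

    public static void main(String[] args) {
        String text = "Universit\u00e0"; // "Università": ten code points, one of them non-ASCII
        System.out.println(text.codePointCount(0, text.length()));
        for (String name : new String[] {"ISO-8859-1", "UTF-8", "UTF-16BE"}) {
            System.out.println(name + ": " + byteLength(text, name) + " bytes");
        }
    }
}
```

The same code points take 10, 11, and 20 bytes respectively, which is why a parser has to know the encoding before it can read a single character.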

The XML Encoding Declaration

Part of the XML declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard: two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-

Encoding for Portability

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML Documents

Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access / delete / modify)
  parts of an XML document
  which either may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals, and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
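One of the solutions for Java is the org.w3c.dom binding shipped with the JDK. The following sketch uses it to append a new team element to a player document. DomNodeDemo and addTeam are hypothetical names, while createElement, setAttribute, createTextNode, and appendChild are the standard DOM methods.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomNodeDemo {
    /** Appends a non-current team to the player and returns all team names in document order. */
    static List<String> addTeam(String xml, String teamName) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        Element player = doc.getDocumentElement();

        // New nodes are created by the document, then attached to the tree
        Element team = doc.createElement("team");
        team.setAttribute("current", "no");
        team.appendChild(doc.createTextNode(teamName));
        player.appendChild(team); // goes after the existing children

        List<String> names = new ArrayList<>();
        NodeList teams = player.getElementsByTagName("team");
        for (int i = 0; i < teams.getLength(); i++) {
            names.add(teams.item(i).getTextContent());
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><team current='yes'>Bologna</team></player>";
        System.out.println(addTeam(xml, "Mantova"));
    }
}
```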

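A Simple Java DOM Example

// DOMParser is assumed to be the Apache Xerces class, which must be on the classpath
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public static void main(String[] args) {
  try {
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    Node n = doc.getDocumentElement().getFirstChild();
    // skip siblings until the first recipe element is found
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);  // print is a helper not shown on the slide
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}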

Main Problem of DOM

The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
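The event-based model can be illustrated with a short Java sketch that records the events the JDK's SAX parser generates for a small document. SaxEventsDemo, events, and the labels in the log are illustrative choices; DefaultHandler and its callbacks are the standard SAX API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    /** Parses the text and returns the sequence of SAX events, one label per event. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // a parser may deliver long text in several chunks; here we log each non-blank one
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: the handler sees each event once, in document order, and keeps only what it chooses to store, which is where the memory savings over DOM come from.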


XML Trees: A Simple Example

player
  name
    Carlo
  surname
    Nervo
  team
    Bologna
  team
    Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML Document is a Tree

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
    each element but the root has exactly one parent
  child elements from the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping not allowed
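The tree view of a document can be made concrete with a short recursive walk. The Java sketch below, assuming the JDK's DOM parser, prints one line per element or text leaf, indented by depth. TreeWalkDemo and outline are made-up names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TreeWalkDemo {
    /** Returns an indented outline of the element tree, one node per line. */
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder sb = new StringBuilder();
            walk(doc.getDocumentElement(), 0, sb);
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static void walk(Node node, int depth, StringBuilder sb) {
        String label;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            label = node.getNodeName();
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            label = node.getNodeValue().trim();
            if (label.isEmpty()) return; // skip indentation-only text
        } else {
            return; // comments, processing instructions, etc. are left out of the outline
        }
        for (int i = 0; i < depth; i++) sb.append("  ");
        sb.append(label).append('\n');
        // siblings are visited in document order, each one level deeper than the parent
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, sb);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(
            "<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current='yes'>Bologna</team><team current='no'>Mantova</team></player>"));
    }
}
```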

Narrative-Organised ltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML AttributesElements can be labelled by attributes

attributes are specified in the start tagand in the only tag of empty elements

any number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=value

alternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML document

but they are qualifiers for the nodes and

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the two

Element-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore

28

Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsed

unless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

CommentsEasy

lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone else

like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute value

NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFS

48

Other DTD

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

50

What are Distinguish

different XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes Example

a large firm could have a number of namespaces for different purposes

ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not

53

Setting Default

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tags

prefix x-58

Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue

it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portability

across platforms across applications across time

59

XML amp CSS

60

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Cascading Style Sheets (CSS)

a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basics

if not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tags

nometag attributo1 valore1 hellip

64

XML + CSS Example: The XML Doc
    [figure not reproduced]

Example: How Mozilla Visualises It
    [figures not reproduced]

DOM & SAX

Manipulating XML
Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access, delete, modify)
    parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
    through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model
http://www.w3.org/DOM/
    W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    possibly huge memory consumption
It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
        primitive navigation and manipulation of XML trees
        other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
        adds Namespace support and minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
    one of them is the root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
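The general node properties and methods listed above can be tried with the JAXP DOM implementation that ships with the JDK. This is only a minimal sketch: the class name DomNodeDemo, the helper names parse and addTeam, and the team name "Example FC" are illustrative assumptions, not part of the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomNodeDemo {
    // The player document used throughout the slides, without whitespace between elements
    static final String PLAYER =
        "<player><name>Carlo</name><surname>Nervo</surname>"
      + "<team current=\"yes\">Bologna</team><team current=\"no\">Mantova</team></player>";

    /** Parses an XML string into a DOM tree (checked exceptions wrapped for brevity). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends a new team element to the root and returns the new number of teams. */
    static int addTeam(Document doc, String teamName, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.appendChild(doc.createTextNode(teamName));
        // appendChild puts the new node after the existing children of the root
        doc.getDocumentElement().appendChild(team);
        return doc.getElementsByTagName("team").getLength();
    }

    public static void main(String[] args) {
        Document doc = parse(PLAYER);
        Node firstTeam = doc.getElementsByTagName("team").item(0);
        // Every node has a name, a value and a type
        System.out.println(firstTeam.getNodeName() + " / " + firstTeam.getNodeValue()
                + " / " + firstTeam.getNodeType());
        // getAttributes() returns all the attributes of the node
        NamedNodeMap attrs = firstTeam.getAttributes();
        System.out.println("current = " + attrs.getNamedItem("current").getNodeValue());
        System.out.println("teams after append: " + addTeam(doc, "Example FC", "no"));
    }
}
```

Note that the value of an element node is null in DOM; the text of an element lives in its child text nodes.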

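A Simple Java DOM Example
    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;  // Xerces DOM parser
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // look for the first recipe element among the children of the root
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);  // helper that writes out a node; not shown on the slide
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }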

Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist

Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text document
    flowing through the SAX parser
    and generating events: document started / ended, elements started / ended, character content, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not as well supported as DOM
    in terms of standardisation
    as well as of tools
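The event-based model can be sketched with the SAX parser bundled with the JDK. The class name SaxEvents and the textual event labels are assumptions made for this illustration; only the DefaultHandler callbacks are part of the SAX API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEvents {
    /** Parses the XML string and returns the sequence of SAX events as readable labels. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // a parser is allowed to deliver one run of text in several chunks
                log.add("text:" + new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built here: the handler sees each event once, in document order, and whatever it does not record is gone. That is the memory advantage over DOM, and also the reason why SAX code has to keep its own state.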


An XML Document is a Tree
An XML document has a tree-like structure
    one and only one root
        the root element, or document element
    each element can have one or more child elements
    each element except the root has exactly one parent
    child elements of the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
    <em><b>Wrong</em> XML</b> is not permitted
    nesting needs to be perfect: overlapping is not allowed
Example
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Narrative-Organised XML Documents
    <biography><name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
    …
    </biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
    elements with mixed content
    not easy for automated processing and exchange

XML Attributes
Elements can be labelled by attributes
    attributes are specified in the start tag
    and in the only tag of empty elements
    any number of attributes can in principle be associated with an element
An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
    only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
    but they are qualifiers for the nodes and …
Example
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Using Elements or Attributes
Attributes are for meta-data about the element, and content is the information of the element
    maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
    attributes provide a flat data structure, elements can be nested as needed
    attributes are unique within an element, any …
Example
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes" value="Bologna" />
      <team current="no" value="Mantova" />
    </player>

XML Names
XML names are used, and are the same, for the names of elements, attributes and some other constructs
    to increase efficiency and abate complexity
An XML name can include
    any letter
        latin, or even non-latin, like ideographs
    any digit
    underscore, hyphen and period (_ - .)
    a colon (:) is reserved for namespaces
An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs or underscore

Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    character data to parse start again after the ; character
E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
    Batman & Robin

Entity References
&entityreference;
An entity is something defined outside the normal flow of the XML document
    out of the XML tree
    used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
    we'll see how soon, for instance through DTDs

Pre-defined XML Entities
    Markup    Entity    Description
    &lt;      <         less-than
    &gt;      >         greater-than
    &amp;     &         ampersand
    &quot;    "         double quote
    &apos;    '         single quote
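What the parser does with the predefined entity references can be checked directly. The sketch below uses the JAXP DOM parser from the JDK; the class name EntityDemo and the helper rootText are hypothetical names chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityDemo {
    /** Parses an XML string and returns the parsed character data of its root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The example from the slides: &amp; in the markup is a plain & after parsing
        System.out.println(rootText("<superheroes>Batman &amp; Robin</superheroes>"));
        // All five predefined entities in one element
        System.out.println(rootText("<t>&lt; &gt; &amp; &quot; &apos;</t>"));
    }
}
```

The application never sees the escapes: by the time the text is available through the API, every reference has already been replaced by the character it stands for.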

CDATA Sections
Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good, for instance, to include segments of XML code to show
CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own delimiters
After parsing, there is no way to tell whether a text came from a CDATA section or not
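As a small check of the last point, the sketch below parses the same code fragment once inside a CDATA section and once with escaped characters, and compares the resulting text. The class name CdataDemo and the helper rootText are assumptions of this example; the comparison is on the text content only, since a DOM tree may still keep CDATA sections as separate nodes unless the parser is asked to merge them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CdataDemo {
    /** Parses an XML string and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Inside CDATA, < and & need no escaping
        String viaCdata = rootText("<code><![CDATA[if (a < b && b < c)]]></code>");
        // Outside CDATA, the same characters must be written as entity references
        String viaEscapes = rootText("<code>if (a &lt; b &amp;&amp; b &lt; c)</code>");
        System.out.println(viaCdata);
        System.out.println(viaCdata.equals(viaEscapes));
    }
}
```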

Comments
Easy
    <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
Parsers may either drop or keep them, at their will
Comments are meant to improve human legibility of XML docs
    to give information to a computational agent, use processing instructions

XML Processing Instructions
Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
Processing instructions serve this very purpose
    <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root

The XML Declaration
Looks like an XML processing instruction
    but it is not: it is just the XML declaration
It is optional
    but if present, it must be the very first thing in the document
    not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the character encoding of the text (UTF-8, a Unicode encoding, in the example)
    optional, default Unicode (UTF-8, or UTF-16 when a byte-order mark is present)
standalone="yes" means that the document does not depend on an external DTD
    optional, default no

Checking Well-Formedness
Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
Tools on the Web
    just look around
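Any XML parser is also a well-formedness checker, since it must reject a document that breaks these rules. A minimal sketch with the JDK parser follows; the class name WellFormedCheck and the method isWellFormed are names assumed for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    /** Returns true if the string parses as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness violation was found
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<em><b>Right</b> XML</em>",   // properly nested
            "<em><b>Wrong</em> XML</b>",   // overlapping elements
            "<a/><b/>",                    // two root elements
            "<a x=1/>",                    // unquoted attribute value
            "<a>1 < 2</a>"                 // unescaped < in character data
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```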

DTD

Flexibility or …
XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
    some control over syntax should be enforced
    for example, a football player should have at least one team
Document Type Definition (DTD)
    to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
    how to handle validity errors is left to the application

Validation
A valid XML document includes, or refers to, a DTD that the document satisfies
Main principle
    everything not permitted is forbidden
    that is, a DTD specifies only what is allowed
Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
There are many things a DTD does not say
    we stick with what we can specify

DTD is…
SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD
We do not go too deep into DTD syntax
    we just look at the example below and comment on it
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
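To see validation at work, the sketch below feeds documents with this DTD to the validating parser in the JDK and collects the validity errors. Everything here is an assumption made for illustration: the class name DtdValidation, the helper validationErrors, and the choice of player as the name in the DOCTYPE declaration, which validity requires to match the root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class DtdValidation {
    // Internal DTD subset; the DOCTYPE name must match the root element for validity
    static final String DTD =
        "<!DOCTYPE player [\n"
      + "  <!ELEMENT player (name, surname, team+)>\n"
      + "  <!ELEMENT name (#PCDATA)>\n"
      + "  <!ELEMENT surname (#PCDATA)>\n"
      + "  <!ELEMENT team (#PCDATA)>\n"
      + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
      + "]>\n";

    /** Validates DTD + body and returns the validity error messages (empty if valid). */
    static List<String> validationErrors(String body) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // switch on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // validity errors are recoverable
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // not well-formed, or parser setup failed
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
                     + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(validationErrors(valid));
        System.out.println(validationErrors(noTeam));
    }
}
```

The second document is well-formed, so it still parses; it is only reported as invalid because the DTD asks for at least one team.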

DTD Declaration
The DTD is declared here as internal
    but could be declared separately
        <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external, shared resource
        <!DOCTYPE player SYSTEM "http://…">

DTD Declarations
So you may
    define your own DTD, and
        either include it in your XML document
        or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
        like a working group you belong to, or a standardisation body of any sort
        by referring to that externally-defined syntax from your XML docs

Element Declarations
A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>

Some Syntax
, is for sequence
    to define ordered lists
| is for choice
    to provide for alternatives
Suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
Parentheses for grouping
    at any level of nesting
    operators and suffixes are applicable at any level
ANY for free-form content
EMPTY for empty elements

Attribute Declarations
A team element has a current attribute
    which is mandatory
        #IMPLIED would say optional instead
    and can be either yes or no
        enumeration as an attribute type
    <!ATTLIST team current (yes | no) #REQUIRED>

Attribute Defaults
#IMPLIED
    the attribute is optional
#REQUIRED
    the attribute is mandatory
#FIXED
    whether it is explicitly specified or not, it has a given value
literal
    the default value is the literal quoted string
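The literal default can be observed from a program: when an attribute is omitted, the parser supplies the default declared in the DTD. The following is a sketch under stated assumptions: the class name AttributeDefaults, the helper currentOf, and the variation of the team declaration with a literal default "no" are all made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeDefaults {
    // Variation on the slides' DTD: current gets the literal default "no"
    static final String DOC =
        "<!DOCTYPE player [\n"
      + "  <!ELEMENT player (team+)>\n"
      + "  <!ELEMENT team (#PCDATA)>\n"
      + "  <!ATTLIST team current (yes | no) \"no\">\n"
      + "]>\n"
      + "<player><team>Mantova</team><team current=\"yes\">Bologna</team></player>";

    /** Returns the value of the current attribute of the team at the given position. */
    static String currentOf(int teamIndex) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)));
            Element team = (Element) doc.getElementsByTagName("team").item(teamIndex);
            return team.getAttribute("current");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Mantova: current=" + currentOf(0)); // attribute omitted in the document
        System.out.println("Bologna: current=" + currentOf(1)); // attribute given explicitly
    }
}
```

The default comes from the internal DTD subset, so it is applied even though validation is not switched on here.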

Attribute Types
CDATA
    any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
    more permissive than an XML name: any name character is accepted as the first character
    the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
ID
    an XML name, unique in the document, working as an identifier
IDREF, IDREFS

Other DTD Declarations
ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
    who cares, actually
We stop here
    more only for those who need it

Namespaces

What are Namespaces For
Distinguish
    different XML applications may use the same names
        at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
Group
    names of elements and attributes of the same XML application can be grouped together
        to be more easily recognised and handled
Example: set is an element in both the SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names

Syntax for Namespaces
Qualified names
    prefix:local_part
Examples of qualified names
    or QNames, or raw names
    rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes
Example
    a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
    but conventional prefixes such as svg and rdf are usually kept as they are

Setting Default Namespaces
xmlns attribute
    alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
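How a namespace-aware parser resolves prefixed and default namespaces can be sketched as follows. The class name NamespaceDemo and the helper rootName are assumptions of the example, as is the prefix m; the two URIs are the namespace names published by the W3C for SVG and MathML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NamespaceDemo {
    /** Returns { namespace URI, local name, prefix } of the root element. */
    static String[] rootName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return new String[] { root.getNamespaceURI(), root.getLocalName(), root.getPrefix() };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same local name, set, in two different namespaces
        String[] svgSet = rootName("<set xmlns=\"http://www.w3.org/2000/svg\"/>");
        String[] mathSet = rootName("<m:set xmlns:m=\"http://www.w3.org/1998/Math/MathML\"/>");
        System.out.println(String.join(" | ", svgSet[0], svgSet[1], String.valueOf(svgSet[2])));
        System.out.println(String.join(" | ", mathSet[0], mathSet[1], String.valueOf(mathSet[2])));
    }
}
```

Both roots have the local name set, so only the namespace URI tells them apart. The prefix is null for the default namespace and carries no meaning of its own in the other case.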

Internationalisation

What does Text …
"Text" can be encoded according to many different alphabets
A character set is a mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
    so its encoding should be declared

The XML Encoding Declaration
Part of the XML declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
    Unicode and ISO are the most used families
Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
        it is then a text declaration, no longer an XML declaration
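The effect of the encoding declaration can be seen by handing the parser raw bytes rather than characters. In the sketch below the same text is serialised in two encodings, each declared in the XML declaration, and read back. The class name EncodingDemo and the helper roundTrip are hypothetical, and the sample text is just a string containing one non-ASCII character.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EncodingDemo {
    // One non-ASCII character (a with grave accent) is enough to make encodings differ
    static final String TEXT = "Universit\u00e0 di Bologna";

    /** Encodes a small document with the named encoding, declares it, and parses the bytes. */
    static String roundTrip(String encodingName) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + encodingName + "\"?>"
                   + "<name>" + TEXT + "</name>";
        byte[] bytes = xml.getBytes(Charset.forName(encodingName));
        try {
            // The parser gets bytes only: it relies on the declaration to decode them
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("ISO-8859-1").equals(TEXT));
        System.out.println(roundTrip("UTF-8").equals(TEXT));
    }
}
```

The byte sequences differ between the two runs, yet the application gets the same characters back, because each document carries its own encoding information.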



Setting Default

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tags

prefix x-58

Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue

it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portability

across platforms across applications across time

59

XML amp CSS

60

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Cascading Style Sheets (CSS)

a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basics

if not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tags

nometag attributo1 valore1 hellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sorts

through ad hoc APIThe most used hated deprecated widespread are

69

Document Object httpwwww3orgDOM

standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 26: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

XML Attributes

Elements can be labelled by attributes
    attributes are specified in the start tag
    and in the only tag of empty elements
    any number of attributes can in principle be associated with an element
An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before or after the equals (=) sign
    only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
    but they are qualifiers for its nodes

<player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
</player>

Using Elements or Attributes

Attributes are for meta-data about the element, and content is the information of the element
    maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure, elements can be nested as needed
    attributes are unique within an element

<player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
</player>

XML Names

XML Names are used, and are the same, for the names of elements, attributes and some other constructs
    to increase efficiency and abate complexity
An XML name can include
    any letter
        Latin or even non-Latin, like ideographs
    any digit
    underscore, hyphen and period (_ - .)
    a colon (:), which is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs or underscore

Parsed Character Data

An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    character data to parse starts again after the ; character
E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
    Batman & Robin

Entity References

&entityreference;
    an entity is something defined outside the normal flow of the XML document
        out of the XML tree
    used for constants, common values, external values, etc.
        through an entity reference
Users of any sort may define their own entities
    we'll see how soon, for instance through DTDs

Pre-defined XML Entities

Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote

CDATA Sections

Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good, for instance, to include segments of XML code to show
CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its closing delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
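
The two escaping mechanisms above can be compared with a small Java sketch. It assumes the JAXP DOM parser bundled with the JDK; the class name CharacterDataSketch and the helper rootText are hypothetical names introduced only for this illustration. It parses the superheroes element once with an entity reference and once with a CDATA section and reads back the text content of the root.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CharacterDataSketch {
    /** Parses a small XML string and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Same character data, written with an entity reference and with a CDATA section.
        String escaped = "<superheroes>Batman &amp; Robin</superheroes>";
        String cdata = "<superheroes><![CDATA[Batman & Robin]]></superheroes>";
        System.out.println(rootText(escaped)); // Batman & Robin
        System.out.println(rootText(cdata));   // Batman & Robin
    }
}
```

At the level of text content the two forms are indistinguishable; a DOM tree may still keep a separate CDATA section node unless the parser is configured to merge it with the surrounding text.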

Comments

Easy
    <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
    to give info to a computational agent: processing instructions

XML Processing Instructions

Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
Processing instructions have this very purpose
    <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root
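
As a hedged illustration of how an application gets at a processing instruction, the Java sketch below uses the JDK's JAXP DOM parser to find the first processing instruction at document level and return its target and data. The class name ProcessingInstructionSketch, the helper firstInstruction and the file name player.css are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

final class ProcessingInstructionSketch {
    /** Returns "target -> data" for the first document-level processing instruction, or null. */
    static String firstInstruction(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Processing instructions before or after the root are children of the document node.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    return pi.getTarget() + " -> " + pi.getData();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?><player/>";
        System.out.println(firstInstruction(xml));
    }
}
```

The data part is handed over as an uninterpreted string: what looks like attributes inside the instruction is a convention of the target application, not something the XML parser decodes.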

The XML Declaration

Looks like an XML processing instruction
    but it is not: it is just the XML declaration
It is optional
    but if present it must be the very first thing in the document
    not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the character encoding of the text (UTF-8, a Unicode encoding, in the example)
    optional, default Unicode (UTF-8 unless a byte-order mark signals UTF-16)
standalone="yes" means that the document does not rely on an external DTD
    optional, default no

Checking Well-Formedness

Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
Tools on the Web
    just look around
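
Any conforming parser already implements these rules, so a well-formedness check can be sketched by simply trying to parse. The Java sketch below assumes the SAX parser shipped with the JDK; the class name WellFormednessSketch and the method isWellFormed are hypothetical names chosen for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormednessSketch {
    /** Returns true if the parser accepts the text as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a fatal (well-formedness) error
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a><b></b></a>",        // properly nested
            "<a><b></a></b>",        // overlapping elements
            "<a/><b/>",              // two root elements
            "<a x=1/>",              // unquoted attribute value
            "<a x=\"1\" x=\"2\"/>",  // duplicated attribute
            "<a>1 < 2</a>"           // unescaped < in character data
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```

Only the first sample is accepted; each of the others breaks one of the rules listed above.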

DTD


Flexibility or

XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
    some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
    to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
    how to handle validity errors is optional

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
There are many things a DTD does not say
    we stick with what we can specify

DTD is…

SGML-based
    syntax a bit awkward
    but, after all, easy to understand
    and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD

We do not go too deep into DTD syntax
    we just look at the example below and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
</player>

DTD Declaration

In the example the DTD is declared as internal
    but it could be declared separately
        <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
        <!DOCTYPE player SYSTEM "http://…">
The name after DOCTYPE must match the root element of the document

DTD Declarations

So you may
    define your own DTD and
        either include it in your XML document
        or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
        like a working group you belong to, or a standardisation body of any sort
        by referring to that externally-defined syntax from your XML docs

Element Declarations

A player element contains one name, one surname and one or more teams: (name, surname, team+)
    in that precise order
    and they are just parsed character data (#PCDATA)

Some Syntax

, is for sequence
    to define ordered lists
| is for choice
    to provide for alternatives
Suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
Parentheses for grouping
    at any level of nesting
    operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
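
To see the content model at work, here is a minimal Java sketch of DTD validation, assuming the validating SAX parser that ships with the JDK. The class name DtdValidationSketch and the helper validityErrors are hypothetical; the DTD is the football-player one, with the DOCTYPE named player because a validating parser requires the DOCTYPE name to match the root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class DtdValidationSketch {
    // Internal DTD subset prepended to every document body checked below.
    static final String DTD =
        "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Returns the validity errors the parser reports for the given document body. */
    static List<String> validityErrors(String body) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // check against the DTD, not only well-formedness
            factory.newSAXParser().parse(
                new InputSource(new StringReader(DTD + body)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors.add(e.getMessage()); // validity errors are recoverable
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        String badValue = "<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current=\"maybe\">Bologna</team></player>";
        System.out.println("valid:     " + validityErrors(valid));
        System.out.println("no team:   " + validityErrors(noTeam));
        System.out.println("bad value: " + validityErrors(badValue));
    }
}
```

The first document yields an empty list; the other two are well-formed but invalid, so the parser keeps going and reports the violated declaration instead of aborting.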

Attribute

A team element has a current attribute
    which is mandatory
        #IMPLIED would say optional instead
    and can be either yes or no
        enumeration as an attribute type

<!ATTLIST team current (yes | no) #REQUIRED>

Attribute Defaults

#IMPLIED
    the attribute is optional
#REQUIRED
    the attribute is mandatory
#FIXED
    whether it is explicitly specified or not, it has a given value
literal
    the default value is the literal quoted string
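
The effect of these defaults can be observed from an application. The Java sketch below is only an illustration under stated assumptions: it uses the JDK's JAXP DOM parser, and the attributes country and note, as well as the class name AttributeDefaultsSketch and the helper parseRoot, are hypothetical additions to the team element. The document omits all three attributes and the parser fills in those that have a default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class AttributeDefaultsSketch {
    // Hypothetical declarations: a literal default, a #FIXED value and an #IMPLIED attribute.
    static final String XML =
        "<!DOCTYPE team [\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team\n"
        + "    current (yes | no) \"no\"\n"
        + "    country CDATA #FIXED \"IT\"\n"
        + "    note    CDATA #IMPLIED>\n"
        + "]>\n"
        + "<team>Mantova</team>";

    /** Parses the text and returns the root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element team = parseRoot(XML);
        System.out.println("current = " + team.getAttribute("current")); // default from the DTD
        System.out.println("country = " + team.getAttribute("country")); // fixed value from the DTD
        System.out.println("has note? " + team.hasAttribute("note"));    // implied, hence absent
    }
}
```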

Attribute Types

CDATA
    any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
    more permissive than an XML name: any name character is accepted as the first character
    the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
ID
    an XML name, unique in the document, working as an identifier
IDREF, IDREFS

Other DTD

ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
    who cares, actually
We stop here
    more only for those who need it

Namespaces


What are

Distinguish
    different XML applications may use the same names
        at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
Group
    names of elements and attributes of the same XML application can be grouped together
        to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names

Syntax for

Qualified names
    prefix:local_part
Examples of qualified names
    or QNames, or raw names
        rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes

Example
    a large firm could have a number of namespaces for different purposes
        <company xmlns:local="http://www.company.it/xml"
                 xmlns:euro="http://www.company.eu/xml"
                 xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere inside as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
URIs are standardised, not prefixes
    but usually svg, rdf and other conventional prefixes are not changed

Setting Default

xmlns attribute
    alone, no suffix
        <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the unprefixed elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
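
What a namespace-aware application sees is the pair of namespace URI and local name, not the prefix. The Java sketch below illustrates this with the JDK's JAXP DOM parser, which has to be switched to namespace-aware mode explicitly. The company URIs are the ones from the example above; the office element, the class name NamespaceSketch and the helper expandedNames are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NamespaceSketch {
    // Two elements with the same local name, told apart only by their namespace.
    static final String XML =
        "<company xmlns:local=\"http://www.company.it/xml\""
        + " xmlns:euro=\"http://www.company.eu/xml\">"
        + "<local:office/><euro:office/></company>";

    /** Returns "{namespace URI}local-name" for every child element of the root. */
    static List<String> expandedNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    names.add("{" + n.getNamespaceURI() + "}" + n.getLocalName());
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : expandedNames(XML)) {
            System.out.println(name);
        }
    }
}
```

Renaming the prefixes in the document while keeping the URIs would leave the output unchanged, which is the sense in which URIs, not prefixes, identify a namespace.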

Internationalisation


What does Text

"Text" can be encoded according to so many different alphabets
    mapping between characters and integers (code points)
        character set
    ASCII being the most (un)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
        UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
    so its encoding should be declared
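
The distinction between a code point and its encodings can be made concrete with a few lines of Java. This is only a sketch: the class name EncodingSketch and the helper byteCount are hypothetical, and the sample character is the à of "Università".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSketch {
    /** Number of bytes needed to write the text in the given encoding. */
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String aGrave = "\u00e0"; // the letter a with grave accent
        // One character, one code point ...
        System.out.println("code point: U+" + String.format("%04X", aGrave.codePointAt(0)));
        // ... but a different byte sequence in each encoding.
        System.out.println("UTF-8:      " + byteCount(aGrave, StandardCharsets.UTF_8) + " byte(s)");
        System.out.println("UTF-16BE:   " + byteCount(aGrave, StandardCharsets.UTF_16BE) + " byte(s)");
        System.out.println("ISO-8859-1: " + byteCount(aGrave, StandardCharsets.ISO_8859_1) + " byte(s)");
    }
}
```

A reader that assumes the wrong encoding decodes those bytes into different characters, which is why the declaration matters.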

The XML Encoding

Part of the XML declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
    Unicode and ISO are the most used families
Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
        it is a text declaration, but no longer an XML declaration

Multi-Lingual

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
    for multi-lingual docs
xml:lang attribute
    can be associated with any element
    determines the language of the element
Values are to be found in ISO 639
    standard two letters for each known language
    if not there, IANA
        prefix i-
        such as i-navajo, i-klingon, …
    if not there either, such as for user-defined tags
        prefix x-

Encoding for

Working around encoding is not simply an "internationalisation" issue
    it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
    which are often not easy to catch
XML's abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
    when trying to visualise an XML document
XML, however, can be transformed
    to become easier to handle by standard browsers
Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
In the following we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
    http://www.w3.org/Style/CSS/
Goals
    describing how to present elements of a document
        spanning over a range of different media
    separating style description from content and structure
In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning

[Figure: CSS: An Example]

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things are needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
Processing instruction
    to associate CSS with XML
        <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining the presentation style for the XML document tags
    tagname { property1: value1; … }

[Figure: XML + CSS Example: The XML Doc]

[Figure: Example: How Mozilla Visualises it]

DOM & SAX

Manipulating XML

Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
    access, delete, modify
parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
    through ad hoc APIs
The most used, hated, deprecated, widespread are DOM and SAX

Document Object Model

http://www.w3.org/DOM/
    standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
        primitive navigation and manipulation of XML trees
        other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
        adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states which kinds of node a tree can contain
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should or could have

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
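
A hedged sketch of these generic node properties and methods, using the org.w3c.dom interfaces bundled with the JDK. The class name DomNodeSketch and the helper buildPlayer are hypothetical. The tree is built in memory with appendChild and then explored through name, type, value and attributes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class DomNodeSketch {
    /** Builds <player><team current="yes">Bologna</team></player> in memory. */
    static Document buildPlayer() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element player = doc.createElement("player");
            doc.appendChild(player);
            Element team = doc.createElement("team");
            team.setAttribute("current", "yes");
            team.appendChild(doc.createTextNode("Bologna")); // text is a child node of its own
            player.appendChild(team);                        // appended after the other children
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node team = buildPlayer().getDocumentElement().getFirstChild();
        System.out.println("name:  " + team.getNodeName());
        System.out.println("type:  " + team.getNodeType() + " (ELEMENT_NODE is " + Node.ELEMENT_NODE + ")");
        System.out.println("value: " + team.getNodeValue()); // null for element nodes
        System.out.println("attr:  " + team.getAttributes().getNamedItem("current").getNodeValue());
        System.out.println("text:  " + team.getFirstChild().getNodeValue());
    }
}
```

Note that the value of an element node is null: its text lives in a separate text node, which is one of the places where the generic node interface feels less direct than the element-specific one.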

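A Simple Java DOM

// DOMParser is assumed to be the Xerces class org.apache.xerces.parsers.DOMParser;
// Document and Node come from org.w3c.dom, PrintStream from java.io.
// print(Node, PrintStream) is a helper that is not shown here.
public static void main(String[] args) {
    try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);
        Document doc = p.getDocument();
        // Skip children of the root until the first recipe element is found.
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
            n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
            print(n, out);
        out.println("</collection>");
    } catch (Exception e) {
        e.printStackTrace();
    }
}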

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree but as a text doc
    flowing through the SAX parser
    and generating events as soon as they occur: document started / ended, elements started / ended, character content, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not so well supported as DOM is
    in terms of standardisation
    as well as of tools
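
The event-based model can be sketched in Java with the SAX parser included in the JDK. The class name SaxEventSketch and the helper events are hypothetical. The handler just records each callback in order, and no tree is ever built.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventSketch {
    /** Feeds the document through a SAX parser and returns the events in arrival order. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks; blank chunks are skipped.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

The handler only sees one event at a time, so any context it needs (which element is currently open, for instance) has to be tracked by the application itself; that is the price paid for the low memory footprint.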

Page 27: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Using Elements or

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the two

Element-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore

28

Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsed

unless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

CommentsEasy

lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment on it

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
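
Validation against an internal DTD like this one can be tried out with the JDK parser. The sketch below is only an illustration under stated assumptions: the names DtdValidation and isValid are hypothetical, and the DOCTYPE is named player so that it matches the root element, which a validating parser requires. Validity errors are recoverable by default, so the error handler turns them into failures.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class DtdValidation {
    // Internal DTD subset modelled on the football player example.
    static final String DOCTYPE =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Parses DOCTYPE + body with DTD validation on; true only if no error is reported. */
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // switch on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // make validity errors abort the parse
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>"));
        System.out.println(isValid("<player><name>Carlo</name><surname>Nervo</surname></player>"));
    }
}
```

The second call in main drops every team element, which the content model (name, surname, team+) does not permit.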

DTD Declaration
The DTD is declared here as internal
    <!DOCTYPE player [ … ]>
  the name after DOCTYPE is the type of the root element
but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

DTD Declarations
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs

Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and each of them just contains parsed character data (#PCDATA)
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>

Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
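
To get a feeling for these operators, one can generate a tiny DTD on the fly and let a validating parser decide. This is a sketch under assumptions: ContentModelDemo, accepts, the root element r and the empty elements a, b, c are all hypothetical names introduced for the illustration, and the JDK's JAXP parser is used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class ContentModelDemo {
    /** True if a root element r holding the given children is valid for the content model. */
    static boolean accepts(String contentModel, String children) {
        String xml = "<!DOCTYPE r [<!ELEMENT r " + contentModel + ">"
                   + "<!ELEMENT a EMPTY><!ELEMENT b EMPTY><!ELEMENT c EMPTY>]>"
                   + "<r>" + children + "</r>";
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // validity error: the children do not match the model
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("(a, b)", "<a/><b/>"));            // sequence, right order
        System.out.println(accepts("(a, b)", "<b/><a/>"));            // sequence, wrong order
        System.out.println(accepts("((a | b)+, c)", "<b/><a/><c/>")); // grouping with a suffix
    }
}
```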

Attribute Declarations
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
    <!ATTLIST team current (yes | no) #REQUIRED>

Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
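
The effect of these defaults shows up in the parsed tree. The sketch below assumes the JDK's JAXP DOM parser, which reads the internal DTD subset even when validation is off; the class name AttributeDefaults, the squad element and the sport and coach attributes are hypothetical additions to the team example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class AttributeDefaults {
    static final String XML =
          "<!DOCTYPE squad [\n"
        + "  <!ELEMENT squad (team*)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\"\n"               // literal default
        + "                 sport   CDATA      #FIXED \"football\"\n"  // fixed value
        + "                 coach   CDATA      #IMPLIED>\n"            // optional, no default
        + "]>\n"
        + "<squad><team>Mantova</team></squad>";

    /** The first team element of the sample document, which specifies no attribute at all. */
    static Element firstTeam() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return (Element) doc.getElementsByTagName("team").item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String teamAttribute(String name) {
        return firstTeam().getAttribute(name); // empty string when the attribute is absent
    }

    static boolean teamHasAttribute(String name) {
        return firstTeam().hasAttribute(name);
    }

    public static void main(String[] args) {
        System.out.println("current = " + teamAttribute("current"));
        System.out.println("sport   = " + teamAttribute("sport"));
        System.out.println("coach present: " + teamHasAttribute("coach"));
    }
}
```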

Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character too
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to ID values of elements in the same document

Other DTD Declarations
ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces

What are Namespaces for
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespaces
Qualified names
    prefix:local_part
Examples of qualified names
  or QNames, or raw names
    rdf:description  xlink:type  xsl:template
Used for both element and attribute names

Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
Typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
URIs are standardised, prefixes are not
  but the customary prefixes (svg, rdf and others) are usually left unchanged
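
That the URI, and not the prefix, identifies a namespace can be seen with a namespace-aware DOM parser. The sketch assumes the JDK's JAXP implementation; PrefixDemo, count and the office element are hypothetical names used only for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class PrefixDemo {
    /** Counts the elements with the given local name in the given namespace. */
    static int count(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<company xmlns:local=\"http://www.company.it/xml\""
                   + " xmlns:euro=\"http://www.company.eu/xml\">"
                   + "<local:office/><euro:office/><euro:office/></company>";
        System.out.println(count(xml, "http://www.company.eu/xml", "office"));
        System.out.println(count(xml, "http://www.company.it/xml", "office"));
    }
}
```

A document that binds a different prefix to the same URI is matched in exactly the same way, since the lookup never mentions the prefix.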

Setting Default Namespaces
xmlns attribute
  alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
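
A short check of what a default namespace covers, again as a sketch on the JDK's namespace-aware DOM parser; DefaultNamespaceDemo and its two helper methods are hypothetical names. Note that, by the namespaces specification, the default namespace applies to elements only: unprefixed attributes stay in no namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DefaultNamespaceDemo {
    static final String SVG = "http://www.w3.org/2000/svg";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Namespace URI of the first element with the given (unprefixed) tag name. */
    static String namespaceOf(String xml, String tagName) {
        return parse(xml).getElementsByTagName(tagName).item(0).getNamespaceURI();
    }

    /** Namespace URI of an attribute of the root element; null means no namespace. */
    static String attributeNamespace(String xml, String attributeName) {
        return parse(xml).getDocumentElement().getAttributeNode(attributeName).getNamespaceURI();
    }

    public static void main(String[] args) {
        String xml = "<svg xmlns=\"" + SVG + "\" width=\"10\"><rect/></svg>";
        System.out.println(namespaceOf(xml, "svg"));
        System.out.println(namespaceOf(xml, "rect"));
        System.out.println(attributeNamespace(xml, "width"));
    }
}
```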

Internationalisation

What does Text Mean
"Text" can be encoded according to so many different alphabets
  character set: a mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so its encoding should be declared
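
The difference between a code point and its encodings can be made concrete with the standard Java charsets. A minimal sketch; EncodingDemo and byteLength are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    /** Number of bytes the text takes up in the given encoding. */
    static int byteLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8"; // LATIN SMALL LETTER E WITH GRAVE
        System.out.println("code point: " + eGrave.codePointAt(0));
        System.out.println("ISO-8859-1: " + byteLength(eGrave, StandardCharsets.ISO_8859_1) + " byte(s)");
        System.out.println("UTF-8:      " + byteLength(eGrave, StandardCharsets.UTF_8) + " byte(s)");
        System.out.println("UTF-16BE:   " + byteLength(eGrave, StandardCharsets.UTF_16BE) + " byte(s)");
    }
}
```

One code point, three byte sequences: this is why a receiver has to be told which encoding was used.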

The XML Encoding Declaration
Part of the XML Declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is then a text declaration, no longer an XML declaration
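
Because the declaration travels with the document, a parser fed with raw bytes can decode them without outside help. The sketch below assumes the JDK's JAXP parser; DeclaredEncoding, rootText and the city element are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class DeclaredEncoding {
    /** Parses raw bytes, leaving the choice of decoder to the parser, and returns the root's text. */
    static String rootText(byte[] document) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(document));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String body = "<city>Forl\u00EC</city>";
        byte[] latin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"utf-8\"?>" + body)
                .getBytes(StandardCharsets.UTF_8);
        // Different bytes on the wire, same text after parsing.
        System.out.println(latin1.length + " bytes vs " + utf8.length + " bytes");
        System.out.println(rootText(latin1).equals(rootText(utf8)));
    }
}
```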

Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
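
A tool such as the spell-checker mentioned above needs the language in force at a given node. Per the XML specification, xml:lang applies to the content of the element carrying it unless overridden further down, so a lookup can walk up the ancestors. The sketch assumes the JDK's namespace-aware DOM parser; LanguageLookup and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class LanguageLookup {
    /** Nearest xml:lang value on the node or one of its ancestors; null if none is set. */
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element) {
                Element e = (Element) n;
                if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                }
            }
        }
        return null;
    }

    /** Language in force at the first element with the given tag name. */
    static String languageOfFirst(String xml, String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so that xml:lang is seen as lang in the xml namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return languageOf(doc.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xml:lang=\"en\"><p>Hello</p>"
                   + "<q xml:lang=\"it\">Ciao <em>mondo</em></q></doc>";
        System.out.println(languageOfFirst(xml, "p"));  // inherited from doc
        System.out.println(languageOfFirst(xml, "em")); // inherited from q
    }
}
```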

Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
  defining presentation style for the XML document tags
    tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
We often need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
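
In the Java binding shipped with the JDK (package org.w3c.dom) these properties become getter methods. The sketch below is one possible tour, not the only API available; NodeTour, sample, describe and addTeam are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NodeTour {
    static Document sample() {
        try {
            String xml = "<player><team current=\"yes\">Bologna</team></player>";
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Name, type code and value: the three properties every node has. */
    static String describe(Node n) {
        return n.getNodeName() + " type=" + n.getNodeType() + " value=" + n.getNodeValue();
    }

    /** Builds a new team element and appends it after the existing children of the root. */
    static Element addTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.setTextContent(name);
        doc.getDocumentElement().appendChild(team);
        return team;
    }

    public static void main(String[] args) {
        Document doc = sample();
        Node team = doc.getDocumentElement().getFirstChild();
        System.out.println(describe(team));                                          // an element
        System.out.println(describe(team.getFirstChild()));                          // its text
        System.out.println(describe(team.getAttributes().getNamedItem("current")));  // its attribute
        addTeam(doc, "Mantova", "no");
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```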

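A Simple Java DOM Example

    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;  // Apache Xerces parser, not part of the JDK
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
        try {
            // parse the file named on the command line into a DOM tree
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // scan the children of the root for the first recipe element
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            // wrap it into a new collection document on standard output
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);  // helper writing out the subtree; not shown on the slide
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }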

Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has had problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as it goes: document started / ended, element started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM
  in terms of standardisation
  as well as of tools
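
The event flow can be observed by registering a handler that just records what it is told. A minimal sketch on the SAX parser bundled with the JDK; SaxEvents and eventsOf are hypothetical names, and whitespace-only text is skipped to keep the trace short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEvents {
    /** Parses the document and returns the sequence of events the parser fired. */
    static List<String> eventsOf(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { events.add("startDocument"); }
            @Override public void endDocument() { events.add("endDocument"); }
            @Override public void startElement(String uri, String localName, String qName,
                                               Attributes attributes) {
                events.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) events.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : eventsOf("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: once an event has been delivered the parser forgets about it, which is where the memory saving comes from.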


XML Names
XML names are used, and follow the same rules, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved for namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore
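
These naming rules are enforced by DOM implementations, so a quick way to test a candidate name is to ask a document to create an element with it. The sketch relies on the JDK's DOM implementation rejecting illegal names with a DOMException, as the DOM specification requires; NameCheck and isXmlName are hypothetical names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

final class NameCheck {
    /** True if the DOM implementation accepts the candidate as an element name. */
    static boolean isXmlName(String candidate) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.createElement(candidate); // throws on an illegal name
            return true;
        } catch (DOMException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] candidates = { "student-name", "_id", "name.first", "team2", "2team", "-x", "my name" };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isXmlName(candidate));
        }
    }
}
```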

Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g. the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
  becomes the parsed character data
    Batman & Robin

Entity References
    &entityreference;
An entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
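
As a preview of user-defined entities, the sketch below declares one in an internal DTD subset and lets the JDK's DOM parser expand the reference; EntityDemo, expand, the note element and the school entity are hypothetical names. A reference to an entity that was never declared is a well-formedness error, reported here as null.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class EntityDemo {
    /** Text of the root element with entity references expanded; null if the document does not parse. */
    static String expand(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrows fatal errors, prints nothing
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String declared = "<!DOCTYPE note [<!ENTITY school \"Alma Mater Studiorum\">]>"
                        + "<note>&school; - Bologna</note>";
        System.out.println(expand(declared));
        System.out.println(expand("<note>&school;</note>")); // no declaration in sight
    }
}
```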

Pre-defined XML Entities

Markup   Entity   Description
&lt;     <        less-than
&gt;     >        greater-than
&amp;    &        ampersand
&quot;   "        double quote
&apos;   '        single quote
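
When XML is produced by a program, the five references in the table are typically applied by a small escaping routine. A minimal sketch, with the hypothetical names XmlEscape, escape and rootText; the ampersand has to be replaced first, otherwise the & of the other references would be escaped a second time.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

final class XmlEscape {
    /** Replaces the five special characters with the pre-defined entity references. */
    static String escape(String text) {
        return text.replace("&", "&amp;")   // must come first
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("'", "&apos;");
    }

    /** Parses the document and returns the text of its root element. */
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String code = "if (a < b && c > \"d\") 'ok'";
        String xml = "<code>" + escape(code) + "</code>";
        System.out.println(xml);
        System.out.println(rootText(xml).equals(code)); // the parser undoes the escaping
    }
}
```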

CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own closing delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not
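
The equivalence between a CDATA section and escaped text can be checked with the JDK's DOM parser. In this sketch the factory is asked to coalesce, that is to turn CDATA sections into ordinary text and merge them with their neighbours; CdataDemo and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class CdataDemo {
    static Element root(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(true); // CDATA sections become plain text nodes
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String rootText(String xml) {
        return root(xml).getTextContent();
    }

    static int rootChildCount(String xml) {
        return root(xml).getChildNodes().getLength();
    }

    public static void main(String[] args) {
        String viaCdata = "<code><![CDATA[if (a < b && b > c) run();]]></code>";
        String viaEscapes = "<code>if (a &lt; b &amp;&amp; b &gt; c) run();</code>";
        System.out.println(rootText(viaCdata));
        System.out.println(rootText(viaCdata).equals(rootText(viaEscapes)));
    }
}
```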

Comments
Easy
    <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or another comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to a computational agent, use processing instructions

XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions serve this very purpose
    <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
A processing instruction is markup, but not an element
  it can appear anywhere outside a tag, even before or after the root
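
An application receives a processing instruction as a target plus an uninterpreted data string. The sketch below, on the JDK's DOM parser, fetches the first one found at the top level of a document; StylesheetPi, firstInstruction and the file name player.css are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

final class StylesheetPi {
    /** First processing instruction among the top-level nodes of the document, or null. */
    static ProcessingInstruction firstInstruction(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    return (ProcessingInstruction) n;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ProcessingInstruction pi = firstInstruction(
                "<?xml version=\"1.0\"?>"
              + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
              + "<player/>");
        System.out.println(pi.getTarget());
        System.out.println(pi.getData()); // left to the target application to interpret
    }
}
```

The pseudo-attributes inside the data are not parsed by the XML parser; splitting them is up to whoever handles the target.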


Page 29: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Parsed Character An XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsed

unless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

CommentsEasy

lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
    everything not permitted is forbidden
    that is, a DTD specifies positive examples
Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
There are many things a DTD does not say
    we stick with what we can specify

39

DTD is…
SGML-based
    syntax a bit awkward
    but, after all, easy to understand
    and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them and how to write your own is likely to remain useful, or maybe necessary, for a while still

40

A Simple DTD

We do not go too deep into DTD syntax
    we just look at the example below and comment on it

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE football_player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

41

DTD Declaration

In the example, the DTD is declared as internal
    but it could be declared separately
        <!DOCTYPE football_player SYSTEM "football_player.dtd">
    even referring to an external shared resource
        <!DOCTYPE football_player SYSTEM "http://…">


42

DTD Declarations

So you may
    define your own DTD, and
        either include it in your XML document
        or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
        like a working group you belong to, or a standardisation body of any sort
        by referring to that externally-defined syntax from your XML docs

43

Element Declarations

A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)

    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>

44

Some Syntax
, is for sequence
    to define ordered lists
| is for choice
    to provide for alternatives
suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
parentheses for grouping
    at any level of nesting
    operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
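The operators above behave much like regular-expression operators applied to the sequence of child elements. The sketch below is an illustration of that analogy, not a DTD validator: it translates a content model into a Python regular expression and matches it against the child names of an element. The function names are invented here, and #PCDATA, ANY and EMPTY are deliberately not handled.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model):
    """Translate a content model such as '(name, surname, team+)' into a regex."""
    # Every element name becomes a group matching "name;", so that the DTD
    # suffixes (*, +, ?), the choice bar and the parentheses keep the meaning
    # they already have in regular expressions.
    pattern = re.sub(r"[A-Za-z_][\w.-]*",
                     lambda m: "(?:" + re.escape(m.group(0)) + ";)",
                     model)
    # The sequence comma and the whitespace carry no regex meaning: drop them.
    return re.compile(re.sub(r"[,\s]", "", pattern))

def children_match(element, model):
    """True if the child elements, in document order, fit the content model."""
    child_names = "".join(child.tag + ";" for child in element)
    return model_to_regex(model).fullmatch(child_names) is not None

MODEL = "(name, surname, team+)"
good = ET.fromstring("<player><name>Carlo</name><surname>Nervo</surname>"
                     "<team>Bologna</team><team>Mantova</team></player>")
no_team = ET.fromstring("<player><name>Carlo</name><surname>Nervo</surname></player>")
swapped = ET.fromstring("<player><surname>Nervo</surname><name>Carlo</name>"
                        "<team>Bologna</team></player>")
print(children_match(good, MODEL), children_match(no_team, MODEL),
      children_match(swapped, MODEL))
```

A real validating parser does much more (attributes, mixed content, entities), so this is only meant to make the sequence, choice and suffix operators concrete.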

45

Attribute Declarations
A team element has a current attribute
    which is mandatory (#REQUIRED)
        #IMPLIED would say optional instead
    and can be either yes or no
        enumeration as an attribute type

    <!ATTLIST team current (yes | no) #REQUIRED>

46

Attribute Defaults

#IMPLIED
    the attribute is optional
#REQUIRED
    the attribute is mandatory
#FIXED
    whether it is explicitly specified or not, it has a given value
literal
    the default value is the literal quoted string
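A small sketch of how defaults show up in practice, using Python's standard expat binding, which reports attributes derived from declarations in the internal DTD unless told otherwise. The DTD below is a variation invented for this example (the league and sport attributes and the "no" default are not in the football-player example), and the function name is an assumption.

```python
from xml.parsers import expat

# Hypothetical variation: a literal default, an #IMPLIED and a #FIXED attribute.
DOC = """<!DOCTYPE player [
  <!ELEMENT player (team+)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) "no"
                 league  CDATA      #IMPLIED
                 sport   CDATA      #FIXED "football">
]>
<player><team current="yes">Bologna</team><team>Mantova</team></player>"""

def team_attributes(xml_text):
    """Collect the attributes the parser reports for every team element."""
    seen = []

    def start_element(name, attrs):
        if name == "team":
            seen.append(dict(attrs))

    parser = expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return seen

bologna, mantova = team_attributes(DOC)
print(bologna)
print(mantova)
```

The second team element specifies nothing, yet the application sees the literal default and the fixed value; the #IMPLIED attribute is simply absent.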

47

Attribute Types
CDATA
    any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
    more liberal than an XML name: any name character is accepted as the first character
    the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
ID
    an XML name, unique in the document, working as an identifier
IDREF, IDREFS

48

Other DTD Declarations
ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
    who cares, actually?
We stop here
    more only for those who need it

49

Namespaces

50

What are Namespaces for?
Distinguish
    different XML applications may use the same names
        at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
Group
    names of elements and attributes of the same XML application can be grouped together
        to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names

51

Syntax for Namespaces
Qualified names
    prefix:local_part
Examples of qualified names
    also called QNames, or raw names
    rdf:description, xlink:type, xsl:template
Used for both element and attribute names

52

Associating Prefixes
Example
    a large firm could have a number of namespaces for different purposes

    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">

    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
URIs are standardised, not prefixes
    but usually svg, rdf and other customary prefixes are used as they are
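To see that it is the URI, not the prefix, that identifies a namespace, the sketch below parses a small document with Python's standard xml.etree.ElementTree, which expands every qualified name to {URI}local_part. The office elements and the it prefix used in the lookup are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical content for the company example: same local name, two namespaces.
DOC = """<company xmlns:local="http://www.company.it/xml"
         xmlns:euro="http://www.company.eu/xml">
  <local:office>Cesena</local:office>
  <euro:office>Brussels</euro:office>
</company>"""

root = ET.fromstring(DOC)

# The parser replaces each prefix with the URI it was bound to.
tags = [child.tag for child in root]
print(tags)

# Any prefix can be used for lookups, as long as it maps to the same URI.
italian_office = root.find("it:office", {"it": "http://www.company.it/xml"})
print(italian_office.text)
```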

53

Setting Default Namespaces
xmlns attribute
    alone, with no :prefix suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
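A short sketch of the effect of a default namespace, again with Python's standard xml.etree.ElementTree. The rect child and the attribute values are made up for the example. Note that the default namespace applies to element names only: unprefixed attributes such as width stay outside any namespace.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

root = ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="5"><rect/></svg>'
)

# Both the svg element and its unprefixed child end up in the SVG namespace.
element_names = [root.tag, root[0].tag]
# Unprefixed attributes are not affected by the default namespace.
attribute_names = sorted(root.attrib)
print(element_names, attribute_names)
```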

54

Internationalisation

55

What does Text Mean?
"Text" can be encoded according to so many different alphabets
    character set: a mapping between characters and integers (code points)
    ASCII being the most (in)famous; now Unicode
A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
    so its encoding should be declared
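The difference between a character set and an encoding can be made concrete in a few lines of Python. The sample word is arbitrary; the sketch shows the same ten code points turned into three different byte sequences.

```python
# Arbitrary sample text whose last character lies outside ASCII.
text = "Universit\u00e0"

# Character set: each character corresponds to one integer (code point).
code_points = [ord(ch) for ch in text]

# Encodings: the same code points mapped onto bytes in different ways.
utf8 = text.encode("utf-8")          # 1 byte for ASCII, 2 for the last character
utf16 = text.encode("utf-16-be")     # 2 bytes per character here
latin1 = text.encode("iso-8859-1")   # 1 byte per character

print(code_points[-1], len(utf8), len(utf16), len(latin1))
```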

56

The XML Encoding Declaration
Part of the XML declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
    see also XML-Defined Character Sets
    Unicode and ISO are the most used families
Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
        it is then a text declaration, no longer an XML declaration
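A brief sketch of why the declaration matters: the two byte sequences below differ, but each one declares its own encoding, so a parser (here Python's standard xml.etree.ElementTree) recovers the same text from both. The name element and its content are invented for the example.

```python
import xml.etree.ElementTree as ET

# The same hypothetical document, serialised with two different encodings.
latin1_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?><name>Universit\u00e0</name>'
).encode("iso-8859-1")
utf8_bytes = (
    '<?xml version="1.0" encoding="utf-8"?><name>Universit\u00e0</name>'
).encode("utf-8")

# The parser reads the encoding from each declaration before decoding the rest.
from_latin1 = ET.fromstring(latin1_bytes).text
from_utf8 = ET.fromstring(utf8_bytes).text
print(from_latin1 == from_utf8)
```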

57

Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
    for multi-lingual docs
xml:lang attribute
    can be associated with any element
    determines the language of the element
Values are to be found in ISO 639
    standard two-letter codes for each known language
    if not there, IANA
        prefix i-
        such as i-navajo, i-klingon, …
    if not there either, as for user-defined tags
        prefix x-

58
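A sketch of how a tool such as a spell-checker might work out the language of each element, written with Python's standard xml.etree.ElementTree. It assumes the usual reading of xml:lang, where an element without the attribute takes the language of its closest ancestor that has one. The sample document and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The xml prefix is always bound to this namespace; no declaration is needed.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def languages(element, inherited=None):
    """Yield (tag, language) pairs, letting children inherit xml:lang."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)

# Hypothetical multi-lingual document.
DOC = ('<doc xml:lang="en"><p>Hello</p>'
       '<p xml:lang="it">Ciao <em>mondo</em></p></doc>')

result = list(languages(ET.fromstring(DOC)))
print(result)
```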

Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
    it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
    which are often not easy to catch
XML's abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time

59

XML & CSS

60

XML on Browsers

Different experiences with different browsers
    when trying to visualise an XML document
XML, however, can be transformed
    to become easier to handle by standard browsers
Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
In the following, we explore the XML + CSS approach

61

Cascading Style Sheets
Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
    http://www.w3.org/Style/CSS
Goals
    describing how to present elements of a document
        spanning over a range of different media
    separating style description from content and structure
In this course, we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning

62

CSS: An Example

63

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
Processing instruction
    to associate CSS with XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining the presentation style for the XML document tags
    tagname { property1: value1; … }

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM & SAX

68

Manipulating XML
Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
Most often, we need to manipulate (access, delete, modify)
    parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
    through ad hoc APIs
The most used / hated / deprecated / widespread are
    DOM and SAX

69

Document Object Model
http://www.w3.org/DOM
    W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope

70

DOM & Levels
DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

71

DOM Nodes

An XML document is a tree
The tree contains nodes
    one of them is the root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

72

Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java

73
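The same general properties and methods are available in other language bindings too. As an illustration only, the sketch below uses Python's standard xml.dom.minidom on a cut-down version of the football-player document: it reads the name, type and attributes of a node, then appends a new child to the root.

```python
from xml.dom import minidom

doc = minidom.parseString('<player><team current="yes">Bologna</team></player>')
root = doc.documentElement
team = root.firstChild

# General properties: every node has a name and a type.
name, node_type = team.nodeName, team.nodeType
# attributes returns all the attributes of the node.
attrs = {attr.name: attr.value for attr in team.attributes.values()}

# appendChild adds the new node after the other child nodes.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
root.appendChild(new_team)

result = root.toxml()
print(result)
```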

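A Simple Java DOM Example

    // Needs java.io.PrintStream, org.w3c.dom.Document and org.w3c.dom.Node;
    // DOMParser is assumed to be the Xerces class org.apache.xerces.parsers.DOMParser.
    // print(Node, PrintStream) is a helper defined elsewhere.
    public static void main(String[] args) {
      try {
        // Parse the file named on the command line into a DOM tree
        DOMParser p = new DOMParser();
        p.parse(args[0]);
        Document doc = p.getDocument();
        // Look for the first "recipe" child of the root element
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
          n = n.getNextSibling();
        // Wrap the recipe found, if any, in a new collection document
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
          print(n, out);
        out.println("</collection>");
      } catch (Exception e) {
        e.printStackTrace();
      }
    }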

74

Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist

75

Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as they occur: document started / ended, element started / ended, character content, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not as well-supported as DOM is
    in terms of standardisation
    as well as of tools
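A minimal sketch of the event-based model, using Python's standard xml.sax module. The handler class and the sample document are invented for illustration: the handler never sees a tree, it only reacts to events as the text flows through the parser, and keeps just the data it cares about (here, the team names).

```python
import xml.sax

class TeamCollector(xml.sax.ContentHandler):
    """Record the event stream and collect the text of every team element."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.teams = []
        self._in_team = False
        self._buffer = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)
        if name == "team":
            self._in_team = True
            self._buffer = []

    def characters(self, content):
        # Character content may arrive in several chunks: accumulate it.
        if self._in_team:
            self._buffer.append(content)

    def endElement(self, name):
        self.events.append("end " + name)
        if name == "team":
            self.teams.append("".join(self._buffer))
            self._in_team = False

handler = TeamCollector()
xml.sax.parseString(
    b"<player><name>Carlo</name><team>Bologna</team><team>Mantova</team></player>",
    handler,
)
print(handler.teams)
```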

76


Entity References
&entityreference;
    an entity is something defined outside the normal flow of the XML document
        out of the XML tree
    used for constants, common values, external values, etc.
        through an entity reference
Users of any sort may define their own entities
    we'll see how soon: for instance, through DTDs

30

Pre-defined XML Entities

Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
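A quick sketch of the pre-defined entities at work, using the escape and unescape helpers from Python's standard xml.sax.saxutils (by default they handle &, < and >). The sample string is arbitrary; the point is that the parser hands back the original characters.

```python
from xml.sax.saxutils import escape, unescape
import xml.etree.ElementTree as ET

# Arbitrary text containing characters that cannot appear unescaped.
raw = 'if (a < b && b > c) print("R&D")'

escaped = escape(raw)
element = ET.fromstring("<code>" + escaped + "</code>")

print(escaped)
print(element.text)
```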

31

CDATA Sections
Including code chunks from any language with < or & can be tedious
    we need to tell the parser: do not parse this
    good, for instance, to include segments of XML code to show
CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own end delimiter
After parsing, there is no way to tell whether a text came from a CDATA section or not
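The last point can be checked directly. In the sketch below (Python's standard xml.etree.ElementTree, with a made-up code element), the same text is written once in a CDATA section and once with entity references; after parsing, the application sees identical content.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<code><![CDATA[if (a < b && b > c) {}]]></code>")
with_entities = ET.fromstring("<code>if (a &lt; b &amp;&amp; b &gt; c) {}</code>")

# Both forms yield the same character data.
print(with_cdata.text == with_entities.text)
```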

32

Comments
Easy
    <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
    to give information to a computational agent, use processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone else

like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute value

NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFS

48

Other DTD

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

50

What are Distinguish

different XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes Example

a large firm could have a number of namespaces for different purposes

ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not

53

Setting Default

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tags

prefix x-58

Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue

it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portability

across platforms across applications across time

59

XML amp CSS

60

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Cascading Style Sheets (CSS)

a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basics

if not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tags

nometag attributo1 valore1 hellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sorts

through ad hoc APIThe most used hated deprecated widespread are

69

Document Object httpwwww3orgDOM

standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 31: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Pre-defined XML

Markup Entity Description

amplt lt less-then

ampgt gt grater-than

ampamp amp ampersand

ampquot double quote

ampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tedious

we need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

CommentsEasy

lt-- Comment --gtIt cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Need to pass information for a given application through the parser

comments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an element

it can appear everywhere out of a tag even before or after the root

34

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

In the example, the DTD is declared as internal
but it could be declared separately

    <!DOCTYPE player SYSTEM "football_player.dtd">

even referring to an external shared resource

    <!DOCTYPE player SYSTEM "http://…">


42

DTD Declarations

So you may

define your own DTD and
either include it in your XML document
or save it as an independent document and refer to it from one or more XML docs

or use an external DTD defined by someone else
like a working group you belong to, or a standardisation body of any sort
by referring to that externally-defined syntax from your XML docs

43

Element Declarations

A player element contains one name, one surname, and one or more teams
in that precise order
and they are just parsed character data (#PCDATA)


44

Some Syntax

, is for sequence
to define ordered lists

| is for choice
to provide for alternatives

suffixes
* for zero or more occurrences
+ for one or more occurrences
? for zero or one occurrence

parentheses for grouping
at any level of nesting
operators and suffixes applicable to any level

ANY for free-form content
EMPTY for empty elements

45

Attribute

A team element has a current attribute
which is mandatory (#REQUIRED)
#IMPLIED would say optional instead
and can be either yes or no
enumeration as an attribute type


46

Attribute Defaults

#IMPLIED
the attribute is optional

#REQUIRED
the attribute is mandatory

#FIXED
whether or not it is explicitly specified, it has a given value

literal
the default value is the literal quoted string
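
The effect of attribute defaults is visible after parsing. This is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name AttributeDefaults and the sport and note attributes are hypothetical additions to the football player example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo: the parser supplies defaulted and fixed attribute values.
class AttributeDefaults {
    static final String DOC =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (team+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\"\n"              // literal default
        + "                 sport   CDATA #FIXED \"football\"\n"      // fixed value
        + "                 note    CDATA #IMPLIED>\n"                // optional, no default
        + "]>\n"
        + "<player><team>Mantova</team></player>";

    // Parses DOC and returns its first team element.
    static Element firstTeam() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)));
            return (Element) doc.getElementsByTagName("team").item(0);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element team = firstTeam();
        System.out.println(team.getAttribute("current")); // default from the DTD
        System.out.println(team.getAttribute("sport"));   // fixed value from the DTD
        System.out.println(team.hasAttribute("note"));    // implied and absent
    }
}
```

The team element carries no attributes in the source text, yet the parsed tree exposes current and sport with the values declared in the DTD.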

47

Attribute Types

CDATA
any string of text acceptable in a well-formed XML attribute value

NMTOKEN, NMTOKENS
like an XML name, but any name character is accepted as the first character
the plural form accepts more than one, separated by whitespace

ENTITY, ENTITIES
name(s) of unparsed entities declared elsewhere in the document

ID
an XML name, unique in the document, working as an identifier

IDREF, IDREFS
reference(s) to ID values used elsewhere in the document

48

Other DTD

ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">

NOTATION declarations
who cares, actually

We stop here
more only for those who need it

49

Namespaces

50

What are Namespaces For

Distinguish
different XML applications may use the same names
at any scale, from personal to world-wide
a namespace allows them to be clearly distinguished

Group
names of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handled

Example: set is an element in both SVG and MathML applications
what if I have to use them together?
namespaces can be used to disambiguate names

51

Syntax for Namespaces

Qualified names (also QNames, or raw names)
    prefix:local_part

Examples of qualified names
    rdf:description, xlink:type, xsl:template

Used for both element and attribute names

52

Associating Prefixes

Example
a large firm could have a number of namespaces for different purposes

    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">

then you can use local, euro and world everywhere as prefixes
typically declared in the topmost element, but could be declared anywhere
example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">

URIs are standardised, prefixes are not
but conventional prefixes such as svg and rdf are usually kept

53

Setting Default Namespaces

xmlns attribute
alone, no suffix

    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>

all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
no need to make the svg prefix explicit
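
Prefixed and default namespaces can be observed through a namespace-aware parser. This is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name NamespaceDemo and the office element are hypothetical, while the company URIs follow the example on prefixes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo: same local name, two different namespaces.
class NamespaceDemo {
    static final String DOC =
          "<company xmlns=\"http://www.company.com/xml\"\n"
        + "         xmlns:local=\"http://www.company.it/xml\">\n"
        + "  <local:office>Cesena</local:office>\n"
        + "  <office>Worldwide</office>\n"
        + "</company>";

    // Lists every element whose local name is "office", with its namespace URI.
    static List<String> officeNamespaces() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)));
            NodeList offices = doc.getElementsByTagNameNS("*", "office");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < offices.getLength(); i++) {
                Node n = offices.item(i);
                result.add(n.getNodeName() + " -> " + n.getNamespaceURI());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : officeNamespaces()) {
            System.out.println(line);
        }
    }
}
```

The prefixed element resolves to the URI bound to local, while the unprefixed one falls into the default namespace set by the plain xmlns attribute.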

54

Internationalisation

55

What does Text

“Text” can be encoded according to many different alphabets
a character set is a mapping between characters and integers (code points)
ASCII being the most (in)famous, now Unicode

A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodings
UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text document
so encoding should be declared
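
The difference between a code point and its encodings can be made concrete. This is a minimal sketch using only the Java standard library; the class name EncodingDemo and the sample characters are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: one code point, several byte encodings.
class EncodingDemo {
    // Number of bytes needed to encode the text in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8"; // the letter e with grave accent, code point U+00E8
        String euro = "\u20AC";   // the euro sign, code point U+20AC

        System.out.println(eGrave.codePointAt(0));                          // 232
        System.out.println(byteLength(eGrave, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(byteLength(eGrave, StandardCharsets.UTF_8));      // 2
        System.out.println(byteLength(eGrave, StandardCharsets.UTF_16BE));   // 2
        System.out.println(byteLength(euro, StandardCharsets.UTF_8));        // 3
        System.out.println(byteLength(euro, StandardCharsets.UTF_16BE));     // 2
    }
}
```

The code point is the same in every case; only the bytes on disk change, which is why a document has to say which encoding it uses.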

56

The XML Encoding Declaration

Part of the XML declaration

    <?xml version="1.0" encoding="utf-8" standalone="no"?>

Most common values
utf-8, utf-16 (Unicode)
ISO-8859-1 (Latin-1)

See also XML-Defined Character Sets
Unicode and ISO are the most used families

Used also for external parsed entities
like DTD fragments or XML chunks
which may have different encodings
there, version may be dropped
it is then a text declaration, no longer an XML declaration

57

Multi-Lingual

Example: a spell-checker or a voice-reader parsing an XML doc

How to determine the language of a subpart?
for multi-lingual docs

xml:lang attribute
can be associated to any element
determines the language of the element

Values are to be found in ISO 639
standard two letters for each known language
if not there, IANA
prefix i-
such as i-navajo, i-klingon, …
if not there either, such as for user-defined tags
prefix x-

58
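
A tool such as the spell-checker mentioned above could look up the language in effect for an element. This is a minimal sketch, assuming the JAXP parser bundled with the JDK and relying on the XML rule that xml:lang also applies to the content of the element that carries it; the class name XmlLangDemo and the sample document are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo: find the language that applies to an element.
class XmlLangDemo {
    static final String DOC =
        "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\">Ciao</p></doc>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to look up xml:lang by namespace
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Walks up from the element to the nearest ancestor-or-self carrying xml:lang.
    static String languageOf(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
            }
        }
        return ""; // no language declared anywhere on the path to the root
    }

    public static void main(String[] args) {
        Document doc = parse(DOC);
        Element first = (Element) doc.getElementsByTagName("p").item(0);
        Element second = (Element) doc.getElementsByTagName("p").item(1);
        System.out.println(languageOf(first));  // inherited from doc
        System.out.println(languageOf(second)); // declared on the element itself
    }
}
```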

Encoding for Portability

Working around encoding is not simply an “internationalisation” issue
it is also about portability

When transmitting or communicating through text-based files, many errors typically occur
which are often not easy to catch

XML's abilities to
handle encoding precisely and accurately
embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
across platforms, across applications, across time

59

XML amp CSS

60

XML on Browsers

Different experiences with different browsers
when trying to visualise an XML document

XML, however, can be transformed
to become easier to handle by standard browsers

Two main approaches
Web-based one: XML + CSS
XML-based one: XSL

In the following, we explore the XML + CSS approach

61

Cascading Style Sheets

Cascading Style Sheets (CSS)
a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents

Standard W3C
http://www.w3.org/Style/CSS/

Goals
describing how to present elements of a document
spanning over a range of different media
separating style description from content and structure

In this course we assume that you already know the basics
if not, look at http://www.w3.org/Style/CSS/learning

62

CSS: An Example

63

XML + CSS

Any XML document can be prepared for browser visualisation via CSS

Two things needed
a CSS style sheet referring to the proper element types of the XML document
the association between the XML document and the CSS style sheet

Processing instruction
to associate CSS to XML

    <?xml-stylesheet type="text/css" href="filename.css"?>

CSS style sheet defining presentation style for the XML document tags

    tagname { property1: value1; … }

64

XML + CSS Example: The XML Doc

65

Example: How Mozilla Visualises It

66

Example: How Mozilla Visualises It

67

DOM amp SAX

68

Manipulating XML

Representing information in an XML document
and presenting it somehow
is not enough for most non-trivial application scenarios

Often we need to manipulate (access, delete, modify)
parts of an XML document
which may or may not be an XML file

This is typically done through programming languages of many sorts
through ad hoc APIs

The most used / hated / deprecated / widespread are DOM and SAX

69

Document Object Model

http://www.w3.org/DOM/
standard W3C, as usual

“The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents”

It applies to HTML as well as XML

It is essentially an API
standardised for Java & ECMAScript
but can be extended to other languages

There is no time here to go deep into DOM
we just try to understand its nature, goals and scope

70

DOM & Levels

DOM views an XML tree as a data structure
similar to the DOM from JavaScript

DOM loads the whole XML document in memory to manipulate it
possibly huge memory consumption

It is quite large and complex…

Level 1 Core: W3C Recommendation, October 1998
primitive navigation and manipulation of XML trees
other Level 1 parts: HTML

Level 2 Core: W3C Recommendation, November 2000
adds Namespace support and minor new features

71

DOM Nodes

An XML document is a tree

The tree contains nodes
one of them is a root node
nodes possibly have siblings, children, one parent, content, tag, etc.

The DOM specification states that a node can be a
document, document fragment, document type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation

It also defines which kinds of child nodes each of them should or could have

72

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree

Every DOM node has a name, a value, a type

There are general properties and methods for all kinds of nodes
attributes returns all the attributes of the node
appendChild(newChild) appends newChild after the other child nodes

Then, any specific kind of node has its own specific properties and methods

These properties and methods are made available by the suitable API for the language of choice
many solutions for Java

73
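
The general node properties and methods listed above map directly onto the Java binding of DOM. This is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name NodeProperties and the helper summary are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo: name, value and type of nodes, plus appendChild.
class NodeProperties {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Every node exposes a name, a value (possibly null) and a numeric type.
    static String summary(Node n) {
        return n.getNodeName() + " / " + n.getNodeValue() + " / type " + n.getNodeType();
    }

    public static void main(String[] args) {
        Document doc = parse("<player><team current=\"yes\">Bologna</team></player>");
        Element team = (Element) doc.getElementsByTagName("team").item(0);

        System.out.println(summary(team));                                       // element node
        System.out.println(summary(team.getAttributes().getNamedItem("current"))); // attribute node
        System.out.println(summary(team.getFirstChild()));                       // text node

        // appendChild adds the new node after the existing children.
        Element extra = doc.createElement("team");
        extra.setAttribute("current", "no");
        extra.setTextContent("Mantova");
        doc.getDocumentElement().appendChild(extra);
        System.out.println(doc.getDocumentElement().getLastChild() == extra);
    }
}
```

Element nodes have no value of their own (getNodeValue returns null); their text lives in child text nodes.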

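A Simple Java DOM Example

    // DOMParser is presumably the Apache Xerces class, which must be on the classpath
    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // scan the children of the root for the first recipe element
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out); // print(Node, PrintStream) is a helper not shown on the slide
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }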

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to manage
wouldn't it be better if we could load only the part we are actually manipulating?

This is the motivation behind SAX
which did not start as a standard
has problems of acceptance
but has indeed a long tail of followers
and also its good reasons to exist

75

Simple API for XML

Differently from DOM, SAX is event-based

It sees the document not as a tree, but as a text doc
flowing through the SAX parser
and generating events as they occur: document started / ended, elements started / ended, character content, etc.

A very simple model
good for simple applications
and also to avoid memory abuse

Not so well-supported as DOM is
in terms of standardisation
as well as of tools
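
The event-based model can be seen by logging the callbacks a SAX parser fires. This is a minimal sketch, assuming the SAX parser bundled with the JDK; the class name SaxEventLog and the log format are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: record SAX events in the order they are generated.
class SaxEventLog {
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override public void startDocument() { log.add("startDocument"); }
                @Override public void endDocument() { log.add("endDocument"); }
                @Override public void startElement(String uri, String localName,
                                                   String qName, Attributes atts) {
                    log.add("start " + qName);
                }
                @Override public void endElement(String uri, String localName, String qName) {
                    log.add("end " + qName);
                }
                @Override public void characters(char[] ch, int start, int length) {
                    // Ignore whitespace-only chunks to keep the log short.
                    String text = new String(ch, start, length).trim();
                    if (!text.isEmpty()) log.add("text " + text);
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: the handler sees each event once, in document order. A parser is allowed to deliver character content in several chunks, so real handlers usually accumulate text between start and end events.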

76


CDATA Sections

Including code chunks from any language with < or & can be tedious
we need to tell the parser: do not parse this
good, for instance, to include segments of XML code to show

CDATA Section
between <![CDATA[ and ]]>
can contain anything but its own closing delimiter ]]>

After parsing, no way to tell whether a text came from a CDATA section or not
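
That a CDATA section is only a notational convenience can be checked by comparing it with the escaped form. This is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name CdataDemo and the sample code chunk are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo: CDATA and escaped text yield the same character data.
class CdataDemo {
    static final String WITH_CDATA = "<code><![CDATA[if (a < b && b < c)]]></code>";
    static final String ESCAPED = "<code>if (a &lt; b &amp;&amp; b &lt; c)</code>";

    // Returns the text content of the root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText(WITH_CDATA));
        System.out.println(rootText(ESCAPED));
        System.out.println(rootText(WITH_CDATA).equals(rootText(ESCAPED)));
    }
}
```

Both documents carry the same text; the CDATA form is simply easier to write and read when the content is full of markup characters.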

32

Comments

Easy
    <!-- Comment -->
It cannot contain --, nor can it end with --->

Comments do not affect the document tree structure
they can appear anywhere, even before the root element
but not inside a tag or a comment

Parsers may either drop or keep them, at their will

Comments are meant to improve human legibility of XML docs
to give info to computational agents: processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parser
comments may disappear at any stage of the process

Processing instructions serve this very purpose
    <?target … ?>

The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>

A processing instruction is markup, but not an element
it can appear anywhere outside a tag, even before or after the root
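
An application can pick up the processing instructions addressed to it from the parsed document. This is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name ProcessingInstructionDemo and the file name style.css are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo: list the processing instructions placed before or after the root.
class ProcessingInstructionDemo {
    static final String DOC =
        "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?>\n<doc/>";

    static List<String> instructions(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> result = new ArrayList<>();
            // Only top-level nodes are scanned here, not the inside of the root element.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    result.add(pi.getTarget() + " | " + pi.getData());
                }
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String pi : instructions(DOC)) {
            System.out.println(pi);
        }
    }
}
```

The parser hands over the target and the raw data string; anything that looks like attributes inside the instruction is for the application to interpret.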

34

The XML Declaration

Looks like an XML processing instruction
but it is not: it is just the XML declaration

It is optional
but if present, it must be the very first thing in the document
not even comments allowed before

    <?xml version="1.0" encoding="utf-8" standalone="no"?>

version is the XML version (1.0, 1.1, …)

encoding is the form of the text (Unicode in the example)
optional, default Unicode

standalone says whether the document depends on an external DTD (yes means it does not)
optional, default no

35



all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tags

prefix x-58

Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue

it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portability

across platforms across applications across time

59

XML amp CSS

60

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Cascading Style Sheets (CSS)

a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basics

if not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tags

nometag attributo1 valore1 hellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sorts

through ad hoc APIThe most used hated deprecated widespread are

69

Document Object httpwwww3orgDOM

standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 34: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions serve exactly this purpose
  <?target … ?>
The target may be the application that has to handle the instruction, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, but not an element
  it can appear anywhere outside a tag, even before or after the root element
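How an application might pick up processing instructions from a parsed document can be sketched in Java with the standard JAXP DOM API. This is only an illustration under stated assumptions: the class name `ProcessingInstructionSketch`, the `audit` target and the `player.css` file name are hypothetical and do not come from the slides.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

final class ProcessingInstructionSketch {
    // Lists "target -> data" for every processing instruction found at document level,
    // that is, before or after the root element.
    static List<String> listInstructions(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> found = new ArrayList<>();
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    found.add(pi.getTarget() + " -> " + pi.getData());
                }
            }
            return found;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document: one instruction before the root, one after it.
        String xml = "<?xml version=\"1.0\"?>"
                + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
                + "<player/>"
                + "<?audit checked?>";
        for (String line : listInstructions(xml)) {
            System.out.println(line);
        }
    }
}
```

The XML declaration is not reported among the instructions, which matches the point made on the next slide: it looks like a processing instruction but is not one.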

The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be the very first thing in the document
  not even comments are allowed before it
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the character encoding of the text (UTF-8, a Unicode encoding, in the example)
  optional, default Unicode (UTF-8 or UTF-16)
Standalone set to yes means that the document does not rely on an external DTD
  optional, default no
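The three fields of the declaration can be read back after parsing. The following Java sketch uses the DOM Level 3 accessors `getXmlVersion`, `getXmlEncoding` and `getXmlStandalone` on `org.w3c.dom.Document`; the class name `XmlDeclarationSketch` and the sample document are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class XmlDeclarationSketch {
    // Parses the document from bytes, so that the parser itself has to deal with the encoding.
    static Document parse(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Summarises what the XML declaration said.
    static String describe(String xml) {
        Document doc = parse(xml);
        return "version=" + doc.getXmlVersion()
                + " encoding=" + doc.getXmlEncoding()
                + " standalone=" + doc.getXmlStandalone();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?><player/>";
        System.out.println(describe(xml));
    }
}
```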

Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
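Since every XML parser must reject documents that are not well-formed, the rules above can be checked by simply trying to parse. A minimal Java sketch with the standard SAX parser follows; the class name `WellFormednessSketch` and the sample strings are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormednessSketch {
    // Returns true when the text parses without a fatal (well-formedness) error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, so violations end up here.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a><b></b></a>",        // matching, properly nested tags
            "<a><b></a></b>",        // overlapping elements
            "<a/><b/>",              // two root elements
            "<a x=1/>",              // unquoted attribute value
            "<a x=\"1\" x=\"2\"/>",  // same attribute twice
            "<a>R&D</a>",            // unescaped & sign
            "<a>R&amp;D</a>"         // escaped & sign
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```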

DTD

Flexibility or
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
  for instance: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how validity errors are handled is optional

Validation
A valid XML document includes (or refers to) a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, a DTD specifies only what is allowed
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
There are many things a DTD cannot say
  we stick with what we can specify

DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD
We do not go too deep into DTD syntax
  we just look at the example below and comment on it
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
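Whether a document satisfies this DTD can be checked with a validating parser. The Java sketch below uses the standard JAXP API (`setValidating(true)` plus an `ErrorHandler` that collects validity errors). It assumes that the name in the DOCTYPE declaration is `player`, since validity requires it to match the root element; the class name `DtdValidationSketch` and the invalid sample are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DtdValidationSketch {
    // XML declaration plus the internal DTD subset of the football player example.
    static final String PROLOG =
            "<?xml version=\"1.0\" standalone=\"yes\"?>"
          + "<!DOCTYPE player ["
          + "<!ELEMENT player (name, surname, team+)>"
          + "<!ELEMENT name (#PCDATA)>"
          + "<!ELEMENT surname (#PCDATA)>"
          + "<!ELEMENT team (#PCDATA)>"
          + "<!ATTLIST team current (yes | no) #REQUIRED>"
          + "]>";

    // Returns the validity errors reported for the given root element (empty list = valid).
    static List<String> validityErrors(String rootElement) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(PROLOG + rootElement)));
        } catch (Exception e) {
            // Not even well-formed: a different kind of failure from invalidity.
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team><team current=\"no\">Mantova</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println("valid document:  " + validityErrors(valid));
        System.out.println("player w/o team: " + validityErrors(noTeam));
    }
}
```

Validity errors are recoverable: the parser reports them and goes on, which is why they are collected in a list rather than caught as exceptions.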

DTD Declaration
The DTD is declared here as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

DTD Declarations
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs

Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
Suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
Parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
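The operators above can be tried out against a validating parser. The sketch below is written in Java with the standard JAXP API and uses a hypothetical `book` vocabulary (not from the slides) whose content model combines sequence, choice, grouping and the `+` and `?` suffixes; the class name `ContentModelSketch` is likewise an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class ContentModelSketch {
    // One title, then one or more authors or editors (in any mix), then an optional note.
    static final String DTD =
            "<!DOCTYPE book ["
          + "<!ELEMENT book (title, (author | editor)+, note?)>"
          + "<!ELEMENT title (#PCDATA)>"
          + "<!ELEMENT author (#PCDATA)>"
          + "<!ELEMENT editor (#PCDATA)>"
          + "<!ELEMENT note (#PCDATA)>"
          + "]>";

    // Returns true when <book>content</book> is valid against the DTD above.
    static boolean isValid(String content) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) { valid[0] = false; }
            });
            builder.parse(new InputSource(new StringReader(DTD + "<book>" + content + "</book>")));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println(isValid("<title>T</title><author>A</author>"));             // sequence respected
        System.out.println(isValid("<title/><editor/><author/><note/>"));              // choice repeated, note present
        System.out.println(isValid("<author>A</author><title>T</title>"));             // wrong order
        System.out.println(isValid("<title>T</title>"));                               // + needs at least one
        System.out.println(isValid("<title/><author/><note/><note/>"));                // ? allows at most one
    }
}
```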

Attribute
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
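The effect of attribute defaults shows up in what the parser hands to the application. In the Java sketch below (standard JAXP DOM API) the DTD is a variation of the football player example made up for illustration: `current` gets the literal default `no`, a hypothetical `sport` attribute is `#FIXED`, and a hypothetical `since` attribute is `#IMPLIED`. The class name `AttributeDefaultSketch` is an assumption as well.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class AttributeDefaultSketch {
    // The second team element carries no attributes at all in the document text.
    static final String DOC =
            "<!DOCTYPE player ["
          + "<!ELEMENT player (team+)>"
          + "<!ELEMENT team (#PCDATA)>"
          + "<!ATTLIST team current (yes | no) \"no\""
          + "               sport CDATA #FIXED \"football\""
          + "               since CDATA #IMPLIED>"
          + "]>"
          + "<player><team current=\"yes\">Bologna</team><team>Mantova</team></player>";

    // Returns the value the parser reports for an attribute of the n-th team, or null if absent.
    static String attribute(int teamIndex, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)));
            Element team = (Element) doc.getElementsByTagName("team").item(teamIndex);
            return team.hasAttribute(name) ? team.getAttribute(name) : null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("current (explicit): " + attribute(0, "current"));
        System.out.println("current (default):  " + attribute(1, "current"));
        System.out.println("sport (fixed):      " + attribute(1, "sport"));
        System.out.println("since (implied):    " + attribute(1, "since"));
    }
}
```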

Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more permissive than an XML name: any name character is accepted as the first character
  the plural form accepts more than one token, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID of some element in the same document

Other DTD
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces

What are
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but by convention svg, rdf and other well-known prefixes are usually left unchanged
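That the URI, and not the prefix, identifies a name can be seen with a namespace-aware parser. The following Java sketch (standard JAXP DOM API) returns each child element as `{namespace URI}local part`; the class name `NamespacePrefixSketch`, the wrapper element `doc` and the prefixes `m`, `a` and `b` are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NamespacePrefixSketch {
    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    // Returns "{namespace URI}local part" for every child element of the root.
    static List<String> expandedChildNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    names.add("{" + n.getNamespaceURI() + "}" + n.getLocalName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds a document holding two "set" elements bound to the given prefixes.
    static String sample(String svgPrefix, String mathPrefix) {
        return "<doc xmlns:" + svgPrefix + "=\"" + SVG + "\" xmlns:" + mathPrefix + "=\"" + MATHML + "\">"
                + "<" + svgPrefix + ":set/><" + mathPrefix + ":set/></doc>";
    }

    public static void main(String[] args) {
        // Same local name "set", two different namespaces.
        System.out.println(expandedChildNames(sample("svg", "m")));
        // Different prefixes, same URIs: the names are the same as before.
        System.out.println(expandedChildNames(sample("a", "b")));
    }
}
```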

Setting Default
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
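A default namespace can be observed in the same way. In this Java sketch (standard JAXP DOM API, namespace-aware) the `rect` child and the concrete `width` and `height` values are made up for illustration, as is the class name `DefaultNamespaceSketch`. Besides the elements, it also looks at an unprefixed attribute, which under the Namespaces in XML rules does not pick up the default namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class DefaultNamespaceSketch {
    static final String DOC =
            "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"100\" height=\"50\">"
          + "<rect width=\"10\" height=\"5\"/></svg>";

    static Document parse() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(DOC)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Namespace of the first element with the given local name, whatever its namespace.
    static String namespaceOfElement(String localName) {
        Element e = (Element) parse().getElementsByTagNameNS("*", localName).item(0);
        return e.getNamespaceURI();
    }

    // Namespace of the unprefixed width attribute on the root element.
    static String namespaceOfWidthAttribute() {
        Attr width = parse().getDocumentElement().getAttributeNode("width");
        return width.getNamespaceURI();
    }

    public static void main(String[] args) {
        System.out.println("svg element:     " + namespaceOfElement("svg"));
        System.out.println("rect element:    " + namespaceOfElement("rect"));
        System.out.println("width attribute: " + namespaceOfWidthAttribute());
    }
}
```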

Internationalisation

What does Text
"Text" can be encoded according to so many different alphabets
A character set is a mapping between characters and integers (code points)
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so its encoding should be declared
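The difference between a code point and its encodings can be made concrete with a few lines of Java, using only `java.nio.charset.StandardCharsets`. The class name `EncodingSizeSketch` and the choice of the word "Università" as sample text are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSizeSketch {
    // Ten characters; the last one is à, written as an escape to keep this source ASCII-only.
    static final String WORD = "Universit\u00E0";

    // Number of bytes the text takes once encoded with the given character encoding.
    static int byteLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        // One code point for the last character, whatever the encoding...
        System.out.println("code point of the last character: " + WORD.codePointAt(WORD.length() - 1));
        // ...but a different number of bytes for each encoding.
        System.out.println("ISO-8859-1: " + byteLength(WORD, StandardCharsets.ISO_8859_1) + " bytes");
        System.out.println("UTF-8:      " + byteLength(WORD, StandardCharsets.UTF_8) + " bytes");
        System.out.println("UTF-16BE:   " + byteLength(WORD, StandardCharsets.UTF_16BE) + " bytes");
    }
}
```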

The XML Encoding
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
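Why the declaration matters can be shown by handing the parser raw bytes. In the Java sketch below (standard JAXP DOM API) the same Latin-1 bytes are parsed once with and once without an encoding declaration; the class name `EncodingDeclarationSketch`, the element name `school` and the sample text are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.helpers.DefaultHandler;

final class EncodingDeclarationSketch {
    // Encodes the text as ISO-8859-1 (Latin-1), one byte per character.
    static byte[] latin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Returns the text of the root element, or null if the bytes cannot be parsed.
    static String rootTextOrNull(byte[] document) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors quietly
            Document doc = builder.parse(new ByteArrayInputStream(document));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String body = "<school>Universit\u00E0</school>";
        // Declared: the parser knows how to turn the bytes back into characters.
        System.out.println(rootTextOrNull(latin1("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)));
        // Undeclared: the parser assumes UTF-8, and the Latin-1 byte for à is not valid UTF-8 here.
        System.out.println(rootTextOrNull(latin1(body)));
    }
}
```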

Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
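A spell-checker like the one mentioned above has to find out which language applies to a given element. The Java sketch below (standard JAXP DOM API) relies on the XML rule that xml:lang also applies to the content of the element carrying it unless overridden, so it walks up the ancestors until it finds the attribute. The class name `XmlLangSketch` and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XmlLangSketch {
    static final String DOC =
            "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\">Ciao <em>mondo</em></p></doc>";

    // Language in effect for an element: its own xml:lang, or that of the nearest ancestor.
    static String languageOf(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element current = (Element) n;
            if (current.hasAttribute("xml:lang")) {
                return current.getAttribute("xml:lang");
            }
        }
        return null; // no language declared anywhere up the tree
    }

    // Convenience lookup on the sample document: the index-th element with the given tag name.
    static String languageOf(String tagName, int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)));
            return languageOf((Element) doc.getElementsByTagName(tagName).item(index));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("first p:  " + languageOf("p", 0));
        System.out.println("second p: " + languageOf("p", 1));
        System.out.println("em:       " + languageOf("em", 0));
    }
}
```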

Encoding for
Dealing with encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining the presentation style for the XML document tags
  tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Often we need to manipulate / access / delete / modify
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

Properties & Methods
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
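The general node properties and methods listed above map directly onto the Java binding of DOM (`org.w3c.dom.Node`). The sketch below, based on the standard JAXP API, prints type, name and value for each child of the root and then appends a new element. The class name `DomNodeSketch`, the `audit` instruction and the sample content are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomNodeSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // "type name value" for every child of the given node, in document order.
    static List<String> describeChildren(Node parent) {
        List<String> out = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            out.add(n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue());
        }
        return out;
    }

    // Creates <team current="...">name</team> and appends it after the other children of the root.
    static Element appendTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.appendChild(doc.createTextNode(name));
        doc.getDocumentElement().appendChild(team);
        return team;
    }

    public static void main(String[] args) {
        Document doc = parse("<player><!--note--><name>Carlo</name><?audit ok?>text</player>");
        // Node type codes: 1 = element, 3 = text, 7 = processing instruction, 8 = comment.
        for (String line : describeChildren(doc.getDocumentElement())) {
            System.out.println(line);
        }
        Element team = appendTeam(doc, "Bologna", "yes");
        System.out.println("attributes on new team: " + team.getAttributes().getLength());
    }
}
```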

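A Simple Java DOM
    // DOMParser is presumably the Xerces class org.apache.xerces.parsers.DOMParser;
    // print(Node, PrintStream) is a helper that the slide does not show.
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // Look for the first child of the root element named "recipe"
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }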

Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
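The event-based model can be sketched in Java with the standard SAX classes (`javax.xml.parsers.SAXParserFactory` and `org.xml.sax.helpers.DefaultHandler`). The handler below only records the events it receives, in order, and never builds a tree; the class name `SaxEventSketch` and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventSketch {
    // Returns the sequence of events the SAX parser fires while reading the document.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks; each chunk is logged.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text " + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><team>Bologna</team></player>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

Nothing of the document is kept once an event has been handled, which is the memory advantage over DOM; the price is that the application must keep track of any context it needs by itself.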

Page 35: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

The XML DeclarationLooks like an XML processing instruction

but it is not just the XML declarationIt is optional

but if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogtVersion is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-Main rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

37

Flexibility or

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one team

Document Type Definition (DTD)to define which XML documents are valid

Validity is not mandatory as well-formednesshow to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declaration

then the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone else

like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute value

NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFS

48

Other DTD

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

50

What are Distinguish

different XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes Example

a large firm could have a number of namespaces for different purposes

ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not

53

Setting Default

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tags

prefix x-58

Encoding for Portability

Working around encoding is not simply an “internationalisation” issue: it is also about portability.

When transmitting or communicating through text-based files, many errors typically occur, which are often not easy to catch.

XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability across platforms, across applications, across time.

XML & CSS

XML on Browsers

Different browsers give different experiences when trying to visualise an XML document.

XML, however, can be transformed to become easier to handle by standard browsers.

Two main approaches:
  a Web-based one: XML + CSS
  an XML-based one: XSL

In the following, we explore the XML + CSS approach.

Cascading Style Sheets

Cascading Style Sheets (CSS): a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents.

A W3C standard: http://www.w3.org/Style/CSS/

Goals:
  describing how to present elements of a document, spanning over a range of different media
  separating style description from content and structure

In this course, we assume that you already know the basics; if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS. Two things are needed:
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet

Processing instruction to associate CSS to XML:

<?xml-stylesheet type="text/css" href="filename.css"?>

CSS style sheet defining the presentation style for the XML document tags:

tagname { property1: value1; … }

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML

Representing information in an XML document, and presenting it somehow, is not enough for most non-trivial application scenarios.

We often need to manipulate (access, delete, modify) parts of an XML document, which may or may not be an XML file.

This is typically done through programming languages of many sorts, through ad hoc APIs.

The most used / hated / deprecated / widespread are DOM and SAX.

Document Object Model

http://www.w3.org/DOM/

A W3C standard, as usual.

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents.

It applies to HTML as well as XML.

It is essentially an API, standardised for Java & ECMAScript, but it can be extended to other languages.

There is no time here to go deep into DOM: we just try to understand its nature, goals and scope.

DOM & Levels

DOM views an XML tree as a data structure, similar to the DOM from JavaScript.

DOM loads the whole XML document in memory to manipulate it: memory consumption may be huge.

It is quite large and complex…

Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML

Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features

DOM Nodes

An XML document is a tree. The tree contains nodes:
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.

The DOM specification states that a node can be one of the following kinds:

document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation

It also defines which kinds of child nodes each of them should / could have.

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree.
Every DOM node has a name, a value, a type.
There are general properties and methods for all kinds of nodes:
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes

Then, any specific kind of node has its own specific properties and methods.
These properties and methods are made available by the suitable API for the language of choice: many solutions exist for Java.
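
As an illustration of these general properties and methods, the sketch below builds a tiny tree with the org.w3c.dom API shipped with the JDK and then inspects name, value, type and attributes of its nodes. The class name DomNodeDemo and the helpers build and describe are hypothetical; the element names are borrowed from the football-player example used in this course.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class DomNodeDemo {
    /** Builds <player><team current="yes">Bologna</team></player> in memory. */
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder()
                                                 .newDocument();
            Element player = doc.createElement("player");
            doc.appendChild(player);
            Element team = doc.createElement("team");
            team.setAttribute("current", "yes");
            team.appendChild(doc.createTextNode("Bologna"));
            player.appendChild(team); // appended after any existing children
            return doc;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    /** Name, value and numeric type of a node, as exposed by the generic Node interface. */
    static String describe(Node n) {
        return n.getNodeName() + " / " + n.getNodeValue() + " / type " + n.getNodeType();
    }

    public static void main(String[] args) {
        Node team = build().getDocumentElement().getFirstChild();
        System.out.println(describe(team));                 // an element node: its value is null
        System.out.println(describe(team.getFirstChild())); // the text node holding "Bologna"
        System.out.println("attributes: " + team.getAttributes().getLength());
    }
}
```

The same three accessors work on every kind of node; what changes from kind to kind is what they return, for instance an element has no value of its own while a text node does.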

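A Simple Java DOM Example

    public static void main(String[] args) {
        try {
            // DOMParser is presumably the Apache Xerces parser (org.apache.xerces.parsers.DOMParser)
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // Look for the first "recipe" child of the root element
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out); // print(Node, PrintStream) is a helper not shown on the slide
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }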

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory:
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?

This is the motivation behind SAX, which
  did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based.
It sees the document not as a tree, but as a text doc flowing through the SAX parser and generating events as it goes: document started / ended, element started / ended, character content, etc.

A very simple model, good for simple applications, and also to avoid memory abuse.

Not as well supported as DOM, in terms of standardisation as well as of tools.
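
To give a feeling for the event-based model, here is a minimal sketch based on the SAX API bundled with the JDK (javax.xml.parsers and org.xml.sax). The class name SaxDemo and the helper events are assumptions for this illustration; the handler simply records the structural events in the order the parser reports them, without ever building a tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxDemo {
    /** Parses the XML text and returns the sequence of structural events seen. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument()   { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            // characters(...) is left out on purpose: a parser is allowed to
            // deliver the text of one element in several chunks.
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

Since nothing is retained beyond what the handler chooses to store, memory use does not grow with the size of the document, which is the main point in favour of SAX over DOM.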

Checking Well-Formedness

Main rules:
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …

Tools on the Web: just look around.
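
Any XML parser can serve as a well-formedness checker, since a conforming parser must reject a document that breaks these rules. The sketch below relies on the non-validating JAXP parser in the JDK; the class name WellFormedDemo and the helper isWellFormed are made up for the illustration, and the sample strings are arbitrary.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedDemo {
    /** True if the text parses as XML, false if the parser reports a well-formedness error. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors, so nothing is printed to the console
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException notWellFormed) {
            return false;
        } catch (IOException | ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a><b>text</b></a>",   // fine
            "<a><b></a></b>",       // overlapping elements
            "<a/><b/>",             // two root elements
            "<a x=1/>",             // unquoted attribute value
            "<a x='1' x='2'/>",     // duplicated attribute
            "<a>1 < 2</a>",         // unescaped < in character data
            "<a>1 &lt; 2</a>"       // escaped, hence fine
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```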

DTD

Flexibility or

XML is flexible, whatever this means, but sometimes flexibility is not a feature within a given application scenario.

Sometimes some strict rule is required, and some control over syntax should be enforced: for instance, a football player should have at least one team.

Document Type Definition (DTD): to define which XML documents are valid.

Unlike well-formedness, validity is not mandatory: how to handle validity errors is optional.

Validation

A valid XML document includes a DTD that the document satisfies.

Main principle: everything not permitted is forbidden; that is, a DTD specifies positive examples.

Everything in the XML document must match a DTD declaration: then the document is valid; otherwise the document is invalid.

There are many things a DTD does not say: we stick with what we can specify.

DTD is…

SGML-based:
  syntax a bit awkward, but after all easy to understand
  and quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents:
  typical syntax-based approach
  maybe limited, but easy to implement

Maybe DTD is not the future of XML document validation:
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own ones is likely to be useful, or maybe necessary, for a while still

A Simple DTD

We do not go too deep into DTD syntax: we just look at the example below and comment on it.

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
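
As a rough sketch of what validation looks like in practice, the following Java program feeds the football-player document to the JAXP parser bundled with the JDK, with DTD validation switched on. The class name DtdValidationDemo and the helper validityErrors are made up for this illustration. The DOCTYPE name is player here, because a validating parser requires it to match the root element. Error messages are parser-specific, so only their presence or absence is meaningful.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationDemo {
    // XML declaration plus internal DTD, shared by all the sample documents
    static final String PROLOG =
          "<?xml version='1.0' standalone='yes'?>"
        + "<!DOCTYPE player ["
        + " <!ELEMENT player (name, surname, team+)>"
        + " <!ELEMENT name (#PCDATA)>"
        + " <!ELEMENT surname (#PCDATA)>"
        + " <!ELEMENT team (#PCDATA)>"
        + " <!ATTLIST team current (yes | no) #REQUIRED>"
        + "]>";

    /** Parses PROLOG + body with validation on and returns the validity errors reported. */
    static List<String> validityErrors(String body) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); } // validity error
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; } // not well-formed
            });
            builder.parse(new InputSource(new StringReader(PROLOG + body)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
                     + "<team current='yes'>Bologna</team>"
                     + "<team current='no'>Mantova</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println("valid document, errors:   " + validityErrors(valid).size());
        System.out.println("no-team document, errors: " + validityErrors(noTeam));
    }
}
```

Note that the second document is perfectly well-formed: the parser still builds it, and it is only the error handler that learns about the missing team.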

DTD Declaration

In the example, the DTD is declared as internal, but it could be declared separately:

<!DOCTYPE player SYSTEM "football_player.dtd">

even referring to an external shared resource:

<!DOCTYPE player SYSTEM "http://…">

DTD Declarations

So you may define your own DTD and
  either include it in your XML document
  or save it as an independent document and refer to it from one or more XML docs

or use an external DTD defined by someone else,
  like a working group you belong to, or a standardisation body of any sort,
  by referring to that externally-defined syntax from your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams, in that precise order; and they are just parsed character data (#PCDATA).

Some Syntax

, is for sequence: to define ordered lists
| is for choice: to provide for alternatives

Suffixes:
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence

Parentheses for grouping, at any level of nesting; operators and suffixes are applicable at any level.

ANY for free-form content
EMPTY for empty elements

Attribute Declarations

A team element has a current attribute
  which is mandatory (#REQUIRED); #IMPLIED would say optional instead
  and can be either yes or no: enumeration as an attribute type

Attribute Defaults

#IMPLIED: the attribute is optional
#REQUIRED: the attribute is mandatory
#FIXED: whether it is explicitly specified or not, it has a given value
literal: the default value is the literal quoted string
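
The literal default is the case where the DTD actually changes what an application sees. The sketch below is a variation on the football-player DTD, not the declaration used in the course example: here the current attribute is given the literal default "no" instead of being #REQUIRED. The class name AttributeDefaultDemo and the helper currentOf are hypothetical, and the parser is the one bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeDefaultDemo {
    // Assumed variation of the DTD: current has a literal default instead of #REQUIRED
    static final String DTD =
          "<!DOCTYPE team ["
        + " <!ELEMENT team (#PCDATA)>"
        + " <!ATTLIST team current (yes | no) 'no'>"
        + "]>";

    /** Value of the current attribute as seen by the application after parsing. */
    static String currentOf(String teamElement) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DTD + teamElement)));
            return doc.getDocumentElement().getAttribute("current");
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Attribute omitted in the document: the parser supplies the default from the DTD
        System.out.println(currentOf("<team>Mantova</team>"));
        // Attribute written explicitly: the document's own value wins
        System.out.println(currentOf("<team current='yes'>Bologna</team>"));
    }
}
```

Validation is not even switched on here: defaults declared in the internal DTD are applied by the parser as part of reading the document.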

Attribute Types

CDATA: any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS: more permissive than an XML name, since any name character is also accepted as the first character; the plural form accepts more than one token, separated by whitespace
ENTITY, ENTITIES: name(s) of unparsed entities declared elsewhere in the document
ID: an XML name, unique in the document, working as an identifier
IDREF, IDREFS: reference(s) to the ID of some element in the document

Other DTD Declarations

ENTITY declarations:
<!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">

NOTATION declarations: who cares, actually

We stop here: more only for those who need it.

Namespaces

What are

Distinguish: different XML applications may use the same names
  at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished

Group: names of elements and attributes of the same XML application can be grouped together
  to be more easily recognised and handled

Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for

Qualified names: prefix:local_part

Examples of qualified names (or QNames, or raw names):

rdf:description, xlink:type, xsl:template

Used for both element and attribute names.

Associating Prefixes

Example: a large firm could have a number of namespaces for different purposes:

<company xmlns:local="http://www.company.it/xml"
         xmlns:euro="http://www.company.eu/xml"
         xmlns:world="http://www.company.com/xml">

Then you can use local, euro and world everywhere as prefixes.
Prefixes are typically declared in the topmost element, but could be declared anywhere.
Example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">

URIs are standardised, prefixes are not; by convention, though, well-known prefixes such as svg and rdf are usually kept unchanged.



Page 38: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Flexibility or

XML is flexible, whatever this means
  but sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is required
  some control over syntax should be enforced
    for instance, a football player should have at least one team

Document Type Definition (DTD)
  to define which XML documents are valid

Validity is not mandatory, unlike well-formedness
  how to handle validity errors is up to the application

38

Validation

A valid XML document includes a DTD that the document satisfies

Main principle
  everything not permitted is forbidden
  that is, a DTD specifies only what is allowed

Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid

There are many things a DTD cannot say
  we stick with what we can specify
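As a minimal sketch of the valid / invalid distinction, the Java class below asks the JAXP parser bundled with the JDK to validate the football player example against its internal DTD. The class name DtdValidationSketch and its helper are illustrative only and do not come from the slides. The sketch assumes the DOCTYPE name is player, matching the root element, since a validating parser checks that as well.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {

    // Internal DTD subset taken from the football player example
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    static final String WITH_TEAM = DTD
        + "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team></player>";

    // Well-formed, but the DTD asks for at least one team
    static final String WITHOUT_TEAM = DTD
        + "<player><name>Carlo</name><surname>Nervo</surname></player>";

    /** Returns true only if the document is well-formed and satisfies its DTD. */
    static boolean isValid(String xml) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn DTD validation on
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { valid[0] = false; } // validity error
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return false; // not well-formed, or the parser could not be set up
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println("with team:    " + isValid(WITH_TEAM));
        System.out.println("without team: " + isValid(WITHOUT_TEAM));
    }
}
```

Both documents are well-formed; only the first one matches every DTD declaration, so only the first one is reported as valid.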

39

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement

Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

40

A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example below and comment on it

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

41

DTD Declaration

In the example, the DTD is declared as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

42

DTD Declarations

So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs

43

Element Declarations

In the example, a player element contains one name, one surname, and one or more teams
  in that precise order
  and each of them contains just parsed character data (#PCDATA)

44

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements
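The operators above can be tried out with a small Java sketch that validates a few candidate child sequences against one content model. The content model, the extra element names (nickname, club, note) and the class name ContentModelSketch are hypothetical, chosen only so that sequence, choice, grouping and all three suffixes appear in a single declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class ContentModelSketch {

    // name first, an optional nickname, one or more team-or-club, then any number of notes
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, nickname?, (team | club)+, note*)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT nickname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ELEMENT club (#PCDATA)>\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "]>\n";

    /** Returns true if the given children form a valid content for player. */
    static boolean accepts(String children) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { valid[0] = false; } // validity error
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            String xml = DTD + "<player>" + children + "</player>";
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return false; // not well-formed, or the parser could not be set up
        }
        return valid[0];
    }

    public static void main(String[] args) {
        String[] candidates = {
            "<name/><team/>",                               // minimal valid content
            "<name/><nickname/><club/><team/><note/><note/>", // every operator exercised
            "<team/><name/>",                               // wrong order
            "<name/>",                                      // + needs at least one team or club
            "<name/><nickname/><nickname/><team/>"          // ? allows at most one nickname
        };
        for (String c : candidates) {
            System.out.println(accepts(c) + "  " + c);
        }
    }
}
```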

45

Attribute Declarations

In the example, a team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

46

Attribute Defaults

#IMPLIED
  the attribute is optional

#REQUIRED
  the attribute is mandatory

#FIXED
  whether it is explicitly specified or not, it has a given value

literal
  the default value is the literal quoted string
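To see what the defaults mean for an application, the following Java sketch parses a small document with the JDK's DOM parser and reads the attributes back. The attributes sport and since, the default values, and the class name AttributeDefaultsSketch are assumptions made for illustration; only the current attribute comes from the football player example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaultsSketch {

    static final String XML =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (team+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\"\n"               // literal default
        + "                 sport   CDATA      #FIXED \"football\"\n"  // always this value
        + "                 since   CDATA      #IMPLIED>\n"            // optional, no default
        + "]>\n"
        + "<player><team current=\"yes\">Bologna</team><team>Mantova</team></player>";

    /** Value of the named attribute on the index-th team element ("" if absent). */
    static String teamAttribute(int index, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            Element team = (Element) doc.getElementsByTagName("team").item(index);
            return team.getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Bologna current = " + teamAttribute(0, "current"));
        System.out.println("Mantova current = " + teamAttribute(1, "current")); // filled in from the DTD
        System.out.println("Mantova sport   = " + teamAttribute(1, "sport"));   // #FIXED value
        System.out.println("Mantova since   = '" + teamAttribute(1, "since") + "'"); // absent
    }
}
```

The second team carries no attributes in the document text, yet the parser reports the literal default and the #FIXED value for it, while the #IMPLIED attribute stays absent.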

47

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value

NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace

ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document

ID
  an XML name, unique in the document, working as an identifier

IDREF, IDREFS
  reference(s) to ID values appearing elsewhere in the document
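A short Java sketch can show how ID and IDREF are checked during validation. The league / team / player vocabulary, the attribute names id and plays_for, and the class name IdRefSketch are hypothetical; the point is only that a validating parser rejects duplicated identifiers and references to identifiers that do not exist.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class IdRefSketch {

    static final String DTD =
          "<!DOCTYPE league [\n"
        + "  <!ELEMENT league (team+, player+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team id ID #REQUIRED>\n"
        + "  <!ELEMENT player (#PCDATA)>\n"
        + "  <!ATTLIST player plays_for IDREF #REQUIRED>\n"
        + "]>\n";

    static final String OK =
        "<team id=\"bologna\">Bologna</team><player plays_for=\"bologna\">Nervo</player>";
    // The reference points at an identifier that no element declares
    static final String DANGLING =
        "<team id=\"bologna\">Bologna</team><player plays_for=\"mantova\">Nervo</player>";
    // The same identifier is used twice
    static final String DUPLICATE =
        "<team id=\"bologna\">Bologna</team><team id=\"bologna\">Mantova</team>"
        + "<player plays_for=\"bologna\">Nervo</player>";

    /** Returns true if league content is valid with respect to the DTD above. */
    static boolean isValid(String content) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { valid[0] = false; } // validity error
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            String xml = DTD + "<league>" + content + "</league>";
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return false; // not well-formed, or the parser could not be set up
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println("matching reference: " + isValid(OK));
        System.out.println("dangling reference: " + isValid(DANGLING));
        System.out.println("duplicate id:       " + isValid(DUPLICATE));
    }
}
```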

48

Other DTD Declarations

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">

NOTATION declarations
  who cares, actually

We stop here
  more only for those who need it

49

Namespaces

50

What are

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished

Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled

Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

51

Syntax for Namespaces

Qualified names
  prefix:local_part

Examples of qualified names
  also called QNames, or raw names
  rdf:description, xlink:type, xsl:template

Used for both element and attribute names

52

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">

URIs are standardised, not prefixes
  but customary prefixes such as svg and rdf are usually left unchanged
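The fact that the URI, not the prefix, identifies a namespace can be checked with a namespace-aware DOM parser. In the sketch below the element name office and the class name NamespaceSketch are made up for illustration; the namespace URI is the one from the company example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceSketch {

    static final String COMPANY_NS = "http://www.company.it/xml";

    // The same element written with two different prefixes
    static final String WITH_LOCAL = "<local:office xmlns:local=\"" + COMPANY_NS + "\"/>";
    static final String WITH_IT = "<it:office xmlns:it=\"" + COMPANY_NS + "\"/>";

    /** Returns the root element name as {namespace URI}local part. */
    static String expandedName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedName(WITH_LOCAL));
        System.out.println(expandedName(WITH_IT));
    }
}
```

Both calls yield the same expanded name, so an application comparing namespace URI and local part treats the two documents alike, whatever prefix their authors picked.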

53

Setting Default Namespaces

xmlns attribute
  alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
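A sketch in Java, under the assumption of a tiny SVG fragment with a hypothetical rect child, shows the effect of a default namespace as seen through a namespace-aware DOM parser. It also shows a detail the slide does not mention: the default namespace applies to elements, not to unprefixed attributes. The class name DefaultNamespaceSketch is illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DefaultNamespaceSketch {

    static final String XML =
          "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"10\" height=\"10\">"
        + "<rect width=\"5\" height=\"5\"/>"
        + "</svg>";

    private static Element first(String tag) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(XML)));
        return (Element) doc.getElementsByTagName(tag).item(0);
    }

    /** Namespace URI of the first element with the given tag name. */
    static String elementNamespace(String tag) {
        try {
            return first(tag).getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Namespace URI of an attribute of that element (null means no namespace). */
    static String attributeNamespace(String tag, String attribute) {
        try {
            return first(tag).getAttributeNode(attribute).getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("svg   -> " + elementNamespace("svg"));
        System.out.println("rect  -> " + elementNamespace("rect"));            // inherited default
        System.out.println("width -> " + attributeNamespace("rect", "width")); // null
    }
}
```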

54

Internationalisation

55

What does Text

“Text” can be encoded according to many different alphabets
  character set
    mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode

A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text document
  so encoding should be declared
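The difference between a code point and its encodings can be made concrete with a few lines of Java. The sample characters and the class name EncodingSketch are arbitrary choices for illustration; UTF-16BE is used rather than plain UTF-16 so that no byte-order mark is counted.

```java
import java.nio.charset.Charset;

class EncodingSketch {

    /** Code point (the integer of the character set) of the first character. */
    static int codePoint(String text) {
        return text.codePointAt(0);
    }

    /** Number of bytes the text takes in the given encoding. */
    static int byteCount(String text, String encoding) {
        return text.getBytes(Charset.forName(encoding)).length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8"; // LATIN SMALL LETTER E WITH GRAVE
        String euro = "\u20AC";   // EURO SIGN
        System.out.println("e-grave code point: " + codePoint(eGrave));
        for (String enc : new String[] { "ISO-8859-1", "UTF-8", "UTF-16BE" }) {
            System.out.println("e-grave in " + enc + ": " + byteCount(eGrave, enc) + " byte(s)");
        }
        System.out.println("euro in UTF-8:    " + byteCount(euro, "UTF-8") + " byte(s)");
        System.out.println("euro in UTF-16BE: " + byteCount(euro, "UTF-16BE") + " byte(s)");
    }
}
```

One code point, several byte sequences: this is why a reader of the bytes needs to be told which encoding was used.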

56

The XML Encoding

Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>

Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
  Unicode and ISO are the most used families

Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is then a text declaration, no longer an XML declaration
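Why the declaration matters can be sketched by feeding the same Latin-1 bytes to a parser with and without an encoding declaration. The sample content and the class name EncodingDeclarationSketch are illustrative; the behaviour relied upon is that a parser with no declaration and no byte-order mark assumes UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingDeclarationSketch {

    // Both documents are stored as ISO-8859-1 bytes; only the first one says so
    static final byte[] DECLARED = latin1(
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Nicol\u00F2</name>");
    static final byte[] UNDECLARED = latin1("<name>Nicol\u00F2</name>");

    static byte[] latin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Text of the root element, or null if the bytes cannot be parsed. */
    static String rootText(byte[] bytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null; // here: the bytes are not a legal UTF-8 sequence
        }
    }

    public static void main(String[] args) {
        System.out.println("declared:   " + rootText(DECLARED));
        System.out.println("undeclared: " + rootText(UNDECLARED));
    }
}
```

With the declaration the accented letter is read back correctly; without it the parser falls back to UTF-8 and stops at the byte that is not valid in that encoding.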

57

Multi-Lingual

Example: a spell-checker or a voice-reader parsing an XML doc

How to determine the language of a subpart
  for multi-lingual docs

xml:lang attribute
  can be associated to any element
  determines the language of the element

Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-

58
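How a tool such as a spell-checker might find the language of a subpart can be sketched as a walk from an element up through its ancestors, since xml:lang applies to an element's content unless overridden further down. The sample document and the names XmlLangSketch and languageOf are assumptions made for this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlLangSketch {

    static final String XML =
        "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\">Ciao</p></doc>";

    /** Nearest xml:lang value on the element or one of its ancestors ("" if none). */
    static String languageOf(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute("xml:lang")) {
                return e.getAttribute("xml:lang");
            }
        }
        return ""; // no language information available
    }

    /** Language in force for the index-th p element of the sample document. */
    static String languageOfParagraph(int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return languageOf((Element) doc.getElementsByTagName("p").item(index));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("first paragraph:  " + languageOfParagraph(0)); // inherited from doc
        System.out.println("second paragraph: " + languageOfParagraph(1)); // set on the element
    }
}
```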

Encoding for

Working around encoding is not simply an “internationalisation” issue
  it is also about portability

When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch

XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

59

XML & CSS

60

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document

XML, however, can be transformed
  to become easier to handle by standard browsers

Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL

In the following we explore the XML + CSS approach

61

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents

W3C standard
  http://www.w3.org/Style/CSS

Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure

In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

62

CSS An Example

63

XML + CSS

Any XML document can be prepared for browser visualisation via CSS

Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet

Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>

CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM & SAX

68

Manipulating XML

Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios

Most often we need to manipulate
  access, delete, modify
  parts of an XML document
    which may or may not be an XML file

This is typically done through programming languages of many sorts
  through ad hoc APIs

The most used / hated / deprecated / widespread are DOM and SAX

69

Document Object Model

http://www.w3.org/DOM
  W3C standard, as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents

It applies to HTML as well as XML

It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages

There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

70

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript

DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption

It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

71

DOM Nodes

An XML document is a tree

The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.

The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation

It also defines which kind of child nodes each of them should / could have
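The tree view can be made visible with a short Java sketch that walks a parsed document and prints the kind and name of every node. The sample document, the class name NodeTypesSketch and the labels printed for each node kind are illustrative; the node type constants are those of the org.w3c.dom.Node interface.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeTypesSketch {

    static final String XML =
        "<!-- roster --><player><name>Carlo</name><?app hint?></player>";

    static String kind(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:               return "document";
            case Node.ELEMENT_NODE:                return "element";
            case Node.TEXT_NODE:                   return "text";
            case Node.COMMENT_NODE:                return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default:                               return "other";
        }
    }

    /** Appends one line per node, indenting children under their parent. */
    static void dump(Node node, String indent, StringBuilder out) {
        out.append(indent).append(kind(node)).append(' ').append(node.getNodeName()).append('\n');
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            dump(child, indent + "  ", out);
        }
    }

    /** Outline of the sample document, starting from the document node. */
    static String outline() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            StringBuilder out = new StringBuilder();
            dump(doc, "", out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline());
    }
}
```

Note that the document node itself sits above the root element, and that the character data inside name is a separate text node rather than a property of the element.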

72

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree

Every DOM node has a name, a value, a type

There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methods

These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java

73
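In the Java binding of DOM the attributes property and appendChild appear as getAttributes() and appendChild(). The sketch below, with an illustrative class name DomUpdateSketch and a cut-down version of the football player document, appends a second team and then lists every child of the root with its attributes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomUpdateSketch {

    static final String XML =
        "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>";

    /** Appends a new team element after the other children of the root. */
    static void addTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.setTextContent(name);
        doc.getDocumentElement().appendChild(team);
    }

    /** One entry per child of the root: name[attribute=value]=text. */
    static String summary(Document doc) {
        StringBuilder out = new StringBuilder();
        Node child = doc.getDocumentElement().getFirstChild();
        for (; child != null; child = child.getNextSibling()) {
            if (out.length() > 0) out.append(' ');
            out.append(child.getNodeName());
            NamedNodeMap attributes = child.getAttributes(); // all attributes of the node
            for (int i = 0; i < attributes.getLength(); i++) {
                Node a = attributes.item(i);
                out.append('[').append(a.getNodeName()).append('=').append(a.getNodeValue()).append(']');
            }
            out.append('=').append(child.getTextContent());
        }
        return out.toString();
    }

    /** Parses the sample, appends Mantova as a non-current team, returns the summary. */
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            addTeam(doc, "Mantova", "no");
            return summary(doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```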

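A Simple Java DOM

    // DOMParser is presumably the Xerces DOM parser;
    // print(Node, PrintStream) is a helper that is not shown on the slide
    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
      try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);
        Document doc = p.getDocument();
        Node n = doc.getDocumentElement().getFirstChild();
        // skip the siblings that come before the first recipe element
        while (n != null && !n.getNodeName().equals("recipe"))
          n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
          print(n, out);
        out.println("</collection>");
      } catch (Exception e) {
        e.printStackTrace();
      }
    }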

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?

This is the motivation behind SAX
  which did not start as a standard
  has had problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

75

Simple API for XML

Differently from DOM, SAX is event-based

It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.

A very simple model
  good for simple applications
  and also to avoid memory abuse

Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
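The event-based model can be sketched with the SAX parser shipped with the JDK: the handler below never sees a tree, it only records the events as the document flows through the parser. The class name SaxEventsSketch and the labels used for the events are illustrative. Character data is buffered because a parser is allowed to deliver it in several chunks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsSketch {

    /** Parses the document and returns the sequence of SAX events observed. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }

            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes attributes) {
                flushText();
                log.add("start:" + qName);
            }

            @Override public void endElement(String uri, String localName, String qName) {
                flushText();
                log.add("end:" + qName);
            }

            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may arrive in several chunks
            }

            // Records the buffered character content, if any, as a single event
            private void flushText() {
                String s = text.toString().trim();
                if (!s.isEmpty()) log.add("text:" + s);
                text.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

Only the current event and the small text buffer are held in memory, which is the design trade-off with DOM: no whole-document tree, but also no way to navigate back to something already seen unless the handler stored it.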

76

Page 40: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

DTD ishellipSGML-based

syntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documents

typical syntax-based approachmaybe limited but easy to implement

Maybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone else

like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute value

NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFS

48

Other DTD

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

50

What are Distinguish

different XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes Example

a large firm could have a number of namespaces for different purposes

ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not

53

Setting Default

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
  how to determine the language of a subpart, for multi-lingual docs?
xml:lang attribute
  can be associated with any element
  determines the language of the element (and of its descendants, unless overridden)
Values are to be found in ISO 639
  standard two-letter code for each known language
  if not there, IANA
    prefix i-, such as i-navajo, i-klingon, …
  if not there either, such as for user-defined tags
    prefix x-
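A hedged sketch of how a tool such as a spell-checker could find the language in effect for an element, assuming the JAXP DOM API. The lookup walks up the ancestors until it meets an xml:lang attribute. The class name LangDemo, its helpers and the sample document are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class LangDemo {
    // Hypothetical multi-lingual sample document.
    static final String XML =
        "<doc xml:lang=\"en\"><p>Hello</p>"
      + "<quote xml:lang=\"it\"><em>Ciao</em></quote></doc>";

    // Nearest xml:lang value on the element or one of its ancestors, "" if none.
    static String languageOf(Element e) {
        for (Node n = e; n instanceof Element; n = n.getParentNode()) {
            Element cur = (Element) n;
            if (cur.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                return cur.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
            }
        }
        return "";
    }

    // Parses xml and returns the language in effect for the first tagName element.
    static String languageOfFirst(String xml, String tagName) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true); // needed to look attributes up by namespace
        Document doc = f.newDocumentBuilder()
                        .parse(new InputSource(new StringReader(xml)));
        return languageOf((Element) doc.getElementsByTagName(tagName).item(0));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(languageOfFirst(XML, "p"));  // inherited from doc
        System.out.println(languageOfFirst(XML, "em")); // overridden by quote
    }
}
```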

Encoding for Portability

Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS approach

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS with XML

    <?xml-stylesheet type="text/css" href="filename.css"?>

CSS style sheet
  defining presentation style for the XML document tags

    tagname { property1: value1; … }
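A minimal sketch of how a program can discover the style sheet associated with a document, assuming the JAXP DOM API: the xml-stylesheet processing instruction shows up as a node in the document prolog. The class name StylesheetPiDemo, the helper stylesheetData and the file name player.css are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

public class StylesheetPiDemo {
    // Hypothetical document associated with a hypothetical player.css.
    static final String XML =
        "<?xml version=\"1.0\"?>"
      + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
      + "<player><name>Carlo</name></player>";

    // Data of the first xml-stylesheet processing instruction
    // found among the top-level nodes, or null if there is none.
    static String stylesheetData(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                if ("xml-stylesheet".equals(pi.getTarget())) {
                    return pi.getData();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(stylesheetData(XML));
    }
}
```

Note that the XML declaration itself is not reported as a processing instruction node, so only the xml-stylesheet line is found.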

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Often we need to manipulate / access / delete / modify
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX

Document Object Model

http://www.w3.org/DOM
  W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them may have

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
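A short sketch of the general node properties and methods in their Java binding (org.w3c.dom, bundled with the JDK): the attributes property becomes getAttributes(), and appendChild keeps its name. The class name NodeDemo, the helper appendTeam and the sample data are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeDemo {
    // Appends <team current="no">teamName</team> to the root element,
    // then reads it back through the generic Node interface.
    static String appendTeam(String xml, String teamName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();

        Element team = doc.createElement("team");       // new nodes are created by the document
        team.setAttribute("current", "no");
        team.appendChild(doc.createTextNode(teamName));
        root.appendChild(team);                         // goes after the existing children

        Node last = root.getLastChild();                // the node just appended
        NamedNodeMap attrs = last.getAttributes();      // the "attributes" property
        Node attr = attrs.item(0);                      // attributes are nodes too
        return last.getNodeName() + "[" + attr.getNodeName() + "="
             + attr.getNodeValue() + "]:" + last.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(appendTeam("<player><name>Carlo</name></player>", "Mantova"));
        // prints team[current=no]:Mantova
    }
}
```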

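A Simple Java DOM Example

  // Needs java.io.PrintStream, org.w3c.dom.Document and org.w3c.dom.Node.
  // DOMParser is assumed here to be the Xerces class org.apache.xerces.parsers.DOMParser;
  // print(Node, PrintStream) is a helper that is not shown on the slide.
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                       // args[0]: URI of the XML document
      Document doc = p.getDocument();
      // Look for the first child of the root element named "recipe"
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                        // output the recipe found, if any
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }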

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as they occur: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM is
  in terms of standardisation
  as well as of tools
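A minimal sketch of the event-based model, assuming the SAX API bundled with the JDK (javax.xml.parsers and org.xml.sax). The handler only records the events it receives, and no tree is ever built. The class name SaxDemo, the helper events and the sample document are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    // Parses xml and returns the structural events in the order they were fired.
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument()   { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            // characters(...) is left to its empty default: a parser is free to
            // deliver one run of text in several chunks, so it is not logged here.
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

Since the handler keeps only what it chooses to, memory use depends on the application and not on the size of the document.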


A Simple DTD

We do not go too deep into DTD syntax
  we just look at the example below and comment

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

DTD Declaration

In the example above, the DTD is declared as internal
  but could be declared separately

    <!DOCTYPE player SYSTEM "football_player.dtd">

  even referring to an external shared resource

    <!DOCTYPE player SYSTEM "http://…">

DTD Declarations

So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document, and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
Suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
Parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
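A hedged sketch of what validity against such content models means in practice, assuming the validating DOM parser bundled with the JDK. Validity errors are reported to an ErrorHandler and do not stop the parse. The class name DtdValidationDemo, the helper validityErrors and the two sample documents are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DtdValidationDemo {
    // Internal DTD subset: a sequence (,) with one repeated item (+).
    static final String DTD =
        "<!DOCTYPE player ["
      + "<!ELEMENT player (name, surname, team+)>"
      + "<!ELEMENT name (#PCDATA)>"
      + "<!ELEMENT surname (#PCDATA)>"
      + "<!ELEMENT team (#PCDATA)>"
      + "<!ATTLIST team current (yes | no) #REQUIRED>"
      + "]>";

    static final String VALID = DTD
      + "<player><name>Carlo</name><surname>Nervo</surname>"
      + "<team current=\"yes\">Bologna</team></player>";

    // Well-formed, but wrong order and no team: not valid.
    static final String INVALID = DTD
      + "<player><surname>Nervo</surname><name>Carlo</name></player>";

    // Number of validity errors reported while parsing xml.
    static int validityErrors(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setValidating(true); // check the document against its DTD
        DocumentBuilder b = f.newDocumentBuilder();
        final int[] count = {0};
        b.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            @Override public void error(SAXParseException e) { count[0]++; }
            @Override public void fatalError(SAXParseException e) throws SAXException {
                throw e; // not well-formed: give up
            }
        });
        b.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("valid doc errors:   " + validityErrors(VALID));
        System.out.println("invalid doc errors: " + validityErrors(INVALID));
    }
}
```

Both documents are well-formed, so both parse to the end. Only the second one makes the parser report validity errors.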

Attribute Declarations

A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
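A small sketch of how attribute defaults surface to an application, assuming the DOM parser bundled with the JDK, which reads the internal DTD subset even when it is not validating. The class name AttributeDefaultDemo, the helper attributeValue and the attributes sport and since are illustrative assumptions added to the players example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeDefaultDemo {
    // Hypothetical variant of the example: one literal default,
    // one #FIXED value, one #IMPLIED (optional) attribute.
    static final String XML =
        "<!DOCTYPE player ["
      + "<!ELEMENT player (team+)>"
      + "<!ELEMENT team (#PCDATA)>"
      + "<!ATTLIST team current (yes | no) \"no\""
      + "               sport CDATA #FIXED \"football\""
      + "               since CDATA #IMPLIED>"
      + "]>"
      + "<player><team>Mantova</team></player>";

    // Value of attribute attr on the first tag element ("" if absent).
    static String attributeValue(String xml, String tag, String attr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element e = (Element) doc.getElementsByTagName(tag).item(0);
        return e.getAttribute(attr);
    }

    public static void main(String[] args) throws Exception {
        // None of these attributes is written in the <team> start tag.
        System.out.println("current = " + attributeValue(XML, "team", "current"));
        System.out.println("sport   = " + attributeValue(XML, "team", "sport"));
        System.out.println("since   = " + attributeValue(XML, "team", "since"));
    }
}
```

The literal and #FIXED defaults are supplied by the parser, while the #IMPLIED attribute stays absent, which the DOM reports as an empty string.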

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to the ID of some element in the document

Other DTD Declarations

ENTITY declarations

  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">

NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces

What are Namespaces for?

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names



Page 43: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

DTD Declarations

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone else

like a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute value

NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFS

48

Other DTD

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

50

What are Distinguish

different XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, prefixes are not
  but conventional prefixes such as svg and rdf are usually left unchanged

53

Setting Default
xmlns attribute
  alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit

54
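To see that the URI identifies a namespace and the prefix does not, one can parse the same element once with a prefix and once with a default namespace and compare the results. This is a minimal sketch assuming the namespace-aware JAXP parser of the JDK; NamespaceSketch and its methods are hypothetical names, and the MathML namespace URI is added here for the set example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only: names are hypothetical, not taken from the slides.
class NamespaceSketch {

    // Parses the text with namespace processing switched on and returns the root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // Writes a name as {namespace URI}local_part, ignoring the prefix.
    static String expandedName(Node node) {
        return "{" + node.getNamespaceURI() + "}" + node.getLocalName();
    }

    static String rootName(String xml) {
        return expandedName(parseRoot(xml));
    }

    // Name of the child at the given index; assumes no whitespace between children.
    static String childName(String xml, int index) {
        return expandedName(parseRoot(xml).getChildNodes().item(index));
    }
}
```

With this helper, an svg element declared through a prefix and one declared through the default xmlns attribute yield the same expanded name, while svg:set and a MathML set element in the same document yield different ones.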

Internationalisation

55

What does Text
"Text" can be encoded according to many different alphabets
  mapping between characters and integers (code points)
    character set
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared

56
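The difference between a code point and its encodings can be made concrete with a few lines of Java. This is a small sketch using only the standard library; EncodingSketch, codePoint and hexBytes are hypothetical names chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: names are hypothetical, not taken from the slides.
class EncodingSketch {

    // Code point (integer) of the first character of the string.
    static int codePoint(String s) {
        return s.codePointAt(0);
    }

    // Bytes produced by the given encoding, written as upper-case hexadecimal.
    static String hexBytes(String s, Charset encoding) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(encoding)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8"; // LATIN SMALL LETTER E WITH GRAVE
        System.out.println(Integer.toHexString(codePoint(eGrave)));
        System.out.println(hexBytes(eGrave, StandardCharsets.UTF_8));
        System.out.println(hexBytes(eGrave, StandardCharsets.UTF_16BE));
        System.out.println(hexBytes(eGrave, StandardCharsets.ISO_8859_1));
    }
}
```

The same code point, U+00E8, becomes the two bytes C3 A8 in UTF-8, the two bytes 00 E8 in big-endian UTF-16, and the single byte E8 in ISO-8859-1, which is why a reader of the bytes needs to know the encoding.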

The XML Encoding
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

57

Multi-Lingual
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard: two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-

58
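A tool such as the spell-checker mentioned above has to find out which language applies to a given element. The sketch below assumes the XML rule that an xml:lang value also applies to the content of the element unless overridden, so it walks from the element up through its ancestors; XmlLangSketch and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch only: names are hypothetical, not taken from the slides.
class XmlLangSketch {

    // Nearest xml:lang value on the node or one of its ancestors, or null if none is set.
    static String languageOf(Node node) {
        for (Node cur = node; cur != null; cur = cur.getParentNode()) {
            if (cur instanceof Element && ((Element) cur).hasAttribute("xml:lang")) {
                return ((Element) cur).getAttribute("xml:lang");
            }
        }
        return null;
    }

    // Language in effect for every element with the given tag name, in document order.
    static List<String> languagesOf(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(languageOf(hits.item(i)));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

For a document whose root declares xml:lang="en" and whose second paragraph declares xml:lang="it", the first paragraph resolves to en by inheritance and the second to it.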

Encoding for
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

59

XML & CSS

60

XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

61

Cascading Style
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

62

CSS: An Example

63

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }

64

XML + CSS Example: The XML Doc

65

Example: How Mozilla Visualises it

66

Example: How Mozilla Visualises it

67

DOM & SAX

68

Manipulating XML
Representing information in an XML Document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Mostly, we need to manipulate / access / delete / modify
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

69

Document Object
http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

70

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

71

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

72

Properties &
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java

73
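The general node properties listed above can be tried out with the org.w3c.dom interfaces bundled with the JDK. The following is a minimal sketch that builds the player tree in memory instead of parsing a file; DomNodeSketch and its methods are hypothetical names, and the output format of describe is an arbitrary choice.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Illustrative sketch only: names are hypothetical, not taken from the slides.
class DomNodeSketch {

    // Builds <player><team current="yes">Bologna</team><team current="no">Mantova</team></player>.
    static Element buildPlayer() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element player = doc.createElement("player");
            doc.appendChild(player);
            player.appendChild(team(doc, "yes", "Bologna"));
            player.appendChild(team(doc, "no", "Mantova")); // appended after the existing child
            return player;
        } catch (Exception e) {
            throw new IllegalStateException("could not create document", e);
        }
    }

    private static Element team(Document doc, String current, String name) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.appendChild(doc.createTextNode(name));
        return team;
    }

    // Name, numeric type and value, the three properties every node has.
    static String describe(Node node) {
        return node.getNodeName() + "|" + node.getNodeType() + "|" + node.getNodeValue();
    }

    // All attributes of the node as name=value pairs; empty for nodes without attributes.
    static String attributesOf(Node node) {
        NamedNodeMap attrs = node.getAttributes(); // null unless the node is an element
        StringBuilder out = new StringBuilder();
        for (int i = 0; attrs != null && i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            out.append(a.getNodeName()).append('=').append(a.getNodeValue()).append(' ');
        }
        return out.toString().trim();
    }
}
```

An element node reports its tag as its name and has no value, whereas the text node inside it is named #text and carries the character data as its value.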

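A Simple Java DOM
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;  // assuming the Xerces DOM parser
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                         // args[0]: the XML document to load
      Document doc = p.getDocument();
      // look for the first recipe element among the children of the root
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                          // print(Node, PrintStream): helper not shown in the slides
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }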

74

Main Problem of
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

75

Simple API for XML
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools

76
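The event-based model can be observed by registering a handler that simply records each callback. This is a minimal sketch assuming the SAX parser available through JAXP in the JDK; SaxEventSketch and the format of the log entries are hypothetical choices made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only: names are hypothetical, not taken from the slides.
class SaxEventSketch {

    // Feeds the text to a SAX parser and returns the events in the order they fired.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return log;
    }
}
```

No tree is ever built: the handler only sees one event at a time, so memory use stays independent of document size as long as the handler itself keeps little state.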

Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

44

Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements

45

Page 46: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Attribute

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute value

NMTOKEN NMTOKENSmore than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFS

48

Other DTD

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

50

What are Distinguish

different XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes Example

a large firm could have a number of namespaces for different purposes

ltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not

53

Setting Default

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual Example a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tags

prefix x-58

Encoding for Portability

Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML however can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
  spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
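The association between a document and its style sheet is ordinary markup, so a program can read it back like any other node. A minimal sketch, assuming the JDK's DOM parser; the class name StylesheetPiDemo and the file name recipes.css are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

public class StylesheetPiDemo {
    // Returns the data of the first xml-stylesheet processing instruction among
    // the top-level nodes of the document, or null if there is none.
    static String stylesheetData(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE
                        && "xml-stylesheet".equals(((ProcessingInstruction) n).getTarget())) {
                    return ((ProcessingInstruction) n).getData();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet type=\"text/css\" href=\"recipes.css\"?><collection/>";
        System.out.println(stylesheetData(xml)); // type="text/css" href="recipes.css"
    }
}
```

The parser hands back type and href as one opaque string: they look like attributes, but for XML they are just the data of the processing instruction.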

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises It

DOM & SAX

Manipulating XML

Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate / access / delete / modify
  parts of an XML document
  which either may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM and SAX

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
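To see several of these node kinds side by side, a small document can be walked depth-first. This is a sketch under the assumption that the JDK's default DOM parser is used with its default settings, which keep CDATA sections as separate nodes; the class name NodeKindsDemo and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeKindsDemo {
    // Maps the numeric DOM node type to a readable label.
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE:               return "document";
            case Node.ELEMENT_NODE:                return "element";
            case Node.TEXT_NODE:                   return "text";
            case Node.CDATA_SECTION_NODE:          return "CDATA section";
            case Node.COMMENT_NODE:                return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default:                               return "other";
        }
    }

    // Depth-first walk that records the kind and name of every node.
    static void walk(Node node, List<String> out) {
        out.add(kind(node) + " " + node.getNodeName());
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
    }

    static List<String> describe(String xml) {
        try {
            List<String> out = new ArrayList<>();
            walk(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))), out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : describe("<doc><!--note--><p>text<![CDATA[raw]]></p></doc>")) {
            System.out.println(line);
        }
    }
}
```

Attributes do not show up in the walk: in DOM they are nodes, but not children of their element.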

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
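In the Java binding of DOM, the properties above become getter methods. A minimal sketch, using the DOM implementation shipped with the JDK; the class name NodeMethodsDemo and the recipe, title and ingredient element names are assumptions of the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NodeMethodsDemo {
    // Builds <recipe id="r1"><title/><ingredient/></recipe> in memory.
    static Element buildRecipe() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element recipe = doc.createElement("recipe");
            recipe.setAttribute("id", "r1");
            doc.appendChild(recipe);
            recipe.appendChild(doc.createElement("title"));
            recipe.appendChild(doc.createElement("ingredient")); // lands after title
            return recipe;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element recipe = buildRecipe();
        System.out.println(recipe.getNodeName());                // name: recipe
        System.out.println(recipe.getNodeValue());               // value: null for elements
        System.out.println(recipe.getNodeType());                // type: Node.ELEMENT_NODE
        System.out.println(recipe.getAttributes().getLength());  // number of attributes
        System.out.println(recipe.getLastChild().getNodeName()); // the last appended child
    }
}
```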

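A Simple Java DOM Example

// DOMParser is not part of the JDK; it is presumably Apache Xerces'
// org.apache.xerces.parsers.DOMParser. print(Node, PrintStream) is a helper
// that is not shown on the slide.
public static void main(String[] args) {
    try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);                  // args[0]: the XML document to load
        Document doc = p.getDocument();
        // scan the children of the root element for the first recipe element
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
            n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
            print(n, out);
        out.println("</collection>");
    } catch (Exception e) {
        e.printStackTrace();
    }
}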

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
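The event-based model can be sketched with the SAX parser that ships with the JDK. The handler below only logs the events it receives, in the order in which they arrive; the class name SaxEventsDemo and the sample document are hypothetical. Note that a SAX parser is allowed to deliver character content in more than one chunk, so real handlers usually accumulate it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsDemo {
    // Parses the document and returns the sequence of events seen by the handler.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument()   { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<a><b>hi</b></a>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: once an event has been handled, the parser keeps nothing about it, which is where the memory savings over DOM come from.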


Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  either it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
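A sketch of a literal default and an #IMPLIED attribute as seen through DOM, assuming the JDK's bundled parser, which applies defaults declared in the internal DTD subset even without validation. The class name AttributeDefaultsDemo, the note element and its lang and kind attributes are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeDefaultsDemo {
    static final String XML =
          "<!DOCTYPE note [\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "  <!ATTLIST note lang CDATA \"en\"\n"
        + "                 kind CDATA #IMPLIED>\n"
        + "]>\n"
        + "<note>hi</note>";

    // Parses without validation and returns the root element.
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element note = root(XML);
        System.out.println(note.getAttribute("lang"));  // the default from the DTD
        System.out.println(note.hasAttribute("kind"));  // #IMPLIED and absent
        // getSpecified() tells a defaulted attribute from one written in the document
        System.out.println(note.getAttributeNode("lang").getSpecified());
    }
}
```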

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespaces
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
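The ID and IDREF types only pay off when the document is validated. The sketch below switches DTD validation on in the JDK's parser and counts the validity errors it reports; the class name IdRefValidationDemo and the list / item vocabulary are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class IdRefValidationDemo {
    static final String DTD =
          "<!DOCTYPE list [\n"
        + "  <!ELEMENT list (item*)>\n"
        + "  <!ELEMENT item EMPTY>\n"
        + "  <!ATTLIST item id   ID    #REQUIRED\n"
        + "                 next IDREF #IMPLIED>\n"
        + "]>\n";

    // Parses with DTD validation switched on and counts the reported validity errors.
    static int validityErrors(String xml) {
        final int[] errors = {0};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors[0]++; }
                @Override public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // not well-formed: give up
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors[0];
    }

    public static void main(String[] args) {
        // every IDREF points at an existing ID
        System.out.println(validityErrors(DTD + "<list><item id='a' next='b'/><item id='b'/></list>"));
        // next refers to an ID that does not exist
        System.out.println(validityErrors(DTD + "<list><item id='a' next='zzz'/></list>"));
        // the same ID is used twice
        System.out.println(validityErrors(DTD + "<list><item id='a'/><item id='a'/></list>"));
    }
}
```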

Other DTD Declarations

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces


What are Namespaces for

Distinguish
  different XML applications may use the same names
  at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
  to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
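The set example can be tried out directly: once both vocabularies are bound to their namespace URIs, a program can ask for the SVG set or the MathML set without ambiguity. A minimal sketch with the JDK's DOM parser; the class name SetDisambiguationDemo, the wrapping doc element and the prefixes svg and m are assumptions of the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SetDisambiguationDemo {
    static final String SVG_NS = "http://www.w3.org/2000/svg";
    static final String MATHML_NS = "http://www.w3.org/1998/Math/MathML";

    // Counts the elements with the given local name inside the given namespace.
    static int count(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xmlns:svg='" + SVG_NS + "' xmlns:m='" + MATHML_NS + "'>"
                   + "<svg:set/><m:set/><m:set/></doc>";
        System.out.println(count(xml, SVG_NS, "set"));    // 1
        System.out.println(count(xml, MATHML_NS, "set")); // 2
    }
}
```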

Syntax for Namespaces

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes

Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">

Page 50: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Namespaces

50

What are Distinguish

different XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes

Example: a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
Typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but, by convention, well-known prefixes such as svg and rdf are usually kept unchanged
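
A minimal Java sketch of the point that the URI, not the prefix, identifies a namespace. The office element and the it prefix are hypothetical; the URIs are the ones from the company example. Two elements are treated as having the same name when namespace URI and local part coincide, whatever prefix was used.

```java
import java.io.StringReader;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class PrefixVersusUri {
    // Parses a small document (namespace-aware) and returns its root element.
    static Element root(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Same name = same namespace URI and same local part; the prefix is ignored.
    static boolean sameName(Element a, Element b) {
        return Objects.equals(a.getNamespaceURI(), b.getNamespaceURI())
            && Objects.equals(a.getLocalName(), b.getLocalName());
    }

    public static void main(String[] args) {
        Element a = root("<local:office xmlns:local='http://www.company.it/xml'/>");
        Element b = root("<it:office xmlns:it='http://www.company.it/xml'/>");
        Element c = root("<local:office xmlns:local='http://www.company.eu/xml'/>");
        System.out.println("different prefix, same URI: " + sameName(a, b));
        System.out.println("same prefix, different URI: " + sameName(a, c));
    }
}
```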

Setting Default Namespaces

xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
All the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
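
The effect of a default namespace can be checked with a short Java sketch, assuming the standard JAXP DOM parser; the rect child and the numeric sizes are made up for the example. It also shows a detail the slide does not mention: per the Namespaces in XML recommendation, a default namespace applies to elements only, so unprefixed attributes stay in no namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DefaultNamespace {
    static final String SVG = "http://www.w3.org/2000/svg";

    // Hypothetical SVG fragment using only the default namespace.
    static final String XML =
        "<svg xmlns='" + SVG + "' width='10' height='10'>"
      + "<rect width='5' height='5'/>"
      + "</svg>";

    // Returns the first element with the given local name, in any namespace.
    static Element first(String localName) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder()
                            .parse(new InputSource(new StringReader(XML)));
            return (Element) doc.getElementsByTagNameNS("*", localName).item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element rect = first("rect");
        System.out.println("rect namespace:  " + rect.getNamespaceURI());
        System.out.println("rect prefix:     " + rect.getPrefix());
        System.out.println("width namespace: "
            + rect.getAttributeNode("width").getNamespaceURI());
    }
}
```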

Internationalisation

What does Text Mean

"Text" can be encoded according to so many different alphabets
  this requires a mapping between characters and integers (code points)
  that is, a character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
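
The difference between a code point and its encoded bytes can be made concrete with a few lines of Java. The two sample characters are chosen only for illustration; UTF-16BE is used instead of plain UTF-16 so that no byte-order mark is added to the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointsAndBytes {
    // Number of bytes needed to store the text in the given encoding.
    static int byteLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String[] samples = { "A", "\u20AC" }; // LATIN CAPITAL LETTER A, EURO SIGN
        for (String s : samples) {
            // One character set (Unicode): a single code point per character...
            System.out.printf("U+%04X", s.codePointAt(0));
            // ...but several encodings, each mapping it to different bytes.
            System.out.printf("  UTF-8: %d byte(s)", byteLength(s, StandardCharsets.UTF_8));
            System.out.printf("  UTF-16: %d byte(s)%n", byteLength(s, StandardCharsets.UTF_16BE));
        }
    }
}
```

The same code point takes a different number of bytes depending on the encoding, which is why a parser has to be told which encoding a document uses.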

The XML Encoding Declaration

Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
  it is a text declaration, but no longer an XML declaration
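
A small Java sketch of why the declaration matters, assuming the standard JAXP DOM parser; the name element and its content are invented for the example. The document is handed to the parser as raw bytes, so the parser has to rely on the encoding declaration to decode the accented letter correctly.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncoding {
    // Hypothetical document declaring Latin-1 and containing a non-ASCII letter.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caff\u00E8</name>";

    // Encodes the text with the given charset, then parses the bytes.
    static Document parseBytes(String xml, Charset actualEncoding) {
        try {
            byte[] bytes = xml.getBytes(actualEncoding);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String rootText(String xml, Charset actualEncoding) {
        return parseBytes(xml, actualEncoding).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parseBytes(XML, StandardCharsets.ISO_8859_1);
        System.out.println("declared encoding: " + doc.getXmlEncoding());
        System.out.println("decoded content:   " + doc.getDocumentElement().getTextContent());
    }
}
```

If the bytes and the declaration disagree, the parser either reports an error or silently decodes the wrong characters, which is the kind of portability problem the declaration is meant to prevent.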

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-, such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
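
How a spell-checker might find the language of a subpart can be sketched in Java as follows. The sample document is invented; the lookup relies on the XML 1.0 rule that an xml:lang value applies to the element's content until a nested element overrides it, so the code walks up the ancestors until it finds the attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class EffectiveLanguage {
    // Hypothetical multi-lingual document: English by default, one Italian paragraph.
    static final String XML =
        "<doc xml:lang='en'><p>Hello</p><p xml:lang='it'><em>Ciao</em></p></doc>";

    // Returns the nearest xml:lang value on the node or its ancestors, or null.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element && ((Element) n).hasAttribute("xml:lang")) {
                return ((Element) n).getAttribute("xml:lang");
            }
        }
        return null;
    }

    // Convenience for the example: language of the first element with that tag name.
    static String languageOfFirst(String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return languageOf(doc.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("first p: " + languageOfFirst("p"));
        System.out.println("em:      " + languageOfFirst("em"));
    }
}
```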

Encoding for Portability

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS
Goals
  describing how to present elements of a document
  spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
  defining presentation style for the XML document tags
  tagname { property1: value1; … }
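
The association between document and style sheet is an ordinary processing instruction, which a program can read like any other node. A minimal Java sketch, assuming the standard JAXP DOM parser; the recipes.css file name and the collection element are placeholders for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class StylesheetLink {
    // Hypothetical document linked to a hypothetical style sheet.
    static final String XML =
        "<?xml version=\"1.0\"?>"
      + "<?xml-stylesheet type=\"text/css\" href=\"recipes.css\"?>"
      + "<collection/>";

    // Returns the data of the first xml-stylesheet instruction, or null if absent.
    static String stylesheetData(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The instruction sits in the prolog, i.e. among the document's children.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof ProcessingInstruction
                        && "xml-stylesheet".equals(((ProcessingInstruction) n).getTarget())) {
                    return ((ProcessingInstruction) n).getData();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stylesheetData(XML));
    }
}
```

Note that type and href only look like attributes: to the parser they are plain text inside the instruction, so a program has to split them itself.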

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it

DOM & SAX

Manipulating XML

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Mostly, we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of the following kinds
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
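
The tree view can be made tangible with a short Java sketch that walks a parsed document depth-first and records the kind and name of each node. The recipe document, including its comment and its render instruction, is invented for the example; only a few node kinds are labelled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeKinds {
    // Hypothetical document containing several kinds of node.
    static final String XML =
        "<recipe id='r1'><!-- note --><title>Pasta</title><?render fast?></recipe>";

    // Human-readable label for the node kinds used in this example.
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE: return "text";
            case Node.COMMENT_NODE: return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default: return "other";
        }
    }

    // Depth-first walk: the node itself, then each child from first to last.
    static void walk(Node n, List<String> out) {
        out.add(kind(n) + ":" + n.getNodeName());
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
    }

    static List<String> kinds() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            List<String> out = new ArrayList<>();
            walk(doc, out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : kinds()) System.out.println(line);
    }
}
```

The id attribute never shows up in the walk: in DOM, attributes are nodes but not children of their element, so they are reached through the element rather than through tree navigation.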

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
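
In the Java binding (org.w3c.dom) the general properties above become getter methods on Node. A minimal sketch, assuming the standard JAXP parser and an invented collection / recipe document: it reads name, value and type of an attribute node, then appends a new child.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeProperties {
    // Hypothetical document used by both examples below.
    static final String XML = "<collection><recipe id='r1'/></collection>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Name, value and type: the three properties every node has.
    static String describe(Node n) {
        return n.getNodeName() + " | " + n.getNodeValue() + " | " + n.getNodeType();
    }

    // The "attributes" property: all attributes of a node, looked up by name.
    static String idAttribute() {
        Node recipe = parse(XML).getDocumentElement().getFirstChild();
        NamedNodeMap attributes = recipe.getAttributes();
        return describe(attributes.getNamedItem("id"));
    }

    // appendChild puts the new node after the existing children.
    static String lastChildIdAfterAppend() {
        Document doc = parse(XML);
        Element collection = doc.getDocumentElement();
        Element extra = doc.createElement("recipe");
        extra.setAttribute("id", "r2");
        collection.appendChild(extra);
        return ((Element) collection.getLastChild()).getAttribute("id");
    }

    public static void main(String[] args) {
        System.out.println(idAttribute());
        System.out.println(lastChildIdAfterAppend());
    }
}
```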

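A Simple Java DOM Example

    // Needs: java.io.PrintStream, org.apache.xerces.parsers.DOMParser,
    //        org.w3c.dom.Document, org.w3c.dom.Node
    public static void main(String[] args) {
      try {
        // Parse the file named on the command line into a DOM tree
        DOMParser p = new DOMParser();
        p.parse(args[0]);
        Document doc = p.getDocument();
        // Scan the children of the root for the first node named "recipe"
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
          n = n.getNextSibling();
        // Print that recipe, if any, wrapped in a new collection element
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
          print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
        out.println("</collection>");
      } catch (Exception e) {
        e.printStackTrace();
      }
    }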

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as
    document started / ended
    elements started / ended
    character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
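
The event-based model can be sketched in Java with the SAX parser bundled in the JDK. The handler below simply logs each event in the order the parser reports it; the sample document and the log format are invented for the example. No tree is ever built: once an event has been handled, the parser keeps nothing about it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEvents {
    // Hypothetical document used by main.
    static final String XML =
        "<collection><recipe><title>Pasta</title></recipe><recipe/></collection>";

    // Parses the text and returns the sequence of events the parser generated.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                log.add("chars:" + new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events(XML)) System.out.println(event);
    }
}
```

An application that only needs, say, the recipe titles can react to the relevant events and ignore the rest, which is where the memory saving over DOM comes from.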

Page 54: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Setting Default

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

55

What does Text ldquoTextrdquo can be encoded according so many different alphabets

mapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodings

UTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated with any element
  determines the language of the element
Values are to be found in ISO 639
  standard two-letter codes for each known language
if not there, IANA
  prefix i-
  such as i-navajo, i-klingon, …
if not there either, such as for user-defined tags
  prefix x-
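A tool such as the spell-checker mentioned above could find the language in force for a node by looking for the nearest xml:lang on the node itself or on its ancestors, since per the XML specification the attribute also applies to the element's content unless overridden. The sketch below assumes that reading; XmlLangDemo, languageOf and the sample document are hypothetical names for illustration.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not from the slides.
class XmlLangDemo {
    static final String DOC =
        "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\"><em>Ciao</em></p></doc>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // needed to look attributes up by namespace
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Walk from the node towards the root and return the first xml:lang found.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element) {
                Element e = (Element) n;
                if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                }
            }
        }
        return ""; // no language information anywhere on the path
    }

    public static void main(String[] args) {
        Document doc = parse(DOC);
        System.out.println("first p: " + languageOf(doc.getElementsByTagName("p").item(0)));
        System.out.println("em:      " + languageOf(doc.getElementsByTagName("em").item(0)));
    }
}
```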

Encoding for Portability

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS approach

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://www.w3.org/Style/CSS
Goals
  describing how to present the elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

[Figure] CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS with XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
  defining the presentation style for the XML document tags
  tagname { property1: value1; … }
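The association between document and style sheet is an ordinary processing instruction, so any XML-aware program can read it back. As a small sketch, the Java code below looks for it among the top-level nodes of a parsed document. StylesheetPiDemo, stylesheetData and the file name style.css are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical demo class, not from the slides.
class StylesheetPiDemo {
    static final String DOC =
        "<?xml version=\"1.0\"?>"
      + "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?>"
      + "<note>hello</note>";

    // Return the data of the first xml-stylesheet instruction, or null if none.
    static String stylesheetData(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        // The instruction sits in the prolog, so it is a child of the document node.
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                if ("xml-stylesheet".equals(pi.getTarget())) {
                    return pi.getData();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(stylesheetData(DOC));
    }
}
```

Note that type and href look like attributes but are just text inside the instruction's data; a real tool would still have to parse them out.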

[Figure] XML + CSS Example: The XML Doc

[Figure] Example: How Mozilla Visualises It

DOM & SAX

Manipulating XML

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
We often need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model

http://www.w3.org/DOM
  W3C standard, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
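To make the node kinds concrete, here is a minimal Java sketch that parses a small document and prints its tree, one line per node with its kind and name. NodeKindsDemo, outline and the sample document are hypothetical; only some of the node kinds listed above appear in it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not from the slides.
class NodeKindsDemo {
    static final String DOC =
        "<?xml version=\"1.0\"?><!-- a comment -->"
      + "<recipe id=\"r1\"><title>Soup</title><![CDATA[1 < 2]]></recipe>";

    // Map the numeric node type onto a readable label.
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE: return "text";
            case Node.CDATA_SECTION_NODE: return "CDATA section";
            case Node.COMMENT_NODE: return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default: return "other (type " + n.getNodeType() + ")";
        }
    }

    // Depth-first walk over child nodes; attributes are not children in DOM.
    static void walk(Node n, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(kind(n)).append(": ").append(n.getNodeName()).append('\n');
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(DOC));
    }
}
```

The id attribute does not show up in the walk: attributes are nodes, but they are reached through the element rather than as its children.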

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
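In the Java binding, the attributes property and appendChild become getAttributes() and appendChild(). The following sketch exercises both on a tiny document; NodeMethodsDemo, newCollection, appendRecipe and the sample data are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not from the slides.
class NodeMethodsDemo {
    static Document newCollection() {
        String xml = "<collection><recipe id=\"r1\" serves=\"4\"/></collection>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Create a new recipe element, append it as last child of the root,
    // and return how many children the root has afterwards.
    static int appendRecipe(Document doc, String id) {
        Element recipe = doc.createElement("recipe");
        recipe.setAttribute("id", id);
        doc.getDocumentElement().appendChild(recipe);
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) {
        Document doc = newCollection();
        Node first = doc.getDocumentElement().getFirstChild();
        // name, value and type are available on every node
        System.out.println(first.getNodeName() + " type=" + first.getNodeType()
                + " value=" + first.getNodeValue());
        NamedNodeMap attrs = first.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            System.out.println("  @" + a.getNodeName() + " = " + a.getNodeValue());
        }
        System.out.println("children after append: " + appendRecipe(doc, "r2"));
    }
}
```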

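A Simple Java DOM Example

    // Needs java.io.PrintStream, org.w3c.dom.Document and org.w3c.dom.Node,
    // plus a DOMParser class (the slide shows no imports; the Xerces one is assumed).
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);                      // load the whole document
            Document doc = p.getDocument();
            // look for the first child of the root named "recipe"
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);                     // helper not shown on the slide
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }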

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
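The event-based model can be sketched with the SAX classes shipped with Java. The handler below only records the events it receives, in order, without ever building a tree; SaxEventsDemo, Recorder and events are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not from the slides.
class SaxEventsDemo {
    // The parser calls these methods while the text flows through it.
    static class Recorder extends DefaultHandler {
        final StringBuilder log = new StringBuilder();

        @Override public void startDocument() { log.append("startDocument\n"); }
        @Override public void endDocument() { log.append("endDocument\n"); }

        @Override public void startElement(String uri, String localName,
                                           String qName, Attributes atts) {
            log.append("startElement ").append(qName).append('\n');
        }

        @Override public void endElement(String uri, String localName, String qName) {
            log.append("endElement ").append(qName).append('\n');
        }

        @Override public void characters(char[] ch, int start, int length) {
            // A parser may deliver one run of text in several chunks.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) log.append("characters \"").append(text).append("\"\n");
        }
    }

    static String events(String xml) {
        Recorder recorder = new Recorder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), recorder);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return recorder.log.toString();
    }

    public static void main(String[] args) {
        System.out.print(events("<collection><recipe>Soup</recipe></collection>"));
    }
}
```

Nothing is kept in memory beyond what the handler chooses to store, which is the design trade-off against DOM: less memory, but no way to go back to a node once its events have passed.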

Page 59: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Encoding for Working around encoding is not simply an ldquointernationalisationrdquo issue

it is also about portabilityWhen transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portability

across platforms across applications across time

59

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things are needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
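The association between a document and its style sheet can also be produced programmatically. The sketch below, assuming the JDK's built-in JAXP implementation, creates the `xml-stylesheet` processing instruction through the DOM and serialises the result. The class name `StylesheetLink`, the root element `collection` and the file name `recipes.css` are illustrative assumptions.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ProcessingInstruction;

// Hypothetical helper; element and file names are placeholders.
class StylesheetLink {

    /** Builds an (empty) document whose prolog links the given CSS file. */
    static String build(String cssHref) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("collection");
            doc.appendChild(root);

            // The processing instruction must come before the root element.
            ProcessingInstruction pi = doc.createProcessingInstruction(
                    "xml-stylesheet",
                    "type=\"text/css\" href=\"" + cssHref + "\"");
            doc.insertBefore(pi, root);

            // Identity transformation: serialises the DOM tree as text.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("recipes.css"));
    }
}
```

A browser opening the resulting file would then look up `recipes.css` and apply its rules to the element types of the document.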

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises It

DOM & SAX

Manipulating XML

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
We often need to manipulate / access / delete / modify
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
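To make the node kinds listed above tangible, here is a small sketch that parses a document with the JDK's JAXP parser and walks the resulting tree, labelling each node by its type. The class `NodeKinds` and the sample document are illustrative assumptions; only a few of the node types are mapped, the rest fall under "other".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
class NodeKinds {

    /** Maps the numeric DOM node type to a readable label (subset only). */
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE:               return "document";
            case Node.ELEMENT_NODE:                return "element";
            case Node.TEXT_NODE:                   return "text";
            case Node.COMMENT_NODE:                return "comment";
            case Node.CDATA_SECTION_NODE:          return "CDATA section";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default:                               return "other";
        }
    }

    /** Depth-first walk: one line per node, indented by depth. */
    static void walk(Node n, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(kind(n)).append(' ').append(n.getNodeName()).append('\n');
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    /** Parses the given XML text and returns the indented description of its tree. */
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(describe(
            "<!-- menu --><recipe id=\"r1\"><title>Pie</title><![CDATA[a<b]]></recipe>"));
    }
}
```

Note that the `id` attribute does not show up in the walk: attributes are nodes, but they are not children of their element and have to be reached through `getAttributes()`.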

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
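In the Java binding (`org.w3c.dom`), the general properties above appear as getter methods on `Node`. The following sketch, assuming the JDK's JAXP parser, reads name, value, type and attributes of an element and then appends a new child. The class `NodeProperties`, its helper names and the sample element are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
class NodeProperties {

    /** Parses the XML text and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Creates <name>text</name> and appends it after the existing children. */
    static void addChild(Element parent, String name, String text) {
        Document doc = parent.getOwnerDocument();   // nodes are created by their document
        Element child = doc.createElement(name);
        child.appendChild(doc.createTextNode(text));
        parent.appendChild(child);
    }

    public static void main(String[] args) {
        Element recipe = parseRoot("<recipe id=\"r1\"><title>Pie</title></recipe>");

        // Name, value, type: for an element node the value is null.
        System.out.println(recipe.getNodeName() + " / " + recipe.getNodeValue()
                + " / " + recipe.getNodeType());

        // The "attributes" property is a NamedNodeMap of attribute nodes.
        NamedNodeMap attrs = recipe.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            System.out.println(a.getNodeName() + " = " + a.getNodeValue());
        }

        addChild(recipe, "comment", "Tasty");
        System.out.println("last child: " + recipe.getLastChild().getNodeName());
    }
}
```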

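A Simple Java DOM Example

    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;   // Xerces parser used on the slide
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    // The enclosing class and the print(Node, PrintStream) helper are not shown on the slide.
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);                     // load the whole document in memory
            Document doc = p.getDocument();
            // Scan the children of the root until the first <recipe> element
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);                    // emit the first recipe, if any
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }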
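The slide's program relies on the Xerces `DOMParser` class and on a `print` helper that is not shown. As a hedged, self-contained variant, the sketch below does the same job with the JAXP classes bundled with the JDK and with a deliberately simplified stand-in for `print` (elements, attributes and text only, no escaping of special characters). The class name `FirstRecipe` and the helper's behaviour are assumptions, not the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical JAXP-based variant of the slide's example.
class FirstRecipe {

    /** Simplified stand-in for the slide's print helper: no escaping, no namespaces. */
    static void print(Node n, StringBuilder out) {
        if (n.getNodeType() == Node.TEXT_NODE) {
            out.append(n.getNodeValue());
            return;
        }
        if (n.getNodeType() != Node.ELEMENT_NODE) return;   // skip comments, PIs, ...
        out.append('<').append(n.getNodeName());
        NamedNodeMap attrs = n.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            out.append(' ').append(a.getNodeName())
               .append("=\"").append(a.getNodeValue()).append('"');
        }
        out.append('>');
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) print(c, out);
        out.append("</").append(n.getNodeName()).append('>');
    }

    /** Returns a collection holding only the first recipe child of the root. */
    static String firstRecipe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe")) n = n.getNextSibling();
            StringBuilder out = new StringBuilder("<collection>");
            if (n != null) print(n, out);
            return out.append("</collection>").toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstRecipe(
            "<collection><description>Some recipes</description>"
          + "<recipe id=\"r1\"><title>Pie</title></recipe>"
          + "<recipe id=\"r2\"><title>Soup</title></recipe></collection>"));
    }
}
```

Even though only one recipe is printed, the whole collection is parsed into a tree first, which is exactly the cost discussed next.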

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
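The event-based model can be sketched with the SAX classes shipped with the JDK: a handler receives callbacks while the text flows through the parser, and no tree is ever built. The class `SaxEvents` and the log format are illustrative assumptions; whitespace-only character events are skipped to keep the trace short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names and log format are illustrative only.
class SaxEvents {

    /** Parses the XML text and returns one line per SAX event received. */
    static String trace(String xml) {
        final StringBuilder log = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.append("startDocument\n"); }
            @Override public void endDocument()   { log.append("endDocument\n"); }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.append("startElement ").append(qName).append('\n');
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.append("endElement ").append(qName).append('\n');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Only this slice of the buffer is valid during the callback.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.append("characters ").append(text).append('\n');
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.print(trace("<recipe><title>Pie</title></recipe>"));
    }
}
```

The handler keeps only what it chooses to keep (here, a log), so memory use does not grow with the size of the document the way a DOM tree does. The price is that the application must track any context it needs by itself.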

Page 60: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

XML amp CSS

60

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Cascading Style Sheets (CSS)

a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basics

if not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tags

nometag attributo1 valore1 hellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sorts

through ad hoc APIThe most used hated deprecated widespread are

69

Document Object httpwwww3orgDOM

standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 61: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Cascading Style Sheets (CSS)

a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basics

if not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tags

nometag attributo1 valore1 hellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sorts

through ad hoc APIThe most used hated deprecated widespread are

69

Document Object httpwwww3orgDOM

standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 62: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Cascading Style Cascading Style Sheets (CSS)

a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basics

if not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tags

nometag attributo1 valore1 hellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sorts

through ad hoc APIThe most used hated deprecated widespread are

69

Document Object httpwwww3orgDOM

standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 63: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tags

nometag attributo1 valore1 hellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sorts

through ad hoc APIThe most used hated deprecated widespread are

69

Document Object httpwwww3orgDOM

standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 64: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tags

nometag attributo1 valore1 hellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sorts

through ad hoc APIThe most used hated deprecated widespread are

69

Document Object httpwwww3orgDOM

standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 65: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sorts

through ad hoc APIThe most used hated deprecated widespread are

69

Document Object httpwwww3orgDOM

standard W3C as usualThe Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998

primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000

adds Namespace support and minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can contain

document doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Java73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 66: XML Conceptslia.deis.unibo.it/corsi/2006-2007/SD-LA/slides/4-xml-blank.pdf · such as XHTML, XSLT, XML Schema… A formally-defined text-based language verifiable for well-formedness

Example How Mozilla Visualises it

66

Example How Mozilla Visualises it

67

DOM amp SAX

68

Manipulating XML

Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
We often need to manipulate (access / delete / modify)
    parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
    through ad hoc APIs
The most used / hated / deprecated / widespread are DOM and SAX

Document Object Model (DOM)

http://www.w3.org/DOM/
    a W3C standard, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    possibly huge memory consumption
It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
        primitive navigation and manipulation of XML trees
        other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
        adds Namespace support and minor new features
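As a rough illustration of what Namespace support means for DOM code, the sketch below parses a small document and looks an element up by namespace URI and local name. It assumes the JAXP classes bundled with the JDK (not necessarily the parser used elsewhere in these slides); the class name NamespaceSketch, the sample document and the urn:example:... URIs are all made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceSketch {
    // Hypothetical sample: two vocabularies share the local name "title".
    static final String XML =
        "<lib xmlns:b='urn:example:book' xmlns:p='urn:example:person'>"
      + "<b:title>XML Concepts</b:title><p:title>Prof.</p:title></lib>";

    // Parses a string into a DOM tree with namespace processing enabled.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP leaves this off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Selects the "title" element that belongs to the book namespace only.
    static String bookTitle() {
        Document doc = parse(XML);
        Element e = (Element) doc
            .getElementsByTagNameNS("urn:example:book", "title").item(0);
        return e.getNamespaceURI() + " " + e.getLocalName() + " " + e.getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(bookTitle());
    }
}
```

The namespace-aware lookup tells the two title elements apart even though their local names are identical, which a plain lookup by tag name would have to do by comparing prefixes.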

DOM Nodes

An XML document is a tree
The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of the following kinds
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have
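To make the node kinds concrete, here is a minimal sketch that walks a DOM tree recursively and reports the kind and name of each node it meets. It assumes the JAXP parser bundled with the JDK; the class name NodeKindsSketch and the sample document are invented for illustration, and only some of the node kinds are mapped to labels.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeKindsSketch {
    // Hypothetical sample containing a comment, an element, text and a processing instruction.
    static final String SAMPLE =
        "<recipe><!--draft--><title>Soup</title><?render fast?></recipe>";

    // Maps the numeric node type onto a readable label (subset of the DOM node kinds).
    static String kind(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE: return "text";
            case Node.COMMENT_NODE: return "comment";
            case Node.CDATA_SECTION_NODE: return "CDATA section";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default: return "other";
        }
    }

    // Depth-first walk: one output line per node, indented by depth.
    static void walk(Node n, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(kind(n)).append(' ').append(n.getNodeName()).append('\n');
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(describe(SAMPLE));
    }
}
```

Note that attributes never show up in such a walk: in DOM they are nodes, but they are not children of their element and have to be reached through the element itself.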

Properties & Methods

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
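In the Java binding the general properties become getter methods on org.w3c.dom.Node. The following sketch, which assumes the JAXP parser bundled with the JDK, reads the attributes of an element and then appends a new child to it; the class name NodeOpsSketch, the sample document and the added note element are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

class NodeOpsSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String demo() {
        // Hypothetical sample: one element with two attributes and one child.
        Document doc = parse("<recipe id='r1' serves='4'><title>Soup</title></recipe>");
        Element recipe = doc.getDocumentElement();

        // "attributes" property: a map of all attribute nodes of this element.
        NamedNodeMap attrs = recipe.getAttributes();
        String serves = attrs.getNamedItem("serves").getNodeValue();

        // appendChild places the new node after the existing children.
        Element note = doc.createElement("note");
        note.appendChild(doc.createTextNode("serve hot"));
        recipe.appendChild(note);

        return recipe.getNodeName() + "|" + attrs.getLength() + "|" + serves + "|"
            + recipe.getLastChild().getNodeName() + "|"
            + recipe.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

New nodes are created through the owning Document (createElement, createTextNode) and only become part of the tree once they are attached, here with appendChild.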

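A Simple Java DOM Example

import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser; // assumption: DOMParser is the Xerces class
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public static void main(String[] args) {
    try {
        // Parse the file named on the command line into a DOM tree
        DOMParser p = new DOMParser();
        p.parse(args[0]);
        Document doc = p.getDocument();

        // Scan the children of the root element for the first <recipe>
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
            n = n.getNextSibling();

        // Emit a new document wrapping that recipe in <collection>
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
            print(n, out); // print(Node, PrintStream) is a helper not shown on the slide
        out.println("</collection>");
    } catch (Exception e) {
        e.printStackTrace();
    }
}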

Main Problem of DOM

The XML document is loaded as a whole, and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist

Simple API for XML (SAX)

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text document
    flowing through the SAX parser
    and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not so well supported as DOM is
    in terms of standardisation
    as well as of tools
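The event-based model can be sketched with a handler that simply records each event as the parser reports it. This assumes the SAX parser bundled with the JDK, obtained through JAXP; the class name SaxSketch, the log format and the sample document are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxSketch {
    // Hypothetical sample document.
    static final String SAMPLE = "<recipe><title>Soup</title></recipe>";

    // Parses the text and returns the sequence of events the parser generated.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String s = new String(ch, start, length).trim();
                if (!s.isEmpty()) log.add("text " + s);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events(SAMPLE)) System.out.println(event);
    }
}
```

No tree is ever built here: the handler sees each piece of the document once, in document order, and anything the application wants to remember it has to store itself.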

