XML for Information Management
Day 4: Logical and Physical Structure of XML Documents
Airi Salminen
University of Erlangen-Nuremberg
Computational Linguistics
Instructor: Professor Airi Salminen
http://users.jyu.fi/~airi/
12.1.-16.1. 2009
Outline
1. Components of the logical structure
2. XML documents as trees
3. Entity types
4. Entity declarations and references
5. XML processor treatment of entity references
6. Motivations for the use of entities
1. Components of the logical structure
• declarations
• elements
• comments
• processing instructions
document ::= prolog element Misc*
• prolog: declarations, comments, processing instructions
• element: elements, comments, processing instructions
• Misc: comments, processing instructions
Declarations (numbers in brackets refer to productions of the XML specification):
‣ XML declaration [23]
‣ document type declaration [28]
‣ markup declaration [29]
• element type declaration [45] (to constrain the logical structure)
• attribute list declaration [52] (to constrain the logical structure)
• entity declaration [70] (to constrain the physical structure)
• notation declaration [82] (to constrain the physical structure)
‣ encoding declaration [80]
‣ standalone document declaration [32]
‣ text declaration [77]
Typical element type declarations:
element content defined:
<!ELEMENT product (mfg, model, description, clock?)>
mixed content defined:
<!ELEMENT model (#PCDATA)>
<!ELEMENT description (#PCDATA | feature)*>
empty element defined:
<!ELEMENT clock EMPTY>
empty element defined:
<!ELEMENT clock EMPTY>
two forms of the element allowed in a well-formed document:
<clock></clock>
<clock/>
element content: definition by content models with metasymbols
*   iteration (none or more)
+   iteration (once or more)
|   alternatives
?   optional
,   successive
( ) grouping
#PCDATA is not accepted in the content model!
Example from XHTML 1.0 Strict DTD:
<!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
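A content model is in effect a regular expression over child element names. The following Python sketch, which is not part of the course material, translates the table content model from the XHTML 1.0 Strict DTD into a regular expression and tests sequences of child names against it. The helper names content_model_to_regex and children_match are illustrative assumptions, and the sketch does not check that the model is deterministic, as XML requires.

```python
import re

# Content model of <table> from the XHTML 1.0 Strict DTD.
TABLE_MODEL = "(caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))"


def content_model_to_regex(model):
    """Translate an element-content model into a compiled regular expression."""
    # Each element name becomes a group matching that name followed by ';'.
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + ";)",
        model,
    )
    # ',' means "successive", which is plain concatenation in a regex;
    # whitespace in the model is insignificant.
    pattern = re.sub(r"[,\s]", "", pattern)
    return re.compile(pattern)


def children_match(model, children):
    """Return True if the sequence of child element names fits the model."""
    encoded = "".join(name + ";" for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None


print(children_match(TABLE_MODEL, ["caption", "tr", "tr"]))   # fits the model
print(children_match(TABLE_MODEL, ["tr", "caption"]))         # caption too late
```

The metasymbols ?, *, +, | and ( ) carry over unchanged because they mean the same thing in both notations; only the comma has to be dropped.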
mixed content: definition has basically two forms
(#PCDATA)
(#PCDATA | e1 | … | en)*
#PCDATA is always included in the content specification and comes first in the list of alternatives
examples:
<!ELEMENT text (#PCDATA)>
<!ELEMENT section (#PCDATA | subsection)*>
<!ELEMENT section (#PCDATA | subsection | paragraph)*>
Attribute list declarations
• to define the set of attributes pertaining to a given element type
• to establish type constraints for these attributes
• to provide default values for attributes
<!ATTLIST poem author CDATA #REQUIRED >
element type: poem
attribute name: author
attribute type: string (CDATA)
constraint: the attribute must be specified for all elements of type poem (#REQUIRED)
Defining constraints
[60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue)
#REQUIRED: attribute must always be provided in all elements of the given type
#IMPLIED: attribute can be provided in an element; no default value is provided
AttValue: default value is given between single or double quotes
#FIXED AttValue: instances of the attribute must match the given default value
Attribute types
[54] AttType ::= StringType | TokenizedType | EnumeratedType
tokenized types:
• ID: names that uniquely identify elements
• IDREF, IDREFS: references to ID type identifiers
• ENTITY, ENTITIES: entity names
• NMTOKEN, NMTOKENS: text tokens consisting of characters accepted in names
enumerated types:
• NOTATION: identifies notations
• enumeration
<?xml version="1.0"?>
<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line
   id      ID     #REQUIRED
   seeline IDREFS #IMPLIED>
]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>
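The ID and IDREFS constraints of this example are enforced only by a validating processor. As a rough illustration that is not part of the course material, the Python sketch below parses the document with the non-validating xml.etree.ElementTree module and performs the two checks by hand: every id value is unique, and every token in seeline names an existing id. The function name check_id_references is an assumption, and the attribute names are hard-coded because ElementTree does not expose the attribute types declared in the DTD.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line
   id      ID     #REQUIRED
   seeline IDREFS #IMPLIED>
]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>"""


def check_id_references(xml_text):
    """Return a list of ID/IDREFS problems found in the line elements."""
    root = ET.fromstring(xml_text)
    lines = list(root.iter("line"))
    ids = [line.get("id") for line in lines]
    problems = []
    # ID: each value may be used only once in the whole document.
    for value in sorted({i for i in ids if i is not None and ids.count(i) > 1}):
        problems.append("duplicate id: " + value)
    # IDREFS: a whitespace-separated list of names, each matching some ID.
    for line in lines:
        for ref in line.get("seeline", "").split():
            if ref not in ids:
                problems.append("dangling reference: " + ref)
    return problems


print(check_id_references(DOC))
```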
2. XML documents as trees
XML-aware web browsers support the visualization of the hierarchic structure. Example:
<Chapter section='1'>
  <Narration narrator='Benjy'>
    <Imagery place='tree' mode='simile' sense='smell'>
      <Fragment code='1.12'>
        <Paragraph id='143'><Subject person='Caddy'>She</Subject> smelled like trees.</Paragraph>
      </Fragment>
    </Imagery>
  </Narration>
</Chapter>
The XML specification defines a concrete syntax for XML documents.
W3C has defined four slightly different abstract models to describe the abstract syntax of XML documents:
• XML Information Set
• DOM model
• XPath 1.0 model
• XQuery 1.0 and XPath 2.0 data model
Analysis of differences in the models: Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. Proc. of the ACM Symposium on Document Engineering (DocEng '01) (pp. 85-94). New York: ACM Press.
<poem author="Murasaki Shikibu" born="974">
<!-- The poem is translated from Japanese by Kenneth Rexroth -->
<line>This life of ours would not cause you sorrow</line>
<line>if you thought of it as like</line>
<line>the mountain cherry blossoms</line>
<line>which bloom and fade in a day. </line>
</poem>
Node types of XPath 1.0 (tree of the poem document):
Root node
  Element node: poem
    Attribute node: author = Murasaki Shikibu
    Attribute node: born = 974
    Comment node: The poem is translated from Japanese by Kenneth Rexroth
    Element node: line
      Text node: This life of ours would not cause you sorrow
    Element node: line
      Text node: if you thought of it as like
    Element node: line
      Text node: the mountain cherry blossoms
    Element node: line
      Text node: which bloom and fade in a day.
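A similar outline of the poem document can be produced programmatically. The Python sketch below is an illustration added here, not course material. It uses the DOM implementation xml.dom.minidom from the standard library, so it shows the DOM model rather than the XPath 1.0 model itself; the DOM Document node plays the role of the XPath root node. The function name outline is an assumption.

```python
from xml.dom import minidom, Node

POEM = (
    '<poem author="Murasaki Shikibu" born="974">'
    '<!-- The poem is translated from Japanese by Kenneth Rexroth -->'
    '<line>This life of ours would not cause you sorrow</line>'
    '<line>if you thought of it as like</line>'
    '<line>the mountain cherry blossoms</line>'
    '<line>which bloom and fade in a day.</line>'
    '</poem>'
)


def outline(node, depth=0):
    """Return indented 'node type: value' lines for node and its descendants."""
    pad = "  " * depth
    lines = []
    if node.nodeType == Node.DOCUMENT_NODE:
        lines.append(pad + "root node")
    elif node.nodeType == Node.ELEMENT_NODE:
        lines.append(pad + "element node: " + node.tagName)
        # Attributes are attached to their element, not listed as children.
        for name, value in sorted(node.attributes.items()):
            lines.append(pad + "  attribute node: " + name + " = " + value)
    elif node.nodeType == Node.COMMENT_NODE:
        lines.append(pad + "comment node: " + node.data.strip())
    elif node.nodeType == Node.TEXT_NODE:
        lines.append(pad + "text node: " + node.data.strip())
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(minidom.parseString(POEM))))
```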
3. Entity types
The physical structure of an XML document consists of entities.
An entity is a unit recognized by the XML processor; the content of an entity is text or some other kind of data.
3-dimensional categorization:
parsed entities -- unparsed entities
internal entities -- external entities
general entities -- parameter entities
parsed entity
• intended to be parsed by the XML processor; content consists of marked-up text
unparsed entity
• not intended to be parsed by the XML processor; content can be any kind of data
internal entity
• name and value given in an entity declaration
• always a parsed entity
external entity
• not internal
• parsed or unparsed
general entity
• used in elements and attributes
• parsed or unparsed
• internal or external
parameter entity
• used in the document type definition
• always parsed
• internal or external
Alternatives
parsed:
• internal parameter
• internal general
• external parameter
• external general
unparsed:
• external general
INPUT FILES for XML processing:
• root entity, external subset of DTD
• other files intended for XML processing
INTERNAL ENTITIES:
• name and textual content given in DTD
UNPARSED ENTITIES:
• files not intended for XML processing but referred to by entity references in the INPUT FILES
The XML processor passes to the application information about:
• elements and attributes
• comments
• processing instructions
• character data
• namespaces
• notations and locations of unparsed entities
4. Entity declarations and references
EntityDecl ::= GEDecl | PEDecl
GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
PEDecl ::= '<!ENTITY' S '%' Name S PEDef S? '>'
EntityDef ::= EntityValue | ( ExternalID NDataDecl?)
PEDef ::= EntityValue | ExternalID
EntityValue: entity definition for an internal entity
ExternalID (with NDataDecl for an unparsed entity): entity definition for an external entity
internal entity
name and value ( = literal value) given
<!ENTITY % Shape "(rect | circle | poly | default )">
<!ENTITY JY "Jyväskylän yliopisto">
In each declaration the name (Shape, JY) is followed by the literal value between quotes.
external entity
name and system identifier (possibly together with public identifier) given; for an unparsed entity also the notation
<!ENTITY virtuaaliyliopistouutiset SYSTEM "http://virtuaaliyliopisto.jyu.fi/kotisivut/sisalto/etusivu/newsfeed.xml">
Declarations from XHTML specification (http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html):
<!ENTITY % HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN"
   "xhtml-symbol.ent">
<!ENTITY % HTMLspecial PUBLIC "-//W3C//ENTITIES Special for XHTML//EN"
   "xhtml-special.ent">
Unparsed entity
<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>
Here gif is the notation name. The notation must have been declared, for example:
<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN" >
References to parameter entities:
%Shape;
%HTMLsymbol;
References to parsed general entities:
&JY;
&virtuaaliyliopistouutiset;
Reference to an unparsed general entity:
<poem image="image1">
The type of the attribute has to be ENTITY or ENTITIES
In addition to entity references, XML documents may contain character references.
A character reference refers to a specific character of Unicode.
It provides a decimal or hexadecimal representation of the character's code point in Unicode.
Example: &#34;
One-character entity defined: <!ENTITY quot "&#34;">
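The effect of character references can be observed with any XML parser. The short Python sketch below, added here as an illustration, parses two small documents with xml.etree.ElementTree; the element names q and name are arbitrary assumptions. It shows that the decimal form, the hexadecimal form and the predefined entity quot all yield the same character, and that the same mechanism works for any Unicode code point.

```python
import xml.etree.ElementTree as ET

# Decimal reference, hexadecimal reference and predefined entity
# all stand for the quotation mark (code point 34, hexadecimal 22).
quotes = ET.fromstring("<q>&#34; &#x22; &quot;</q>")
print(quotes.text)

# Code point 228 is the letter a with diaeresis.
name = ET.fromstring("<name>Jyv&#228;skyl&#228;n yliopisto</name>")
print(name.text)
```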
Where can an entity or character reference occur?
reference to: parameter entity
can occur in:
‣ document type definition
reference to: parsed general entity
can occur in:
‣ element content
‣ attribute value (either in the start-tag or in the attribute definition)
‣ entity value
reference to: unparsed general entity
can occur in:
‣ attribute value (either in the start-tag or in the attribute definition)
reference to: character
can occur in:
‣ element content
‣ attribute value (either in the start-tag or in the attribute definition)
‣ entity value
5. XML processor treatment of entity references
References to unparsed entities
A validating processor makes the identifiers for the entities and associated notations available to the application.
<poem image="figure1">
<!-- From a poem of Aale Tynni -->
<line>Seisoin ikkunassa ja nauroin. Ihana puu.</line>
<line>Ihana pesä.</line>
</poem>
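How a processor hands this information to the application can be sketched with Python's xml.parsers.expat module. This is an added illustration under stated assumptions: Expat is a non-validating processor, but it still reports entity and notation declarations through callbacks. The example reuses the image1 and gif declarations shown earlier; the function name collect_unparsed and the ATTLIST declaration for poem are assumptions made for the sketch.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE poem [
<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">
<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>
<!ATTLIST poem image ENTITY #IMPLIED>
]>
<poem image="image1"/>"""


def collect_unparsed(xml_text):
    """Return (unparsed entities, notations) declared in the document."""
    unparsed = {}
    notations = {}

    def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
        # Only unparsed entities carry a notation name (NDATA).
        if notation is not None:
            unparsed[name] = (system_id, notation)

    def on_notation(name, base, system_id, public_id):
        notations[name] = public_id

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity
    parser.NotationDeclHandler = on_notation
    parser.Parse(xml_text, True)
    return unparsed, notations


unparsed, notations = collect_unparsed(DOC)
print(unparsed)
print(sorted(notations))
```

The processor does not fetch or interpret the image file; the application receives only the system identifier and the notation name and decides what to do with them.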
References to parsed entities
Dealing with two kinds of entity values:
literal value - the character string written between quotes in the entity definition
replacement text - derived by replacing the character references and parameter entity references in the literal value by their character values and replacement texts, respectively.
The XML processor replaces the entity reference by its replacement text.
entity declaration:
<!ENTITY rhyme1 "<rhyme xml:lang='fi'><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>">
replacement text = literal value:
<rhyme xml:lang='fi'><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>
entity reference:
<rhymecollection>&rhyme1; </rhymecollection>
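The replacement of the reference can be observed with a parser. The Python sketch below is an added illustration, not course material. It assumes that the rhyme1 declaration is placed in the internal DTD subset of a document whose root element is rhymecollection, and it writes the xml:lang value in single quotes so that it can appear inside the double-quoted entity value. After parsing with xml.etree.ElementTree, the rhyme element and its lines appear in the tree as if they had been written in place of the reference.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE rhymecollection [
<!ENTITY rhyme1 "<rhyme xml:lang='fi'><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>">
]>
<rhymecollection>&rhyme1;</rhymecollection>"""

root = ET.fromstring(DOC)

# The entity reference has been replaced by the replacement text,
# which was then parsed as ordinary element content.
first_child = root[0].tag
rhyme_lines = [line.text for line in root.iter("line")]
print(first_child)
print(rhyme_lines)
```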
Declarations from XHTML specification (http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html):
<!ENTITY % StyleSheet "CDATA"> <!-- style sheet data -->
<!ENTITY % Text "CDATA"> <!-- used for titles etc. -->
<!ENTITY % coreattrs "id ID #IMPLIED class CDATA #IMPLIED
   style %StyleSheet; #IMPLIED title %Text; #IMPLIED">
literal value of coreattrs:
   id ID #IMPLIED class CDATA #IMPLIED
   style %StyleSheet; #IMPLIED title %Text; #IMPLIED
replacement text of coreattrs:
   id ID #IMPLIED class CDATA #IMPLIED
   style CDATA #IMPLIED title CDATA #IMPLIED
Exercise 10 (Course Text, Chapter 5)
Entity declaration from XHTML Strict-DTD:
<!ENTITY % Block "(%block; | form | %misc; )*">
What is the (a) literal value and (b) replacement text of entity Block?
(a) literal value: (%block; | form | %misc; )*
Other entity declarations needed from the DTD.
Declarations from XHTML specification (http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html):
<!ENTITY % heading "h1| h2| h3| h4| h5| h6">
<!ENTITY % lists "ul | ol | dl">
<!ENTITY % blocktext "pre | hr | blockquote | address">
<!ENTITY % block "p | %heading; | div | %lists; | %blocktext; | fieldset | table">
<!ENTITY % misc.inline "ins | del | script">
<!ENTITY % misc "noscript | %misc.inline;">
Deriving the replacement text of Block: references to parameter entities in the literal value (%block; | form | %misc; )* are replaced by their replacement texts.
Literal value of block: p | %heading; | div | %lists; | %blocktext; | fieldset | table
Replacement text of block: p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote | address | fieldset | table
Literal value of misc: noscript | %misc.inline;
Replacement text of misc: noscript | ins | del | script
(b) Replacement text of Block: (p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote | address | fieldset | table | form | noscript | ins | del | script )*
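The derivation is mechanical and can be sketched in a few lines of Python. The sketch below is an added illustration, not course material: the literal values are copied from the XHTML declarations, while the names DECLARATIONS, PE_REFERENCE and replacement_text are assumptions. It handles only parameter entity references; character references, which a real processor would also replace, are left out.

```python
import re

# Literal values of the parameter entities, as declared in the XHTML Strict DTD.
DECLARATIONS = {
    "heading": "h1| h2| h3| h4| h5| h6",
    "lists": "ul | ol | dl",
    "blocktext": "pre | hr | blockquote | address",
    "block": "p | %heading; | div | %lists; | %blocktext; | fieldset | table",
    "misc.inline": "ins | del | script",
    "misc": "noscript | %misc.inline;",
    "Block": "(%block; | form | %misc; )*",
}

# A parameter entity reference has the form %Name;
PE_REFERENCE = re.compile(r"%([A-Za-z_][\w.-]*);")


def replacement_text(name, declarations=DECLARATIONS):
    """Derive the replacement text of a parameter entity from its literal value."""
    literal = declarations[name]
    # Each reference is replaced by the replacement text of the entity it names.
    return PE_REFERENCE.sub(
        lambda m: replacement_text(m.group(1), declarations), literal
    )


print(replacement_text("misc"))
print(replacement_text("Block"))
```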
6. Motivations for the use of entities
The use of entities supports:
• use of non-textual data (audio, graphics, etc.) in XML documents (although such data can also be added through stylesheets)
• modularization of documents
• consistency
• multiuse of definitions
• adding semantic information by informative entity names and comments attached to entity declarations