
Develop Python/XML with 4Suite, Part 1: Process XML with PyXml

Explore the PyXml implementation of DOM Level 2

Skill Level: Introductory

Uche Ogbuji ([email protected]), Principal consultant, Fourthought Inc.

Chimezie Ogbuji ([email protected]), Software consultant, Fourthought Inc.

17 Oct 2001

The first in a series, this tutorial covers PyXml, an implementation of the W3C's DOM Level 2 specification contained in Fourthought's 4Suite. 4Suite is an open source, comprehensive library and toolkit for XML processing in Python, and it implements various open standards related to XML. This series of tutorials introduces 4Suite and gives practical examples of XML development using 4Suite.

Section 1. Introduction

Who should read this tutorial?

Many XML-related technologies are no more than specifications for a standard way for information to be processed and interpreted. The Document Object Model (DOM) is no different: it provides a standard way for users to access and manipulate an XML document. There are various DOM implementations, in various computer languages. Here, we introduce Fourthought's PyXml, an implementation written in Python.

This tutorial is written for users who are familiar with XML/DOM and Python, with more emphasis on the latter. Since DOM is no more than an Application Programming Interface (API), knowledge of Python is more important than knowledge of XML to follow this tutorial. Minimal knowledge of XML (specifically DOM) is also necessary, since the focus is on a DOM tool, not the DOM specification itself. You should keep reference material within reach, and you will find it very useful to have a copy of the DOM specification open. Appropriate links and sources are listed in the Resources section at the end of the tutorial.

© Copyright IBM Corporation 1994, 2008. All rights reserved.

Required software

The following packages are required to follow this tutorial:

• Python 2.1: Python is the object-oriented language that PyXml is written in.

• PyXml: PyXml is one of the many DOM implementations. This one is written in Python and is packaged as a Python library.

In addition, you should read the DOM Level 2 Specification, which underlies this tutorial to a large extent.

Getting help and finding out more

For technical questions about the content of this tutorial, contact the authors, Uche Ogbuji ([email protected]) and Chimezie Ogbuji ([email protected]).

Section 2. Basic intro to DOM and basic PyXml manipulations

DOM: The specification

Since PyXml is one of many implementations of DOM, this tutorial begins with an introduction to the DOM specification. Going from having a general knowledge of DOM to learning how to use PyXml should be a relatively small step.

What is the DOM?

DOM is essentially a platform-independent API for HTML and XML documents. It defines a common means to access and manipulate these documents. A developer can use the DOM API to add, remove, and update most of the parts that make up an HTML or XML document.

developerWorks® ibm.com/developerWorks


http://www.python.org/download

http://sourceforge.net/projects/pyxml

http://www.w3.org/DOM/



Platform-independent manipulation of XML/HTML is one of the most important goals of the DOM specification. This encourages multiple implementations in various languages and makes it easier for a user to switch between DOM implementations with little impact.

DOM interfaces

The DOM API consists of eight interfaces:

1. DOM Core

2. DOM HTML

3. DOM Views

4. DOM StyleSheets

5. DOM CSS

6. DOM Events

7. DOM Traversal

8. DOM Range

This tutorial covers only some of these interfaces: DOM Core, DOM Traversal, and DOM Events, which are the subjects of the sections that follow. This limitation has nothing to do with the specification itself, but with Fourthought's implementation. PyXml, like most open-source products, has been shaped by the Python/XML community that uses it; users have found the "core" interfaces sufficient for most document processing needs.

Nodes and the DOM tree

Figure 1. Visual illustration of the piece of HTML below




The best introduction to the various components of a DOM tree is a simple HTML example.

The diagram in Figure 1 is a visual illustration of the piece of HTML below. The tree-like structure is the most commonly used conceptual tool for data structures like HTML. HTML tags are organized in hierarchical fashion, and the "nodes" of the visual tree represent these tags and how they are organized.

    <table>
      <tr>
        <td>Unordered List:
          <ul>
            <li>List Item 1</li>
            <li>List Item 2</li>
          </ul>
        </td>
      </tr>
    </table>
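To make the tree structure concrete, here is a minimal sketch that parses the fragment above and lists its elements, indented by depth. It assumes Python 3 and uses the standard library's xml.dom.minidom as a stand-in DOM implementation rather than PyXml; the helper name outline is hypothetical.

```python
from xml.dom import minidom, Node

HTML = """<table>
  <tr>
    <td>Unordered List:
      <ul>
        <li>List Item 1</li>
        <li>List Item 2</li>
      </ul>
    </td>
  </tr>
</table>"""


def outline(node, depth=0):
    """Return one indented line per element at or below the given node."""
    lines = []
    if node.nodeType == Node.ELEMENT_NODE:
        lines.append("  " * depth + node.nodeName)
        depth += 1
    for child in node.childNodes:
        lines.extend(outline(child, depth))
    return lines


doc = minidom.parseString(HTML)
tree_lines = outline(doc)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one parent-child step in the tree of Figure 1.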

Overview of the DOM Core Interfaces

This section covers the first set of interfaces in DOM: the DOM Core Interfaces.

DOM represents a document as a hierarchy (a tree, in particular) of Node objects. Node objects are abstract units of information that contribute to the meaning (or semantics) of the entire document by how they are organized with respect to neighboring Node objects. Nodes have a simple parent-child relationship. A node has a mandatory parent (unless it is the root of the tree) and any number of optional children. Most readers with a basic understanding of software data structures should find this metaphor very familiar.

DOM defines various types of nodes, but all the nodes share a common interface (the Node interface). These nodes vary in the number and kind of children they can hold. They also define unique interfaces for their specific functionality. Here are the 12 types of nodes defined in the current DOM Level 2 specification (of which Element is the most common):

• Document

• DocumentFragment

• DocumentType

• EntityReference

• Element

• Attr

• ProcessingInstruction

• Comment

• Text

• CDATASection

• Entity

• Notation

The "tree of nodes" approach

The Node interface is sufficient for handling most types of nodes. This gives the user a simpler picture of a DOM tree.

A DOM document places certain constraints on how nodes relate to each other. For example, a document element typically has a parent, child elements, and attribute nodes. The Element interfaces can be used to retrieve element attributes and child elements. In particular, the doc.documentElement.getElementsByTagNameNS and doc.documentElement.getElementsByTagName interfaces can be used to retrieve descendant elements by name.

However, by viewing a DOM hierarchy as a homogeneous tree of nodes, a user can intuitively manipulate XML/HTML documents. It also allows direct access to object instances without expensive object-oriented introspection and class casting.




Below is a list of the interfaces that define a Node object. The insertBefore, appendChild, and replaceChild interfaces are the primary means of inserting nodes into a tree. removeChild (as the name suggests) removes a specified node from a tree. For more detailed descriptions of each interface discussed in this tutorial, readers should refer to the DOM Level 2 Specification, listed in Resources.

Node

• insertBefore(Node newChild, Node refChild)

• replaceChild(Node newChild, Node oldChild)

• removeChild(Node oldChild)

• appendChild(Node newChild)

• hasChildNodes()

• cloneNode(boolean deep)

• normalize()

• isSupported(DOMString feature, DOMString version)
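The tree-editing methods in this list can be tried out in a few lines. The following is a minimal sketch, assuming Python 3 and the standard library's xml.dom.minidom rather than PyXml; the element names are invented for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<list><b/><c/></list>")
root = doc.documentElement
b, c = root.childNodes

# insertBefore: place a new <a/> in front of <b/>.
a = doc.createElement("a")
root.insertBefore(a, b)

# appendChild: add <d/> at the end of the child list.
d = root.appendChild(doc.createElement("d"))

# replaceChild: swap <c/> for <x/>; the old child is returned.
old = root.replaceChild(doc.createElement("x"), c)

# removeChild: detach <b/>; the removed node is returned.
removed = root.removeChild(b)

names = [n.nodeName for n in root.childNodes]
print(names)
print(root.hasChildNodes())
```

Note that replaceChild and removeChild hand back the node that left the tree, so it can be reused or inserted elsewhere.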

Section 3. Reading DOM from an XML file

Introduction to DOM parsing

This section focuses on the details of parsing XML to create a DOM document. The DOM interfaces deal with an existing XML document. Typically, XML documents are parsed from XML text rather than built from scratch with the DOM interfaces provided for constructing a DOM tree.

We will give a brief overview of PyXml's parsing tools. All code snippets are copied directly from a session with a Python interpreter. You should enter lines preceded by the Python interpreter prompt (>>>) directly into the Python interpreter, which should generate the response shown.

XmlDomGenerator: Heart of PyXml's XML parsing

The XmlDomGenerator class (of the xml.dom.ext.reader.Sax2 module) is solely responsible for handling Simple API for XML (SAX) events generated by the XML parser as it comes across recognizable portions of the XML text being parsed. SAX is another standard defined for the handling and parsing of XML resources. The Python Library Reference on Python.org provides documentation that outlines SAX event handlers. This class knows how to respond to SAX events by calling the appropriate DOM interfaces to create a DOM tree from an XML source.

It does this by registering itself with a SAX parser as being able to handle various SAX events, which are triggered as an XML input stream is parsed. The Python library documentation discusses Python XML parsing using Expat, the default parser used by PyXml for parsing XML (see Resources).

The Readers and the three DOM implementations

In PyXml, parsing is handled by a Reader class. PyXml provides its own Reader class, which facilitates the parsing of XML into a PyXml document instance. 4Suite comes packaged with a streamlined implementation of DOM (called pDomlette), which has its own Reader. This Reader can be used instead of PyXml's. However, for the purpose of this tutorial we will use PyXml's Reader implementation to generate DOM nodes.

The Reader class defines the following functions for parsing XML:

• fromStream

• fromString

• fromUri

PyXml's reader is located in the xml.dom.ext.reader.Sax2 module.

The main parsing functions (continued)

The fromStream function returns a PyXml document instance representing the parsed XML stream. This function is called directly by the other two Reader functions, which pass it XML streams from different sources (XML text and URI-addressed resources). All three functions take a first argument, which is the XML source (a Python stream, XML text, or a URI string, depending on the function), and a second, optional argument, ownerDocument: an existing XML document to use for instantiating DOM nodes. This defaults to a new PyXml document instance if not specified.

Example: Creating a document from XML text

Here, open up a Python session and work on an example that illustrates how to create a document from text. You'll use the fromString interface on a PyXml Reader to create an XML document instance (named "doc") from text.

    >>> from xml.dom.ext.reader.Sax2 import Reader
    >>> PyXMLReader = Reader()
    >>> xmlText="""<docRoot>
    ...   <docElement name='1'>
    ...     <docElement name='2'>Text</docElement>
    ...   </docElement>
    ... </docRoot>"""
    >>> doc=PyXMLReader.fromString(xmlText)
    >>> doc
    <XML Document at xxxxxx>
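For readers without a PyXml installation, the same idea can be sketched with the standard library. This example assumes Python 3 and xml.dom.minidom, whose parseString and parse functions play roughly the roles that fromString and fromStream play on PyXml's Reader; it is an analogy, not PyXml's API.

```python
import io
from xml.dom import minidom

XML_TEXT = """<docRoot>
  <docElement name='1'>
    <docElement name='2'>Text</docElement>
  </docElement>
</docRoot>"""

# Parse directly from text.
doc_from_string = minidom.parseString(XML_TEXT)

# Parse from a file-like stream; an in-memory stream stands in for a file.
stream = io.BytesIO(XML_TEXT.encode("utf-8"))
doc_from_stream = minidom.parse(stream)

print(doc_from_string.documentElement.tagName)
print(doc_from_stream.documentElement.tagName)
```

Both calls return a Document node whose documentElement is the docRoot element.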

Section 4. PyXml demo of the Node and Element interfaces

You can use several ways to navigate the tree:

• By the Node interface

• By the Element interface

Navigating the PyXml tree

There are several ways to navigate the tree. This section describes two ways: by the Node interface and by the Element interface. In the code listing below, you'll browse a DOM tree using the childNodes interface (defined specifically for nodes). In particular, you'll look at the document root node and its children.

Please note that all the code listings should be run in a single Python interpreter, since the examples build on one another. In the case below, you'll try out several DOM interfaces on the XML document instance you created earlier.

    >>> doc.childNodes
    <NodeList at xxx: [<DocumentType Node at xxx: Name = 'docRoot' with 0 entities and 0 notations>,
    <Element Node at xxx: Name = 'docRoot' with 0 attributes and 3 children>]>
    >>> doc.documentElement
    <Element Node at xxx: Name = 'docRoot' with 0 attributes and 3 children>
    >>> doc.documentElement.childNodes
    <NodeList at xxx: [<Text Node at xxx: data = '\0xa '>,
    <Element Node at xxx: Name = 'docElement' with 1 attributes and 3 children>,
    <Text Node at xxx: data = '\0xa '>]>

It's important to note the nodes listed under the document element other than the docElement element. Unless directed otherwise (by a DTD), the SAX parser instantiates all text between child elements as text nodes.
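The whitespace text nodes are easy to observe and to skip. Here is a minimal sketch, assuming Python 3 and the standard library's xml.dom.minidom in place of PyXml; the variable names are illustrative.

```python
from xml.dom import minidom, Node

doc = minidom.parseString("""<docRoot>
  <docElement name='1'>
    <docElement name='2'>Text</docElement>
  </docElement>
</docRoot>""")

children = doc.documentElement.childNodes

# Pair each child's name with a flag saying whether it is a text node.
kinds = [(n.nodeName, n.nodeType == Node.TEXT_NODE) for n in children]
print(kinds)

# Keep only element children, skipping the whitespace-only text nodes.
elements = [n for n in children if n.nodeType == Node.ELEMENT_NODE]
print([e.getAttribute("name") for e in elements])
```

Filtering on nodeType is a common way to ignore the line breaks and indentation that the parser preserves between elements.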

Figure 2 is a visual depiction of the DOM tree that this example instantiated from text.

Figure 2. Visual depiction of the DOM tree that this example instantiated from text

Navigating the PyXml tree (continued)

Next, you'll retrieve the first child element of the root element, then the docElement nested inside it, and take a look at that nested element's only child, a text node with a value of "Text".

    >>> nodeList=doc.documentElement.childNodes
    >>> elem1=nodeList[1]
    >>> elem1
    <Element Node at xxx: Name = 'docElement' with 1 attributes and 3 children>
    >>> dummy1,elem2,dummy2=elem1.childNodes
    >>> elem2.childNodes
    <NodeList at xxx: [<Text Node at e6cdbc: data = 'Text'>]>
    >>> textNode=elem2.firstChild
    >>> textNode
    <Text Node at xxx: data = 'Text'>

Navigating attributes

A DOM element (the most common DOM node) has attributes as well as parent and child elements. These can be navigated through the "attributes" attribute (no pun intended) defined by the Node interface. This attribute returns a NamedNodeMap (another structure/class defined in the DOM specification), which is a dictionary of all the attribute nodes belonging to the element in question.

The code listing below demonstrates how the getElementsByTagNameNS function (defined for an Element) can also be used to retrieve elements in a DOM document by name.

The Element.getElementsByTagNameNS interface returns the elements that match a specified namespace URI and local name, searching recursively down the DOM tree.

    >>> doc.documentElement.getElementsByTagNameNS('','docElement')
    <NodeList at xxx: [<Element Node at xxx: Name = 'docElement' with 1 attributes and 3 children>,
    <Element Node at xxx: Name = 'docElement' with 1 attributes and 1 children>]>
    >>> elem1,elem2=doc.documentElement.getElementsByTagNameNS('','docElement')
    >>> elem1
    <Element Node at xxx: Name = 'docElement' with 1 attributes and 3 children>
    >>> elem1.attributes
    <NamedNodeMap at xxx: {('', u'name'): <Attribute Node at xxx: Name = "name", Value = "1">}>
    >>> elem1.getAttributeNS('','name')
    u'1'
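A comparable lookup can be sketched with the standard library. This example assumes Python 3 and xml.dom.minidom rather than PyXml, and it uses the non-namespace variants of the calls (getElementsByTagName, getAttribute) to keep the example short.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<docRoot><docElement name='1'>"
    "<docElement name='2'>Text</docElement>"
    "</docElement></docRoot>"
)

# Recursive search by name: both the outer and the nested element match.
found = doc.documentElement.getElementsByTagName("docElement")
outer, inner = found

# The attributes property is a NamedNodeMap of Attr nodes.
attrs = outer.attributes
name_attr = attrs.getNamedItem("name")
print(attrs.length, name_attr.name, name_attr.value)

# Element-level shortcut that returns the string value directly.
print(inner.getAttribute("name"))
```

The results come back in document order, so the outer element precedes the one nested inside it.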

Section 5. Basic DOM manipulation

Now we demonstrate several ways to manipulate the DOM tree, including:

• Construct a DOM hierarchy using Node interfaces

• Create a new document instance

• Attach new document elements

• Add and modify attributes through an element's NamedNodeMap

• Add and modify attributes using the Element interfaces

Creating a tree of DOM nodes

Here, you'll walk through the construction of a DOM hierarchy using Node interfaces. You'll create a new document instance using the DOMImplementation interface and attach two new document elements to it: fb:officeFloor and fb:name. The latter element will have a text node attached to it.

The DOMImplementation interfaces provide methods for performing operations independent of an existing document instance.

    >>> from xml.dom import implementation
    >>> root=implementation.createDocument('http://foo/bar',None,None)
    >>> root
    <XML Document at xxx>
    >>> docElem=root.createElementNS('http://foo/bar','fb:officeFloor')
    >>> root.appendChild(docElem)
    <Element Node at xxx: Name = 'fb:officeFloor' with 0 attributes and 0 children>
    >>> nameElem=root.createElementNS('http://foo/bar','fb:name')
    >>> nameElem.appendChild(root.createTextNode('2nd floor'))
    <Text Node at xxx: data = '2nd floor'>
    >>> nameElem.firstChild
    <Text Node at xxx: data = '2nd floor'>
    >>> docElem.appendChild(nameElem)
    <Element Node at 16480588: Name = 'fb:name' with 0 attributes and 1 children>

Creating a tree of DOM nodes (continued)

Finally, you attach two elements to fb:officeFloor: fb:department1 and fb:department2. A Python 2.0 list-comprehension technique is used to confirm the final DOM hierarchy.

    >>> docElem.appendChild(root.createElementNS('http://foo/bar','fb:department1'))
    <Element Node at xxx: Name = 'fb:department1' with 0 attributes and 0 children>
    >>> docElem.appendChild(root.createElementNS('http://foo/bar','fb:department2'))
    <Element Node at xxx: Name = 'fb:department2' with 0 attributes and 0 children>
    >>> docElem.childNodes
    <NodeList at xxx: [<Element Node at xxx: Name = 'fb:name' with 0 attributes and 1 children>,
    <Element Node at xxx: Name = 'fb:department1' with 0 attributes and 0 children>,
    <Element Node at xxx: Name = 'fb:department2' with 0 attributes and 0 children>]>
    >>> [elem.nodeName for elem in docElem.childNodes]
    ['fb:name', 'fb:department1', 'fb:department2']
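The same construction can be sketched with the standard library. This example assumes Python 3 and xml.dom.minidom instead of PyXml; unlike the session above, it lets createDocument create the document element in one step.

```python
from xml.dom import minidom

NS = "http://foo/bar"
impl = minidom.getDOMImplementation()

# createDocument builds the document and its document element together.
doc = impl.createDocument(NS, "fb:officeFloor", None)
floor = doc.documentElement

# fb:name carries a single text node.
name = doc.createElementNS(NS, "fb:name")
name.appendChild(doc.createTextNode("2nd floor"))
floor.appendChild(name)

# Two empty department elements follow.
for tag in ("fb:department1", "fb:department2"):
    floor.appendChild(doc.createElementNS(NS, tag))

child_names = [child.nodeName for child in floor.childNodes]
print(child_names)
```

As in the PyXml session, a list comprehension over childNodes is a quick way to confirm the shape of the tree.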

Adding and modifying attributes (NamedNodeMaps)

Attributes can be modified directly through an element's NamedNodeMap. They can also be modified indirectly using several Element interfaces provided specifically for attribute management.

Here's a demonstration using the DOM tree you just created:

    >>> dept1=docElem.getElementsByTagNameNS('http://foo/bar','department1')[0]
    >>> dept1
    <Element Node at xxx: Name = 'fb:department1' with 0 attributes and 0 children>
    >>> elemNNM=dept1.attributes
    >>> elemNNM
    <NamedNodeMap at xxx: {}>

Adding and modifying attributes (NamedNodeMaps) continued

PyXml's NamedNodeMap implementation inherits from UserDict, a Python utility class which provides dictionary-like behavior for arbitrary classes (see Resources). As a result, it provides an additional dictionary interface for manipulating attribute nodes housed in NamedNodeMap instances.

You'll set a "title" attribute on the fb:department1 element by adding it directly to its NamedNodeMap. This is done using the standard dictionary interface which NamedNodeMaps inherit.

    >>> attr=root.createAttributeNS('','title')
    >>> attr.value='Embedded Chips'
    >>> attr
    <Attribute Node at xxx: Name = "title", Value = "Embedded Chips">
    >>> elemNNM[('','title')]=attr

You could also have set the "title" attribute using various other interfaces. For instance, you could have used setNamedItemNS (defined on NamedNodeMap). You could also have set this attribute by directly calling the setAttributeNS function on the element.

Adding and modifying attributes (Element interface)

The "title" attribute you just set on the element could also have been set using the Element interfaces provided for this purpose. For most attribute manipulation, the Element interfaces suffice. In particular, the setAttributeNS interface has the advantage (over NamedNodeMaps) of not having to instantiate an attribute node. An attribute can be set by simply specifying its name and a text value.

When more advanced attribute manipulation is needed (for example, with entity references), it may be more advantageous to work with the attribute nodes directly.

    >>> dept2=docElem.getElementsByTagNameNS('http://foo/bar','department2')[0]
    >>> dept2.setAttributeNS('','title','Motherboards')
    >>> dept2.attributes
    <NamedNodeMap at xxx: {('', 'title'): <Attribute Node at 14450844: Name = "title", Value = "Motherboards">}>

Adding and modifying attributes (Element interface) continued

The Element interface is flexible enough to allow retrieval of attributes as attribute node instances or simply as string values. Typically, users are only interested in the text value associated with an attribute node, not the node itself.

    >>> dept2.getAttributeNS('','title')
    'Motherboards'
    >>> dept2.getAttributeNodeNS('','title')
    <Attribute Node at xxx: Name = "title", Value = "Motherboards">
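The two routes to setting an attribute can be compared side by side. The following is a minimal sketch, assuming Python 3 and the standard library's xml.dom.minidom rather than PyXml; it uses the non-namespace methods, and the element names are invented for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<floor><department/></floor>")
dept = doc.documentElement.firstChild

# Route 1: build an Attr node and store it through the NamedNodeMap.
attr = doc.createAttribute("title")
attr.value = "Embedded Chips"
dept.attributes.setNamedItem(attr)
first_value = dept.getAttribute("title")

# Route 2: let the element create or update the attribute from a string.
dept.setAttribute("title", "Motherboards")
second_value = dept.getAttribute("title")

# The attribute is also available as a node when that is needed.
node = dept.getAttributeNode("title")
print(first_value, second_value, node.name, node.value)
```

The second route needs no explicit attribute node, which is why it is usually the more convenient choice.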




Touching up the DOM tree

The DOM specification provides a variety of interfaces that allow virtually any kind of tree manipulation one could imagine. We'll continue with the example you have been working on.

If you wanted to replace the fb:department1 and fb:department2 elements with two fb:department elements, you would first create a new fb:department element, setting its attribute appropriately, and clone it.

    >>> newdept1=root.createElementNS('http://foo/bar','fb:department')
    >>> newdept1.setAttributeNS('','title','Embedded Chips')
    >>> newdept2=newdept1.cloneNode(0)
    >>> newdept2
    <Element Node at xxx: Name = 'fb:department' with 1 attributes and 0 children>
    >>> newdept2.setAttributeNS('','title','Motherboards')

Touching up the DOM tree (continued)

Now that you have two fb:department elements, you will use the replaceChild interface to replace the old elements with these newly generated counterparts.

    >>> docElem.replaceChild(newdept1,dept1)
    <Element Node at xxx: Name = 'fb:department1' with 1 attributes and 0 children>
    >>> docElem.replaceChild(newdept2,dept2)
    <Element Node at xxx: Name = 'fb:department2' with 1 attributes and 0 children>
    >>> [elem.nodeName for elem in docElem.childNodes]
    ['fb:name', 'fb:department', 'fb:department']

Section 6. Grab bag of advanced DOM techniques

Experiment with more ways to manipulate the DOM tree:

• Clone the initial staff tree to add a second department

• Do a deep clone of the staff node and its descendant elements

• Clone nodes across XML documents with the Document interface

• Efficiently prune specific elements from an XML tree (for homogeneous XML documents)

Cloning deep nodes




Now you'll go further with the example and create a staff hierarchy for both fb:department elements. Instead of repeating the work, you'll clone the initial staff tree in order to populate the DOM tree with the second department.

    >>> staff=root.createElementNS('http://foo/bar','fb:staff')
    >>> newdept1.appendChild(staff)
    <Element Node at xxx: Name = 'fb:staff' with 0 attributes and 0 children>
    >>> staff.appendChild(root.createElementNS('http://foo/bar','fb:marketing'))
    <Element Node at xxx: Name = 'fb:marketing' with 0 attributes and 0 children>
    >>> staff.appendChild(root.createElementNS('http://foo/bar','fb:executive'))
    <Element Node at xxx: Name = 'fb:executive' with 0 attributes and 0 children>

The first department now has a staff hierarchy that consists of an executive and a marketing branch.

Cloning deep nodes (continued)

Finally, you will do a deep clone of the staff node. This will also clone the fb:marketing and fb:executive descendant elements.

    >>> staff2=staff.cloneNode(1)
    >>> newdept2.appendChild(staff2)
    <Element Node at xxx: Name = 'fb:staff' with 0 attributes and 2 children>
    >>> [elem.nodeName for elem in newdept2.getElementsByTagNameNS('*','*')]
    ['fb:staff', 'fb:marketing', 'fb:executive']

At the end, another convenient use of list comprehension (see the Python tutorial listed in Resources for more information on list comprehension) allows us to confirm the structure of the cloned hierarchy.
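The difference between a shallow and a deep clone can be seen in a short sketch. It assumes Python 3 and the standard library's xml.dom.minidom rather than PyXml, and it uses unprefixed element names for brevity.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<department title='Embedded Chips'>"
    "<staff><marketing/><executive/></staff>"
    "</department>"
)
dept = doc.documentElement

shallow = dept.cloneNode(False)  # the element and its attributes only
deep = dept.cloneNode(True)      # the element, attributes, and all descendants

# Clones are independent copies: changing one leaves the original alone.
deep.setAttribute("title", "Motherboards")

print(shallow.getAttribute("title"), len(shallow.childNodes))
print([e.nodeName for e in deep.getElementsByTagName("*")])
print(dept.getAttribute("title"), deep.parentNode)
```

A freshly cloned node has no parent until it is attached somewhere with appendChild, insertBefore, or replaceChild.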

Document's importNode interface

The Document interface provides a way to clone nodes across XML documents. This could be helpful if a user wanted to replicate a remote XML resource. The code listing below shows how this could be done. Assume that remoteRootNode is an instance of a DOM node persisted at a remote location, so that it belongs to a different document. The importNode interface creates a copy of that node owned by root, and the copy can then be inserted into the tree.

    >>> docElem = root.documentElement
    >>> docElem
    <Element Node at xxx: Name = 'fb:officeFloor' with 0 attributes and 3 children>
    >>> importedNode = root.importNode(remoteRootNode, 1)
    >>> root.replaceChild(importedNode, docElem)
    <Element Node at xxx: Name = 'fb:officeFloor' with 0 attributes and 3 children>
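A self-contained version of the idea, with a second in-memory document standing in for the remote resource, might look like the sketch below. It assumes Python 3 and the standard library's xml.dom.minidom rather than PyXml; the documents and names are invented for illustration.

```python
from xml.dom import minidom

local = minidom.parseString("<officeFloor><old/></officeFloor>")
remote = minidom.parseString(
    "<officeFloor><staff><marketing/></staff></officeFloor>"
)
remote_root = remote.documentElement

# importNode copies the node (deeply, here) into the target document;
# the source document keeps its original.
imported = local.importNode(remote_root, True)
local.replaceChild(imported, local.documentElement)

print(imported.ownerDocument is local)
print(remote.documentElement is remote_root)
print(local.documentElement.toxml())
```

Because importNode copies rather than moves, later changes to the local tree do not affect the source document.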

Pruning specific elements

Figure 3 is a diagram of the DOM tree you've been working with.




One trick that will probably come in handy for homogeneous XML documents is efficiently pruning specific elements from an XML tree. This is easy to do using the getElementsByTagNameNS interface (which, as indicated earlier, returns a recursive list of descendant elements identified by name).

You will remove all executive staff elements from the floor. You will iterate over these elements using the above interface, pruning each with its parent's removeChild interface.

    >>> for elem in root.getElementsByTagNameNS('http://foo/bar','executive'):
    ...     elem.parentNode.removeChild(elem)
    ...
    <Element Node at 14961836: Name = 'fb:executive' with 0 attributes and 0 children>
    <Element Node at 14983404: Name = 'fb:executive' with 0 attributes and 0 children>
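The pruning loop can be sketched with the standard library as follows. This example assumes Python 3 and xml.dom.minidom rather than PyXml. Because the DOM specification describes NodeList objects as live, the sketch copies the matches into a plain list before removing anything; that copy is a defensive habit, not something the tutorial's session requires.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<floor>"
    "<department><staff><marketing/><executive/></staff></department>"
    "<department><staff><marketing/><executive/></staff></department>"
    "</floor>"
)

# Copy the matches first, so that removing nodes cannot disturb
# the collection being iterated over.
for elem in list(doc.getElementsByTagName("executive")):
    elem.parentNode.removeChild(elem)

remaining = [e.nodeName for e in doc.getElementsByTagName("*")]
print(remaining)
```

Each element is removed through its own parent, so the loop works no matter how deep in the tree the matches sit.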

Figure 3. Visual depiction of the DOM tree that you've worked with

Section 7. Printing, or writing DOM back out to file

xml.dom.ext.PrettyPrint: The heart of PyXml print formatting

Because simple XML visualization and editing tools are scarce, XML documents are most likely to be browsed as text. This heightens the importance of printing XML in a digestible format.

To look at the default serialization of the XML document you've been working with, you'll use the PrettyPrint function (included with PyXml). PrettyPrint takes a DOM node and writes it out as XML.

    >>> from xml.dom.ext import PrettyPrint
    >>> PrettyPrint(root)
    <fb:officeFloor xmlns:fb='http://foo/bar'>
      <fb:name>2nd floor</fb:name>
      <fb:department title='Embedded Chips'>
        <fb:staff>
          <fb:marketing/>
        </fb:staff>
      </fb:department>
      <fb:department title='Motherboards'>
        <fb:staff>
          <fb:marketing/>
        </fb:staff>
      </fb:department>
    </fb:officeFloor>

You will notice that, by default, the elements are indented according to depth. This is typically how XML is presented in documentation, and the indentation can be adjusted in PyXml.

Modifying print indentation

The indentation character used can be modified, effectively letting you set the indentation width.

    >>> PrettyPrint(root, indent='\t')
    <fb:officeFloor xmlns:fb='http://foo/bar'>
            <fb:name>2nd floor</fb:name>
            <fb:department title='Embedded Chips'>
                    <fb:staff>
                            <fb:marketing/>
                    </fb:staff>
            </fb:department>
            <fb:department title='Motherboards'>
                    <fb:staff>
                            <fb:marketing/>
                    </fb:staff>
            </fb:department>
    </fb:officeFloor>

The "\t" sets the indentation character to a hard tab character, which succeeds in exaggerating the indentation.

Printing to streams (files)

The PrettyPrint function serializes the document into a specified stream. The file stream is probably the most common stream to write XML to, and it is simple enough to do.




    >>> f=open('test.xml','w')
    >>> PrettyPrint(root,f)
    >>> f.close()
    >>> open('test.xml').read()
    "\012\012<fb:officeFloor xmlns:fb='http://foo/bar'>\012<fb:name>2nd floor</fb:name>\012
    <fb:department title='Embedded Chips'>\012 <fb:staff>\012<fb:marketing/>\012 </fb:staff>\012 </fb:department>\012
    <fb:department title='Motherboards'>\012 <fb:staff>\012 <fb:marketing/>\012</fb:staff>\012 </fb:department>\012</fb:officeFloor>\012"

Printing to streams (strings)

Since Python provides a means to manipulate strings as streams, serializing a DOM tree as text is equally simple.

For example, let's say you want to use the string.replace function to change the namespace of the fb elements to "http://FuBu".

    >>> from cStringIO import StringIO
    >>> strIO=StringIO()
    >>> PrettyPrint(root,strIO)
    >>> from string import replace
    >>> replace(strIO.getvalue(),'foo/bar','FuBu')
    "\012\012<fb:officeFloor xmlns:fb='http://FuBu'>\012<fb:name>2nd floor</fb:name>\012
    <fb:department title='Embedded Chips'>\012 <fb:staff>\012 <fb:marketing/>\012</fb:staff>\012 </fb:department>\012
    <fb:department title='Motherboards'>\012 <fb:staff>\012 <fb:marketing/>\012</fb:staff>\012 </fb:department>\012</fb:officeFloor>\012"

Printing to streams (Standard Out and others)

By default, PrettyPrint serializes the specified XML document to the standard output stream (sys.stdout). However, any stream that provides Python's file-like interface (in particular, a write method) can be written to.
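Serializing to a stream can also be sketched with the standard library. This example assumes Python 3 and xml.dom.minidom, whose writexml and toprettyxml methods are used here as rough counterparts to PrettyPrint; they are not PyXml's API, and their output format differs in detail.

```python
import io
from xml.dom import minidom

doc = minidom.parseString(
    "<officeFloor><name>2nd floor</name>"
    "<department title='Motherboards'/></officeFloor>"
)

# Serialize to an in-memory text stream, one tab per nesting level.
buffer = io.StringIO()
doc.writexml(buffer, indent="", addindent="\t", newl="\n")
text = buffer.getvalue()

# toprettyxml returns the serialization as a string instead.
pretty = doc.toprettyxml(indent="  ")
print(pretty)

# Once the document is text, ordinary string methods apply.
renamed = text.replace("Motherboards", "Mainboards")
```

Any object with a write method can take the place of the StringIO buffer, including an open file or sys.stdout.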

Section 8. The traversal interfaces

You can traverse the DOM tree through:

• Node iterators

• Tree walkers and filters




Node iterator

PyXml provides implementations of the traversal interfaces of DOM. This allows controlled traversal of a DOM tree without having to resort to complex code. For example, let's print a node iteration of the document elements.

    >>> from xml.dom.NodeFilter import NodeFilter
    >>> iterator=root.createNodeIterator(root,NodeFilter.SHOW_ELEMENT,None,0)
    >>> elem=iterator.nextNode()
    >>> while elem:
    ...     print elem.nodeName
    ...     elem=iterator.nextNode()
    ...
    fb:officeFloor
    fb:name
    fb:department
    fb:staff
    fb:marketing
    fb:department
    fb:staff
    fb:marketing

Node iterator (continued)

Upon creation, iterators are associated with a NodeFilter, which determines which nodes to ignore. The NodeFilter class provides some default filters that allow you to filter based on node type. In the previous case, it specified that only elements are iterated over.

For demonstration purposes, iterate over the entire set of nodes:

    >>> iterator=root.createNodeIterator(root,NodeFilter.SHOW_ALL,None,0)
    >>> node=iterator.nextNode()
    >>> while node:
    ...     print node.nodeName
    ...     node=iterator.nextNode()
    ...
    #document
    fb:officeFloor
    fb:name
    #text
    fb:department
    fb:staff
    fb:marketing
    fb:department
    fb:staff
    fb:marketing

You'll notice the two new entries in the output, representing the document node and the text node under fb:name.
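The standard library's xml.dom.minidom does not provide createNodeIterator, so the sketch below hand-rolls a comparable document-order traversal as a Python 3 generator. The function name iterate is hypothetical; the SHOW_ALL and SHOW_ELEMENT values follow the whatToShow bitmask convention of DOM's NodeFilter, where each node type owns the bit at position nodeType - 1.

```python
from xml.dom import minidom

SHOW_ALL = 0xFFFFFFFF
SHOW_ELEMENT = 0x00000001  # bit (nodeType - 1) for element nodes


def iterate(node, what_to_show=SHOW_ALL):
    """Yield node and its descendants in document order, filtered by type."""
    if what_to_show & (1 << (node.nodeType - 1)):
        yield node
    for child in node.childNodes:
        yield from iterate(child, what_to_show)


doc = minidom.parseString(
    "<officeFloor><name>2nd floor</name>"
    "<department><staff><marketing/></staff></department></officeFloor>"
)

element_names = [n.nodeName for n in iterate(doc, SHOW_ELEMENT)]
all_names = [n.nodeName for n in iterate(doc)]
print(element_names)
print(all_names)
```

As in the PyXml session, showing all node types adds the document node and the text node to the sequence.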

Tree walkers and filters




The DOM specification provides a TreeWalker interface, which (as the name suggests) provides a means to traverse DOM documents as trees. PyXml contains an implementation for this interface as well.

There may be situations where the default filters are not appropriate for a developer's needs. DOM allows a developer to subclass the NodeFilter interface in order to specify more directly which nodes to accept or reject.

Consider the need to view the officeFloor hierarchy at a level no lower than department nodes. The following class would need to be defined:

    >>> class ignoreStaff(NodeFilter):
    ...     def acceptNode(self,node):
    ...         if node.localName=='staff':
    ...             return self.FILTER_REJECT
    ...         elif node.parentNode and node.parentNode.localName=='staff':
    ...             return self.FILTER_REJECT
    ...         else:
    ...             return self.FILTER_ACCEPT
    ...

This filter will reject any staff elements or elements directly descended from them.

Tree walkers and filters (continued)

Now you will instantiate the new iterator (using the filter) and print the iteration:

    >>> iterator=root.createNodeIterator(root,NodeFilter.SHOW_ELEMENT,ignoreStaff(),0)
    >>> node=iterator.nextNode()
    >>> while node:
    ...     print node.nodeName,[ '%s:%s'%(attrEntry[0][-1],attrEntry[-1].value)
    ...                           for attrEntry in node.attributes.items()]
    ...     node=iterator.nextNode()
    ...

You can use another concise list-comprehension trick to print the attributes of each node.

    fb:officeFloor []
    fb:name []
    fb:department ['title:Embedded Chips']
    fb:department ['title:Motherboards']
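The filtering logic can be sketched without PyXml as well. The example below assumes Python 3 and the standard library's xml.dom.minidom, and hand-rolls a flat, iterator-style traversal; the names ignore_staff and iterate_elements are hypothetical. In the DOM Traversal specification a NodeIterator still visits the children of a rejected node, which is why the filter also has to reject the direct children of staff elements.

```python
from xml.dom import minidom, Node

FILTER_ACCEPT, FILTER_REJECT = 1, 2  # values used by DOM's NodeFilter


def ignore_staff(node):
    """Reject staff elements and their direct children."""
    if node.nodeName == "staff":
        return FILTER_REJECT
    parent = node.parentNode
    if parent is not None and parent.nodeName == "staff":
        return FILTER_REJECT
    return FILTER_ACCEPT


def iterate_elements(node, accept):
    """Yield accepted elements in document order (flat, iterator-style)."""
    if node.nodeType == Node.ELEMENT_NODE and accept(node) == FILTER_ACCEPT:
        yield node
    for child in node.childNodes:
        yield from iterate_elements(child, accept)


doc = minidom.parseString(
    "<officeFloor><name>2nd floor</name>"
    "<department title='Embedded Chips'><staff><marketing/></staff></department>"
    "<department title='Motherboards'><staff><marketing/></staff></department>"
    "</officeFloor>"
)

report = [
    (n.nodeName, ["%s:%s" % (a.name, a.value) for a in n.attributes.values()])
    for n in iterate_elements(doc, ignore_staff)
]
for name, attrs in report:
    print(name, attrs)
```

A TreeWalker, by contrast, skips the whole subtree of a rejected node, so the parent check would be unnecessary there.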

Section 9. The Event interface




PyXml's MutationEvent

PyXml implements a small portion of the DOM 2 Event interface. It implements MutationEvent, under which there are seven event types:

• DOMSubtreeModified

• DOMNodeInserted

• DOMNodeRemoved

• DOMNodeRemovedFromDocument

• DOMNodeInsertedIntoDocument

• DOMAttrModified

• DOMCharacterDataModified

For an in-depth look at the specifics of this event and our implementation of the Event interfaces, take a look at the Event.py module in xml.dom.

Defining a listener

PyXml automatically fires MutationEvents that correspond with changes to the DOM tree. All you need to do to capture these events is to define an event listener class which handles the events, and register an instance of this class with any node in the tree you wish to monitor.

You can define an event listener which prints information about each event it receives and handles the following MutationEvents:

• DOMSubtreeModified

• DOMNodeInserted

• DOMAttrModified

• DOMCharacterDataModified

    # The Event module supplies the MutationEvent constants used below
    from xml.dom import Event

    class DemoEventListener:
        def __init__(self):
            pass
        def handleEvent(self,evt):
            print evt.type, " event, has been fired!"
            if evt.type=='DOMSubtreeModified':
                print "%s's sub tree has been modified!"%(evt.target.nodeName)
            elif evt.type=='DOMNodeInserted':
                print "%s was inserted into a DOM tree"%(evt.target.nodeName)
            elif evt.type=='DOMAttrModified':
                if evt.attrChange==Event.MutationEvent.REMOVAL:
                    print "attribute %s, was removed from %s"%(evt.attrName,evt.relatedNode.ownerElement.nodeName)
                elif evt.attrChange==Event.MutationEvent.ADDITION:
                    print "attribute %s set to %s"%(evt.attrName,evt.newValue)
                else:
                    print "%s's %s attribute has been modified from %s to %s"%(evt.target.nodeName,evt.attrName,evt.prevValue,evt.newValue)
            elif evt.type=='DOMCharacterDataModified':
                print "Character Data changed from %s to %s"%(evt.prevValue,evt.newValue)

Registering the listener

Now that you have defined an appropriate listener, you need to register it at everynode as being able to handle the four MutationEvents.

>>> iterator=root.createNodeIterator(root,NodeFilter.SHOW_ALL,None,0)
>>> node=iterator.nextNode()
>>> while node:
...     node.addEventListener('DOMSubtreeModified',DemoEventListener(),0)
...     node.addEventListener('DOMNodeInserted',DemoEventListener(),0)
...     node.addEventListener('DOMAttrModified',DemoEventListener(),0)
...     node.addEventListener('DOMCharacterDataModified',DemoEventListener(),0)
...     node=iterator.nextNode()
...

You can use the NodeIterator interface to walk to every node in the DOM tree and register an instance of DemoEventListener as a listener for the four MutationEvents.
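The same "visit every node" idea can be sketched with the standard library's xml.dom.minidom, which has no createNodeIterator. The generator iter_nodes below is our own helper, not a PyXml or minidom API, and the sample document is made up for illustration. Like a NodeIterator created with SHOW_ALL, it starts with the root and then yields descendants in document order; attributes are not visited because they are not children.

```python
from xml.dom import minidom


def iter_nodes(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from iter_nodes(child)


# Hypothetical sample document, not the tutorial's data.
doc = minidom.parseString(
    "<company><department title='Customer Service'>"
    "<employee>Ann</employee></department></company>")

names = [n.nodeName for n in iter_nodes(doc)]
print(names)
```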

Setting up for DOM mutations

You now need to mutate the DOM tree in order for PyXml to fire MutationEvents towards the listeners you registered.

We'll clone the customer service department element (this doesn't fire an event) in order to create a new department: Research and Development.

>>> custService=root.getElementsByTagNameNS('http://foo/bar','department')[0]
>>> custService
<Element Node at xxx: Name = 'fb:department' with 1 attributes and 1 children>
>>> RandD=custService.cloneNode(1)
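To see what a deep clone gives you, here is a small Python 3 sketch using xml.dom.minidom as a stand-in for PyXml. The sample XML is hypothetical; only the namespace URI and element names are borrowed from the session above. It shows that the copy has no parent until you insert it, and that changing the copy leaves the original untouched.

```python
from xml.dom import minidom

# Hypothetical sample data using the same namespace URI as the tutorial.
doc = minidom.parseString(
    "<fb:officeFloor xmlns:fb='http://foo/bar'>"
    "<fb:department title='Customer Service'><fb:employee/></fb:department>"
    "</fb:officeFloor>")

original = doc.getElementsByTagNameNS("http://foo/bar", "department")[0]
copy = original.cloneNode(True)   # True (or 1) requests a deep copy

# The copy is detached until it is inserted somewhere in the tree.
print(copy.parentNode)

# Changing the copy does not affect the node it was cloned from.
copy.setAttribute("title", "Research and Development")
print(original.getAttribute("title"))
print(len(copy.childNodes))
```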

Setting up for DOM mutations (continued)

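Before you change the title of the cloned department to "Research and Development," you'll need to register listeners for the four MutationEvents on the clone, because a cloned node does not carry over the listeners attached to its source. That way you get printouts from the listeners that receive events.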

>>> RandD.addEventListener('DOMSubtreeModified',DemoEventListener(),0)
>>> RandD.addEventListener('DOMNodeInserted',DemoEventListener(),0)
>>> RandD.addEventListener('DOMAttrModified',DemoEventListener(),0)
>>> RandD.addEventListener('DOMCharacterDataModified',DemoEventListener(),0)




Now, when we set the title we should get a response from the listener registered for the DOMAttrModified event.

Firing MutationEvents for the listener

>>> RandD.setAttributeNS('','title','Research and Development')
DOMAttrModified event, has been fired!
fb:department's title attribute has been modified from Customer Service to Research and Development
DOMSubtreeModified event, has been fired!
fb:department's sub tree has been modified!

When you set the "title" attribute, two events are fired: DOMAttrModified and DOMSubtreeModified. The listener handles the events and prints out an informative description about the specifics of the modification.

Firing MutationEvents for the listener (continued)

Finally, you'll attach the new node into the tree, firing off events where appropriate.

>>> root.documentElement.insertBefore(RandD,custService)
DOMNodeInserted event, has been fired!
fb:department was inserted into a DOM tree
DOMNodeInserted event, has been fired!
fb:department was inserted into a DOM tree
DOMNodeInserted event, has been fired!
fb:department was inserted into a DOM tree
DOMSubtreeModified event, has been fired!
fb:officeFloor's sub tree has been modified!
DOMSubtreeModified event, has been fired!
fb:officeFloor's sub tree has been modified!
<Element Node at xxx: Name = 'fb:department' with 1 attributes and 3 children>

This causes printouts about two events: DOMNodeInserted and DOMSubtreeModified. You will notice the messages are replicated. This is because MutationEvents bubble up the DOM tree and are received by multiple event targets.
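The replication is easier to see in a stripped-down model. The Python 3 sketch below is not PyXml's implementation: the Node class, add_listener, and dispatch are hypothetical names, and the model leaves out the capture phase and stopPropagation. It only illustrates how one event, dispatched at a target, is delivered again at each ancestor that has a listener for that event type.

```python
class Node:
    """Minimal stand-in for a DOM node that supports bubbling events."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.listeners = {}          # event type -> list of callables

    def add_listener(self, event_type, callback):
        self.listeners.setdefault(event_type, []).append(callback)

    def dispatch(self, event_type):
        """Deliver the event at the target, then at each ancestor in turn."""
        current = self
        while current is not None:
            for callback in current.listeners.get(event_type, []):
                callback(event_type, target=self, current=current)
            current = current.parent


log = []


def record(event_type, target, current):
    # target is where the change happened; current is who is being notified
    log.append((event_type, target.name, current.name))


floor = Node("fb:officeFloor")
dept = Node("fb:department", parent=floor)
employee = Node("fb:employee", parent=dept)
for node in (floor, dept, employee):
    node.add_listener("DOMNodeInserted", record)

employee.dispatch("DOMNodeInserted")
for entry in log:
    print(entry)
```

One dispatch produces three log entries with the same target, which mirrors the repeated messages in the session above.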

Section 10. Introduction to HTML DOM

DOM HTML and DOM core

DOM HTML essentially introduces two new interfaces:

• HTMLDocument: Inherits from the Document interface and defines operations specific to an HTML document.

• HTMLElement: Inherits from the Element interface. It provides methods for retrieving and modifying attributes specific to HTML elements.

DOM HTML also provides mechanisms for manipulating the style of a document via CSS. It is important to note that DOM HTML is specifically meant for HTML 4.0 documents and not for XHTML 1.0 documents.

Creating an HTML page

DOM HTML defines a specific interface for each element in the set of HTML 4.0 tags. These interfaces inherit from the HTMLElement interface and mainly add mechanisms for conveniently setting the attributes specific to the represented HTML 4.0 tags.

The best way to understand the general theme of the DOM HTML interface is to walk through an example of creating an HTML document.

>>> from xml.dom import implementation
>>> d = implementation.createHTMLDocument('Test Page')
>>> d
<HTML Document at xxx>

To instantiate an HTML document, use the DOMImplementation.createHTMLDocument method. This function takes a string, which will be used as the title of the HTML document.
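The standard library's xml.dom.minidom offers no createHTMLDocument, but a rough equivalent can be sketched with the core DOM calls it does provide. The helper name create_html_document is hypothetical, and the result is a generic XML document with HTML-like element names rather than a true HTMLDocument, so an empty BODY is serialized in XML style.

```python
from xml.dom.minidom import getDOMImplementation


def create_html_document(title):
    """Hypothetical helper: build an HTML/HEAD/TITLE/BODY skeleton."""
    doc = getDOMImplementation().createDocument(None, "HTML", None)
    head = doc.createElement("HEAD")
    title_el = doc.createElement("TITLE")
    title_el.appendChild(doc.createTextNode(title))
    head.appendChild(title_el)
    doc.documentElement.appendChild(head)
    doc.documentElement.appendChild(doc.createElement("BODY"))
    return doc


d = create_html_document("Test Page")
print(d.documentElement.toxml())
```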

Creating an HTML page: Printing the HTML

As you progress through the example, you'll use PrettyPrint to observe an HTML serialization of the HTML document. PyXml's PrettyPrinter is capable of printing XML and HTML documents.

>>> from xml.dom.ext import PrettyPrint
>>> PrettyPrint(d)
<HTML>
  <HEAD>
    <TITLE>Test Page</TITLE>
  </HEAD>
  <BODY></BODY>
</HTML>

Creating an HTML page: Setting up the BODY element




Now you'll create a body element to insert into the HTML document. You will use the attributes specific to a body element to set the background and text color.

>>> b = d.createElement('Body')
>>> b
<Element Node at xxx: Name = 'BODY' with 0 attributes and 0 children>
>>> b.text="#000000"
>>> b.bgColor="#FFFFFF"
>>> b
<Element Node at xxx: Name = 'BODY' with 2 attributes and 0 children>

Notice how setting the text and bgColor properties automatically updates the underlying attribute hierarchy.

>>> [(attr.name,attr.value) for attr in b.attributes]
[('BGCOLOR', '#FFFFFF'), ('TEXT', '#000000')]
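One way to picture how a property can stay in sync with an attribute is the Python 3 sketch below. It is an illustration only and not how PyXml implements its HTML element classes: BodyView and reflected are hypothetical names, and xml.dom.minidom stands in for the DOM. Each property simply forwards reads and writes to getAttribute and setAttribute on the wrapped element.

```python
from xml.dom.minidom import getDOMImplementation


def reflected(attr_name):
    """Build a property that reads and writes the given element attribute."""
    def getter(self):
        return self.element.getAttribute(attr_name)

    def setter(self, value):
        self.element.setAttribute(attr_name, value)

    return property(getter, setter)


class BodyView:
    """Hypothetical wrapper exposing BODY attributes as Python properties."""
    text = reflected("TEXT")
    bgColor = reflected("BGCOLOR")

    def __init__(self, element):
        self.element = element


doc = getDOMImplementation().createDocument(None, "HTML", None)
b = BodyView(doc.createElement("BODY"))
b.text = "#000000"
b.bgColor = "#FFFFFF"
print(sorted(b.element.attributes.items()))
```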

Let's view the HTML serialization of our document.

>>> d.body=b
>>> PrettyPrint(d)
<HTML>
  <HEAD>
    <TITLE>Test Page</TITLE>
  </HEAD>
  <BODY BGCOLOR='#FFFFFF' TEXT='#000000'></BODY>
</HTML>

Creating an HTML page: Adding a paragraph

Now you'll insert a paragraph element into the body. Notice that the paragraph content is set up using a conventional TextNode.

>>> p1=d.createElement('P')
>>> p1.align="center"
>>> p1.appendChild(d.createTextNode("This is a centered paragraph"))
<Text Node at xxx: data = 'This is a centered p...'>
>>> b.appendChild(p1)
<Element Node at xxx: Name = 'P' with 1 attributes and 1 children>
>>> PrettyPrint(d)

<HTML>
  <HEAD>
    <TITLE>Test Page</TITLE>
  </HEAD>
  <BODY BGCOLOR='#FFFFFF' TEXT='#000000'>
    <P ALIGN='center'>This is a centered paragraph</P>
  </BODY>
</HTML>




Creating an HTML page: Specifying fonts and links

Now, follow the example through for two more HTML 4.0 elements:

• FONT

• A

>>> f=d.createElement('FONT')
>>> f.size="+2"
>>> f.appendChild(p1.firstChild)
<Text Node at 11c921c: data = 'This is a centered p...'>
>>> p1.appendChild(f)
<Element Node at 18665484: Name = 'FONT' with 1 attributes and 1 children>
>>> a = d.createElement('A')
>>> a.href="http://4suite.org"
>>> a.target="_blank"
>>> a.appendChild(d.createTextNode("Home of 4Suite"))
<Text Node at 120090c: data = 'Home of 4Suite'>
>>> d.body.appendChild(a)
<Element Node at 18796636: Name = 'A' with 2 attributes and 1 children>
>>> PrettyPrint(d)

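<HTML>
  <HEAD>
    <TITLE>Test Page</TITLE>
  </HEAD>
  <BODY BGCOLOR='#FFFFFF' TEXT='#000000'>
    <P ALIGN='center'>
      <FONT SIZE='+2'>This is a centered paragraph</FONT>
    </P>
    <A HREF='http://4suite.org' TARGET='_blank'>Home of 4Suite</A>
  </BODY>
</HTML>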
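A detail worth noticing in the FONT step is that appending a node which already has a parent moves it rather than copying it. The Python 3 sketch below shows the same behavior with xml.dom.minidom standing in for PyXml; attributes are set with setAttribute because minidom has no HTML-specific properties such as size, and minidom serializes attribute values with double quotes.

```python
from xml.dom.minidom import getDOMImplementation

doc = getDOMImplementation().createDocument(None, "HTML", None)
p = doc.createElement("P")
p.appendChild(doc.createTextNode("This is a centered paragraph"))

font = doc.createElement("FONT")
font.setAttribute("SIZE", "+2")

# Appending a node that already has a parent moves it; it is not copied.
font.appendChild(p.firstChild)
children_left_in_p = len(p.childNodes)   # the text node now lives under FONT

p.appendChild(font)
print(children_left_in_p)
print(p.toxml())
```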

Creating an HTML page: Creating an unordered list

Now use a for loop to automate the process of creating an unordered list:

>>> ul=d.createElement('UL')
>>> for index in range(1,5):
...     li=d.createElement('LI')
...     li.appendChild(d.createTextNode("Line item %s"%(index)))
...     ul.appendChild(li)
...
<Text Node at xxx: data = 'Line item 1'>
<Element Node at xxx: Name = 'LI' with 0 attributes and 1 children>
<Text Node at xxx: data = 'Line item 2'>
<Element Node at xxx: Name = 'LI' with 0 attributes and 1 children>
<Text Node at xxx: data = 'Line item 3'>
<Element Node at xxx: Name = 'LI' with 0 attributes and 1 children>
<Text Node at xxx: data = 'Line item 4'>
<Element Node at xxx: Name = 'LI' with 0 attributes and 1 children>
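The node descriptions echoed above are just the return values of appendChild as shown by the interactive interpreter; in a script they are silently discarded. As a sketch of how the loop might be packaged for reuse, the Python 3 helper below builds a list element from any sequence of strings. The name make_list is hypothetical, and xml.dom.minidom is used in place of PyXml.

```python
from xml.dom.minidom import getDOMImplementation


def make_list(doc, items, tag="UL"):
    """Hypothetical helper: wrap each string in an LI under one list element."""
    lst = doc.createElement(tag)
    for text in items:
        li = doc.createElement("LI")
        li.appendChild(doc.createTextNode(text))
        lst.appendChild(li)      # the return value is discarded in a script
    return lst


doc = getDOMImplementation().createDocument(None, "HTML", None)
ul = make_list(doc, ["Line item %s" % i for i in range(1, 5)])
print(ul.toxml())
```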

Now you can take a look at the final HTML source of your document:




>>> d.body.appendChild(ul)
<Element Node at xxx: Name = 'UL' with 0 attributes and 4 children>
>>> PrettyPrint(d)

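<HTML>
  <HEAD>
    <TITLE>Test Page</TITLE>
  </HEAD>
  <BODY BGCOLOR='#FFFFFF' TEXT='#000000'>
    <P ALIGN='center'>
      <FONT SIZE='+2'>This is a centered paragraph</FONT>
    </P>
    <A HREF='http://4suite.org' TARGET='_blank'>Home of 4Suite</A>
    <UL>
      <LI>Line item 1</LI>
      <LI>Line item 2</LI>
      <LI>Line item 3</LI>
      <LI>Line item 4</LI>
    </UL>
  </BODY>
</HTML>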

Section 11. Conclusion

Conclusion

This tutorial was meant to give you enough initial coverage of the PyXml implementation to begin serious development with PyXml. For the next step, we would suggest having the DOM Level 2 Specification in one hand and an open session with a Python interpreter in the other. Then, try out the various interfaces this tutorial didn't cover. The best method for learning DOM and/or a DOM implementation is through hands-on programming.




Resources

Learn

• Learn more about PyXml and the philosophy behind 4Suite at Fourthought's home page (http://fourthought.com).

• Get the latest news, downloads, and information for the Python community at Python.org (http://www.python.org).

• Learn more about the SAX event handler at the Python Library Reference (http://www.python.org/doc/current/lib/module-xml.sax.handler.html).

• Go to Python.org's library (http://www.python.org/doc/current/lib/module-xml.parsers.expat.html) for a complete discussion of Python XML parsing using Expat, PyXml's default XML parser. Python.org's library also provides documentation on SAX event handlers (http://www.python.org/doc/current/lib/module-xml.sax.handler.html).

• Check out UserDict (http://www.python.org/doc/current/lib/module-UserDict.html), a Python utility class which provides dictionary-like behavior for arbitrary classes.

• The W3C's DOM Level 2 Specification page (http://www.w3.org/DOM) gives important prerequisite information for working through this tutorial. DOM is a platform- and language-independent interface that allows programs to manipulate the content and structure of a document.

• Python.org's "Python Tutorial" (http://www.python.org/doc/current/tut/tut.html), by Guido van Rossum and Fred L. Drake, Jr., offers an excellent introduction to the language.

• See the XML Cover Pages (http://www.oasis-open.org/cover/dom.html) for specifications on DOM Levels 1, 2, and 3, as well as links to articles and other DOM references.

• For those wanting to look at DOM from a Java perspective, the "Understanding DOM" tutorial (http://www.ibm.com/developerworks/edu/x-dw-xudom-i.html?S_TACT=105AGX06&S_CMP=TUT) shows you the structure of a DOM document as well as how to use Java to create a document from an XML file, make changes to it, and retrieve the output.

• Part 2 (4XPath and 4XSLT) (http://www.ibm.com/developerworks/edu/x-dw-x4suit2-i.html?S_TACT=105AGX06&S_CMP=TUT) gives a thorough introduction to key XML technologies XPath and XSLT for identifying nodes in an XML document's tree and for transforming documents, and it shows how to use them in the 4Suite tool set (October 2001).

• Part 3 (4RDF) (http://www.ibm.com/developerworks/edu/x-dw-x4suite3-i.html?S_TACT=105AGX06&S_CMP=TUT) introduces RDF and details how to work with it using the 4Suite tool set (July 2002).

• Part 4 (Composition and updates) introduces the W3C XML specifications XPointer, XInclude, and XML Base and the independent specification XUpdate, which offers an alternative to XSLT and DOM parsing for updating parts of XML documents (October 2002).

• Part 5 (The repository features) shows how to use the popular open-source 4Suite toolkit for XML processing to create a Web repository application (December 2002).

• developerWorks XML zone (http://www.ibm.com/developerworks/xml/): Find more XML resources here, including articles, tutorials, tips, and standards.


• IBM Certified Solution Developer -- XML and related technologies (http://www.ibm.com/certify/certs/adcdxmlrt.shtml): Learn how to get certified.

Get products and technologies

• The following software packages are required in order to follow this tutorial:

• Python 2.1 (http://www.python.org/download): Python is the object-oriented language that PyXml is written in.

• PyXml (http://sourceforge.net/projects/pyxml): One of the many DOM implementations. This one is written in Python and is packaged as a Python library.

Discuss

• XML forums on developerWorks (http://www.ibm.com/developerworks/forums/dw_xforums.jsp): Discuss all aspects of working with XML.

About the authors

Uche Ogbuji

Uche Ogbuji is a computer engineer, co-founder, and principal consultant at Fourthought Inc. (http://Fourthought.com). He has worked with XML for several years, codeveloping 4Suite (http://4Suite.org), a library of open-source tools for XML development in Python (http://python.org), and 4Suite Server, an open-source, cross-platform XML data server providing standards-based XML solutions. He writes articles on XML for IBM developerWorks, LinuxWorld, SunWorld, and XML.com. Mr. Ogbuji is a Nigerian immigrant living in Boulder, CO.

Chimezie Ogbuji

Chimezie Thomas-Ogbuji is a software consultant for Fourthought Inc. He codevelops 4Suite and 4Suite Server. He enjoys writing and developing computer games in his spare time. He also researches artificial intelligence techniques.
