
1 Extensible Markup Language: XML HTML: widely supported protocol for formatting data XML: widely...

Date post: 21-Dec-2015
Extensible Markup Language: XML

• HTML: widely supported protocol for formatting data

• XML: widely supported protocol for describing data

• XML is quickly becoming standard for data exchange between applications


Root element contains all other document elements

Optional XML declaration includes version information parameter (MUST be very first line of file)

Because of the nice <tag>.. </tag> structure, the data can be viewed as organized in a tree:

article
    title
    date
    author
        firstName
        lastName
    summary
    content


<?xml version = "1.0"?>
<!-- I-sequence structured as XML. -->
<SEQUENCEDATA>
   <TYPE>dna</TYPE>
   <SEQ>
      <NAME>Aspergillus awamori</NAME>
      <ID>U03518</ID>
      <DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggccca
acctcccatccgtgtctattgtaccctgttgcttcgg
cgggcccgccgcttgtcggccgccgggggggcgcctctg
ccccccgggcccgtgcccgccggagaccccaacacgaac
actgtctgaaagcgtgcagtctgagttgattgaatgcaat
cagttaaaactttcaacaatggatctcttggttccggc
      </DATA>
   </SEQ>
</SEQUENCEDATA>

An I-sequence might be structured as XML like this (the second line is a comment). The elements form this tree:

SEQUENCEDATA
    TYPE
    SEQ
        NAME
        ID
        DATA


XML is standard: Parsers exist already!

Standard browsers can format XML documents nicely! Each parent element/node can be expanded and collapsed: a minus sign marks an expanded node, a plus sign a collapsed one.


Python offers a Document Object Model parser!

• A DOM parser returns the whole XML document represented as a tree

• All nodes have a name (of the tag) and a value (data)

• Text (including whitespace) is represented in nodes with tag name #text

article
    #text
    title
        #text  "Simple XML"
    #text
    date
        #text  "Dec..2001"
    #text
    author
        #text
        firstName
            #text  "John"
        #text
        lastName
            #text  "Doe"
        #text
    #text
    summary
        #text  "XML..easy."
    #text
    content
        #text  "In this..XML."
    #text
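To see the #text nodes for yourself, here is a minimal sketch using xml.dom.minidom from the Python standard library. The XML string is a shortened stand-in for the article document (only the title and author parts), and the helper name show is made up for this illustration.

```python
from xml.dom.minidom import parseString

# Assumed stand-in document: element names follow the article tree,
# but only the title and author parts are included.
ARTICLE_XML = """<?xml version = "1.0"?>
<article>
   <title>Simple XML</title>
   <author>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
   </author>
</article>"""

document = parseString(ARTICLE_XML)
root = document.documentElement


def show(node, depth=0):
    """Print one line per node, indented by its depth in the tree."""
    indent = "    " * depth
    if node.nodeType == node.TEXT_NODE:
        # repr() makes whitespace-only text nodes visible.
        print(indent + node.nodeName + " " + repr(node.nodeValue))
    else:
        print(indent + node.nodeName)
    for child in node.childNodes:
        show(child, depth + 1)


show(root)

# The line breaks and indentation between the tags become #text nodes too.
child_names = [child.nodeName for child in root.childNodes]
```

The whitespace-only #text nodes between the elements are the reason tree-walking code usually has to filter on node type or node name.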


deitel_fig16_04revised.py (program listing not preserved in this transcript; its annotations follow)

Parse XML document and load data into variable document

documentElement attribute refers to root node

nodeName refers to element’s tag name

Various node attributes: firstChild, nextSibling, nodeValue, parentNode

NB: Changes since book!
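The annotations above can be illustrated with a short sketch. This is not the revised book program, only an approximation written against xml.dom.minidom; the sample XML, its placeholder date text, and the function name describe are assumptions made for this example.

```python
from xml.dom.minidom import parseString

# Stand-in for the article file. Only the title and the author names are
# taken from the slides; the date text is a placeholder.
SAMPLE = """<?xml version = "1.0"?>
<article>
   <title>Simple XML</title>
   <date>(date placeholder)</date>
   <author>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
   </author>
</article>"""


def describe(document):
    """Return report lines built from the node attributes listed above."""
    root = document.documentElement          # root node of the document
    lines = ["Here is the root element of the document: " + root.nodeName,
             "The following are its child elements:"]
    lines.extend(child.nodeName for child in root.childNodes)

    first = root.firstChild                  # a whitespace #text node
    title = first.nextSibling                # the title element
    lines.append("The first child of root element is: " + first.nodeName)
    lines.append("whose next sibling is: " + title.nodeName)
    lines.append('Text inside "title" tag is ' + title.firstChild.nodeValue)
    lines.append("Parent node of title is: " + title.parentNode.nodeName)
    return lines


report = describe(parseString(SAMPLE))
print("\n".join(report))
```

To read from a file instead of a string, xml.dom.minidom.parse accepts a filename or an open file object.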


Program output

The first child of root element is: #text
whose next sibling is: title
Text inside "title" tag is Simple XML
Parent node of title is: article

Here is the root element of the document: article
The following are its child elements:
#text
title
#text
date
#text
author
#text
summary
#text
content
#text


Parsing XML sequence?

• We have i2xml filter (exercise) – we want xml2i also

• New XML structure for Isequences: holds more than one sequence

• Algorithm:
  – Open file
  – Use Python parser to obtain the DOM tree
  – Traverse tree to extract sequence information, build Isequence objects

Ignoring whitespace nodes, we have to search a tree like this:

SEQUENCEDATA
    SEQ (type)
        NAME
        ID
        DATA
    SEQ (type)
        NAME
        ID
        DATA
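The algorithm can be sketched roughly as follows. This is not the course's xml2i.py: the Isequence class here is a hypothetical stand-in, the attribute name type is assumed from the "SEQ (type)" label, the second sequence is invented, and the XML is parsed from a string rather than an opened file to keep the example self-contained.

```python
from xml.dom.minidom import parseString


class Isequence:
    """Hypothetical stand-in for the course's Isequence class."""

    def __init__(self, name, seq_id, seq_type, data):
        self.name = name
        self.seq_id = seq_id
        self.seq_type = seq_type
        self.data = data


# Assumed new-format input; the sequence data is shortened and the
# second SEQ element is made up for illustration.
NEW_FORMAT = """<?xml version = "1.0"?>
<SEQUENCEDATA>
   <SEQ type="dna">
      <NAME>Aspergillus awamori</NAME>
      <ID>U03518</ID>
      <DATA>aacctgcgga
aggatcatta</DATA>
   </SEQ>
   <SEQ type="dna">
      <NAME>Example two</NAME>
      <ID>X00001</ID>
      <DATA>ttgaca</DATA>
   </SEQ>
</SEQUENCEDATA>"""


def child_text(seq_node, tag):
    """Return the text stored in the #text nodes under the first <tag> child."""
    element = seq_node.getElementsByTagName(tag)[0]
    return "".join(node.nodeValue for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)


def parse_sequences(xml_text):
    """Build one Isequence per SEQ element found under the root."""
    document = parseString(xml_text)
    sequences = []
    for node in document.documentElement.childNodes:
        # Skip the whitespace #text nodes between the SEQ elements.
        if node.nodeType != node.ELEMENT_NODE or node.nodeName != "SEQ":
            continue
        sequences.append(Isequence(
            name=child_text(node, "NAME").strip(),
            seq_id=child_text(node, "ID").strip(),
            seq_type=node.getAttribute("type"),
            # Drop line breaks and spaces inside the sequence data.
            data="".join(child_text(node, "DATA").split())))
    return sequences


sequences = parse_sequences(NEW_FORMAT)
for seq in sequences:
    print(seq.seq_id, seq.seq_type, seq.name, len(seq.data))
```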


xml2i.py (part 1)

We're still being systematic: Usual name for parse method

Obtain a parse tree with the xml data for free

SEQUENCEDATA
    SEQ (type)
    SEQ (type)

Convert this SEQ subtree to an Isequence object


xml2i.py (part 2)

SEQ (type)
    NAME
    ID
    DATA
        #text

Way of getting to all attributes of a node

Way of getting to the value of a specific attribute

Recall: text kept in a #text node underneath
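The three annotations can be tried out on a tiny document. The sketch below uses xml.dom.minidom; the one-line XML input and the attribute name type are assumptions for illustration, not a copy of xml2i.py.

```python
from xml.dom.minidom import parseString

document = parseString(
    '<SEQ type="dna"><NAME>Aspergillus awamori</NAME></SEQ>')
seq = document.documentElement

# All attributes of a node: a NamedNodeMap of name/value pairs.
all_attributes = dict(seq.attributes.items())

# The value of one specific attribute, obtained in two equivalent ways.
seq_type = seq.getAttribute("type")
same_type = seq.attributes["type"].value

# The text of an element sits in a #text child node, not on the element.
name_element = seq.getElementsByTagName("NAME")[0]
name_text = name_element.firstChild.nodeValue

print(all_attributes, seq_type, name_text)
```

Note that nodeValue of the NAME element itself is None; only its #text child carries the string.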


See all the methods and attributes of a DOM tree on pages 537ff

Attribute/Method                   Description
---------------------------------  -----------------------------------------------------------------
appendChild( newChild )            Appends newChild to the list of child nodes. Returns the appended child node.
attributes                         NamedNodeMap that contains the attribute nodes for the current node.
childNodes                         NodeList that contains the node’s current children.
firstChild                         First child node in the NodeList or None, if the node has no children.
insertBefore( newChild, refChild ) Inserts the newChild node before the refChild node. refChild must be a child node of the current node; otherwise, insertBefore raises a ValueError exception.
isSameNode( other )                Returns true if other is the current node.
lastChild                          Last child node in the NodeList or None, if the current node has no children.
nextSibling                        The next node in the NodeList, or None, if the node has no next sibling.
nodeName                           Name of the node, or None, if the node does not have a name.

Possible to manipulate the DOM tree using these methods (add new nodes, remove nodes, set attributes etc.)
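As a small illustration of such manipulation, the sketch below adds, inserts and removes nodes and sets an attribute with xml.dom.minidom. The input document and the attribute name type are assumptions chosen to match the sequence examples.

```python
from xml.dom.minidom import parseString

document = parseString("<SEQ><ID>U03518</ID></SEQ>")
seq = document.documentElement
id_element = seq.firstChild

# Build <NAME>Aspergillus awamori</NAME> and insert it before <ID>.
name_element = document.createElement("NAME")
name_element.appendChild(document.createTextNode("Aspergillus awamori"))
seq.insertBefore(name_element, id_element)

# Append an empty <DATA> element; appendChild returns the appended node.
data_element = seq.appendChild(document.createElement("DATA"))

# Set an attribute on the SEQ element.
seq.setAttribute("type", "dna")

# Remove the DATA node again; removeChild returns the removed node.
removed = seq.removeChild(data_element)

result = seq.toxml()
print(result)
```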


Convert old format XML sequence to new format

Old format: sequence type has its own tag TYPE

SEQUENCEDATA
    TYPE
    SEQ
        NAME
        ID
        DATA

New format: sequence type is attribute of SEQ tag

SEQUENCEDATA
    SEQ (type)
        NAME
        ID
        DATA


old_xml2i.py

Add new method to original xml2i.py and call it after parsing the XML file
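One way such a conversion step might look is sketched below: after parsing, the text of the TYPE element is copied into an attribute on every SEQ element and the TYPE element is removed. The function name convert_old_to_new, the attribute name type, and the shortened sequence data are assumptions; the actual method in old_xml2i.py may differ.

```python
from xml.dom.minidom import parseString

# Old-format input with shortened sequence data.
OLD_FORMAT = """<?xml version = "1.0"?>
<SEQUENCEDATA>
   <TYPE>dna</TYPE>
   <SEQ>
      <NAME>Aspergillus awamori</NAME>
      <ID>U03518</ID>
      <DATA>aacctgcgga</DATA>
   </SEQ>
</SEQUENCEDATA>"""


def convert_old_to_new(document):
    """Turn the TYPE element into a 'type' attribute on every SEQ (in place)."""
    root = document.documentElement
    type_nodes = [node for node in root.childNodes if node.nodeName == "TYPE"]
    if not type_nodes:
        return document              # no TYPE tag: treat as new format
    type_node = type_nodes[0]
    seq_type = type_node.firstChild.nodeValue.strip()
    for node in root.childNodes:
        if node.nodeName == "SEQ":
            node.setAttribute("type", seq_type)
    root.removeChild(type_node)
    return document


document = convert_old_to_new(parseString(OLD_FORMAT))
seq = document.getElementsByTagName("SEQ")[0]
print(seq.getAttribute("type"))
```

Because the tree is rewritten into the new shape, the existing new-format traversal can be reused unchanged afterwards.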


old_xml2phylip.py

Import new module

Check that type information is saved in the Isequence (not used in phylip format)


Testing on old format XML sequence

U03518b.xml:

<?xml version = "1.0"?>
<SEQUENCEDATA>
<TYPE>dna</TYPE>
<SEQ>
<NAME>Aspergillus awamori</NAME>
<ID>U03518</ID>
<DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatccgtgtctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgccccccgggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttgattgaatgcaatcagttaaaactttcaacaatggatctcttggttccggc</DATA>
</SEQ>
</SEQUENCEDATA>

python old_xml2phylip.py U03518b.xml U03518b

sequence is of type dna


Remark: book uses old version of DOM parser

• XML examples in book won’t work (except the revised fig16.04)

• Look in the presented example programs to see what you have to import

• All the methods and attributes of a DOM tree on pages 537ff are the same


.. on to the exercises

