Python Data Persistence - XML Parsers



XML is an acronym for eXtensible Markup Language. It is a portable, open, cross-platform markup language, much like HTML or SGML, and is recommended by the World Wide Web Consortium.

It is a well-known data interchange format, used by a large number of applications such as web services, office tools, and Service Oriented Architectures (SOA). The XML format is both machine-readable and human-readable.

The xml package in Python's standard library consists of the following modules for XML processing −

Sr.No. Module and Description

1. xml.etree.ElementTree - the ElementTree API, a simple and lightweight XML processor
2. xml.dom - the DOM API definition
3. xml.dom.minidom - a minimal DOM implementation
4. xml.sax - SAX2 interface implementation
5. xml.parsers.expat - the Expat parser binding

Data in an XML document is arranged in a tree-like hierarchy that starts at a single root element. Each element is a node in the tree, delimited by a start tag such as <name> and an end tag such as </name>. An element may carry attributes in its start tag, contain text, and have one or more sub-elements.

Following is a typical example of an XML document −

<?xml version = "1.0" encoding = "iso-8859-1"?>
<studentlist>
   <student>
      <name>Ratna</name>
      <subject>Physics</subject>
      <marks>85</marks>
   </student>
   <student>
      <name>Kiran</name>
      <subject>Maths</subject>
      <marks>100</marks>
   </student>
   <student>
      <name>Mohit</name>
      <subject>Biology</subject>
      <marks>92</marks>
   </student>
</studentlist>

When using the ElementTree module, the first step is to create the root element of the tree. Each Element has a tag and an attrib property, which is a dict object. For the root element created here, attrib is an empty dictionary.

import xml.etree.ElementTree as xmlobj
root = xmlobj.Element('studentlist')

Now we can add one or more elements under the root element. Each element may have sub-elements, and each sub-element has its own attrib and text properties.

student = xmlobj.Element('student')
# Each SubElement() call creates a child of <student> and returns it
nm = xmlobj.SubElement(student, 'name')
nm.text = 'Ratna'
subject = xmlobj.SubElement(student, 'subject')
subject.text = 'Physics'
marks = xmlobj.SubElement(student, 'marks')
marks.text = '85'

This new element is appended to the root using the append() method.

root.append(student)

Append as many elements as desired using the above method. Finally, the root element is wrapped in an ElementTree object and written to a file.

tree = xmlobj.ElementTree(root)
# Open in binary mode: write() emits encoded bytes by default
with open('studentlist.xml', 'wb') as file:
   tree.write(file)
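
The steps above can be combined into a short sketch that builds the whole tree from a list of dictionaries. The students list and the build_studentlist() helper are assumptions made for illustration and are not part of the ElementTree API. The result is serialized to a string rather than a file so the sketch has no side effects.

```python
import xml.etree.ElementTree as xmlobj

# Hypothetical sample data, mirroring the example document
students = [
    {'name': 'Ratna', 'subject': 'Physics', 'marks': '85'},
    {'name': 'Kiran', 'subject': 'Maths', 'marks': '100'},
    {'name': 'Mohit', 'subject': 'Biology', 'marks': '92'},
]

def build_studentlist(records):
    """Return a <studentlist> element with one <student> child per record."""
    root = xmlobj.Element('studentlist')
    for record in records:
        # SubElement() creates the child and attaches it to its parent
        student = xmlobj.SubElement(root, 'student')
        for tag, value in record.items():
            xmlobj.SubElement(student, tag).text = str(value)
    return root

root = build_studentlist(students)
# encoding='unicode' makes tostring() return a str instead of bytes
xml_text = xmlobj.tostring(root, encoding='unicode')
print(xml_text)
```

Using SubElement() with the root as parent avoids a separate append() call for each student; both approaches produce the same tree.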

Now let us see how to parse the XML file. To do so, construct the document tree by passing the file name as the file parameter of the ElementTree constructor.

tree = xmlobj.ElementTree(file='studentlist.xml')

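The tree object has a getroot() method to obtain the root element. Iterating over an element, or calling list() on it, yields the elements directly below it. The older getchildren() method was deprecated and removed in Python 3.9.

root = tree.getroot()
children = list(root)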

A dictionary corresponding to each child node is constructed by iterating over that child's sub-elements.

students = []
for child in children:
   student = {}
   # Each sub-element contributes one tag/text pair
   for pair in child:
      student[pair.tag] = pair.text
   students.append(student)

Each dictionary is appended to a list, giving a list of dictionary objects with one entry per student.
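
As a compact variant of the same idea, the following sketch parses XML held in a string instead of a file. The XML_TEXT sample and the students_from_xml() helper are illustrative assumptions; fromstring() and findall() are standard ElementTree calls.

```python
import xml.etree.ElementTree as xmlobj

# Hypothetical sample input, a shortened version of the example document
XML_TEXT = """<studentlist>
   <student>
      <name>Ratna</name>
      <subject>Physics</subject>
      <marks>85</marks>
   </student>
   <student>
      <name>Kiran</name>
      <subject>Maths</subject>
      <marks>100</marks>
   </student>
</studentlist>"""

def students_from_xml(text):
    """Return one dict per <student>, mapping each child tag to its text."""
    root = xmlobj.fromstring(text)
    return [
        {field.tag: field.text for field in student}
        for student in root.findall('student')
    ]

records = students_from_xml(XML_TEXT)
for record in records:
    print(record)
```

Note that all values come back as strings; converting marks to int, if needed, is left to the caller.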

SAX (Simple API for XML) is a standard interface for event-driven XML parsing. Parsing XML with SAX requires a handler class created by subclassing xml.sax.ContentHandler. You override the callbacks for the events of interest and then let the parser proceed through the document.

SAX is useful when your documents are large or memory is limited: it parses the file as it is read from disk, so the entire file is never held in memory.
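
A minimal sketch of such a handler is shown below. The StudentHandler class, its attribute names, and the XML_TEXT sample are assumptions made for illustration; startElement(), characters(), and endElement() are the standard ContentHandler callbacks. The handler collects each student into a dictionary as the corresponding events arrive.

```python
import xml.sax

# Hypothetical sample input, a shortened version of the example document
XML_TEXT = """<studentlist>
   <student>
      <name>Ratna</name>
      <subject>Physics</subject>
      <marks>85</marks>
   </student>
   <student>
      <name>Kiran</name>
      <subject>Maths</subject>
      <marks>100</marks>
   </student>
</studentlist>"""

class StudentHandler(xml.sax.ContentHandler):
    """Collect each <student> element into a dict of tag/text pairs."""

    def __init__(self):
        super().__init__()
        self.students = []
        self._current = None   # dict for the student being read, if any
        self._field = None     # tag of the field being read, if any
        self._chunks = []      # text pieces of the current field

    def startElement(self, name, attrs):
        if name == 'student':
            self._current = {}
        elif self._current is not None:
            self._field = name
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it
        if self._field is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == 'student':
            self.students.append(self._current)
            self._current = None
        elif name == self._field:
            self._current[name] = ''.join(self._chunks).strip()
            self._field = None

handler = StudentHandler()
xml.sax.parseString(XML_TEXT.encode('utf-8'), handler)
print(handler.students)
```

To process a file instead of a string, xml.sax.parse() accepts a file name or file object together with the handler.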

The Document Object Model (DOM) API is a World Wide Web Consortium recommendation. In this case, the entire file is read into memory and stored in a hierarchical (tree-based) form that represents all the features of an XML document.
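
As a rough illustration, the sketch below uses xml.dom.minidom to load a document into memory, read values from the tree, and change one of them in place. The XML_TEXT sample and the variable names are assumptions made for this example.

```python
from xml.dom import minidom

# Hypothetical sample input, a shortened version of the example document
XML_TEXT = """<studentlist>
   <student>
      <name>Ratna</name>
      <subject>Physics</subject>
      <marks>85</marks>
   </student>
   <student>
      <name>Kiran</name>
      <subject>Maths</subject>
      <marks>100</marks>
   </student>
</studentlist>"""

# The whole document is parsed into an in-memory tree
doc = minidom.parseString(XML_TEXT)

# Each <name> element holds a single text node as its first child
names = [node.firstChild.data for node in doc.getElementsByTagName('name')]

# Because the tree lives in memory, it can be modified and re-serialized
first_marks = doc.getElementsByTagName('marks')[0]
first_marks.firstChild.data = '90'
updated = doc.documentElement.toxml()

print(names)
print(updated)
```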

SAX is generally not as fast as DOM when working with large files. On the other hand, DOM can consume a lot of resources, especially when used on many small files. SAX is read-only, while DOM allows changes to the XML document.
