<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
-->
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
  node</a></li>
  <li><a href="#Extracting1">Extracting information for the
  attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entities substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
  operations</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2 <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation results in a document loaded
completely in memory and exposed as a tree of nodes all available at the
same time. This is very simple and quite powerful, but has the major
limitation that the size of the document that can be handled is limited by
the size of the memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering callbacks
which are called directly by the parser as it progresses through the document
stream. The problem is that this programming model is relatively complex,
is not well standardized, cannot provide validation directly, and makes entity,
namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling, and
adding DTD validation on top of it was relatively simple. This API is really
close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
            return -1;    /* parsing error */
        }
    } else {
        printf("Unable to open %s\n", filename);
        return -1;        /* the reader could not be created */
    }
    return 0;             /* the whole document was read */
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except libxml2.treeError:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing: it will get garbage collected
once all references have disappeared.</p>
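
<p>The read loop above can be written once and reused. The following is a
minimal sketch, not part of libxml2: the helper name
<code>stream_nodes</code> is hypothetical, and it only assumes a reader
object whose <code>Read()</code> method follows the convention described
above (1 for a node, 0 at the end of the document, a negative value on
error).</p>

```python
def stream_nodes(reader, process_node):
    """Drive a reader to the end of its input, calling process_node(reader)
    once per node. Returns the number of nodes visited and raises
    ValueError if Read() reports a parsing error.

    stream_nodes is a hypothetical helper; `reader` is any object with a
    Read() method following the 1 / 0 / negative convention.
    """
    count = 0
    ret = reader.Read()
    while ret == 1:
        process_node(reader)
        count += 1
        ret = reader.Read()
    if ret != 0:
        raise ValueError("parsing failed (Read() returned %d)" % ret)
    return count
```

<p>With the Python bindings installed it would be called as
<code>stream_nodes(libxml2.newTextReaderFilename("tst.xml"),
processNode)</code>.</p>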

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: The node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starts at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
  yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: check if the current node is empty, this is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>
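
<p>When printing debugging output, the numeric NodeType values are easier
to follow with a label attached. The table below is a small sketch that
only restates the codes listed above; the labels and the
<code>node_type_name</code> helper are informal names chosen for this
example, not identifiers exported by libxml2.</p>

```python
# Node type codes exactly as listed above; the labels are informal names
# chosen for this sketch, not identifiers exported by libxml2.
NODE_TYPE_NAMES = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/Doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}


def node_type_name(code):
    """Return a readable label for a NodeType() value, or a generic
    fallback for codes that the list above does not cover."""
    return NODE_TYPE_NAMES.get(code, "other (%d)" % code)
```

<p>Inside a processNode() routine it would be used as
<code>node_type_name(reader.NodeType())</code>.</p>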

<p>Let's first put this into practice with a small example, redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as children nodes are
    explored</li>
  <li>the text node child of the b element, of type 3, and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>
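
<p>One consequence of the output above is that an empty element such as
<code>a</code> or <code>c</code> produces a start event (type 1) but no
end event (type 15), so code tracking the open elements has to check
IsEmptyElement. The following is a minimal sketch of that bookkeeping;
<code>element_paths</code> is a hypothetical helper that relies only on
the Read(), NodeType(), Name() and IsEmptyElement() methods shown
above.</p>

```python
ELEMENT_START = 1   # NodeType() value for the start of an element
ELEMENT_END = 15    # NodeType() value for the end of an element


def element_paths(reader):
    """Return the slash-separated path of every element met by the reader.

    A stack of open element names is kept up to date from the start and
    end events. An empty element produces no end event, so it is never
    pushed on the stack.
    """
    paths = []
    stack = []
    ret = reader.Read()
    while ret == 1:
        kind = reader.NodeType()
        if kind == ELEMENT_START:
            paths.append("/" + "/".join(stack + [reader.Name()]))
            if not reader.IsEmptyElement():
                stack.append(reader.Name())
        elif kind == ELEMENT_END:
            stack.pop()
        ret = reader.Read()
    if ret != 0:
        raise ValueError("parsing failed")
    return paths
```

<p>For the sample document above, this sketch would return the paths
<code>/doc</code>, <code>/doc/a</code>, <code>/doc/b</code> and
<code>/doc/c</code>.</p>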

<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This shows that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property allows detecting their presence. To check
their content the API has dedicated operations. Basically two kinds of
operations are possible:</p>
<ol>
  <li>moving the reader to the attribute nodes of the current element, in
    which case the cursor is positioned on the attribute node</li>
  <li>directly querying the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in the
list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of the
    attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>The output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>Their depth is that of the carrying element plus one.</li>
  <li>Namespace declarations are seen as attributes, as in DOM.</li>
</ul>
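
<p>Moving the cursor onto the attributes leaves it on the last attribute
once the loop ends. The sketch below collects the attributes and then
calls MoveToElement() to put the cursor back on the carrying element.
<code>read_attributes</code> is a hypothetical helper, and the comparison
with 1 assumes that MoveToNextAttribute() returns 1 on success as the
underlying C function does.</p>

```python
def read_attributes(reader):
    """Return the attributes of the current element as a list of
    (name, value) pairs, leaving the reader positioned on the element.

    read_attributes is a hypothetical helper built only on the
    MoveToNextAttribute(), Name(), Value() and MoveToElement() methods
    described above.
    """
    attributes = []
    # Assumption: 1 means the cursor moved to an attribute.
    while reader.MoveToNextAttribute() == 1:
        attributes.append((reader.Name(), reader.Value()))
    if attributes:
        # The cursor now sits on the last attribute: move it back to the
        # element that carries the attributes.
        reader.MoveToElement()
    return attributes
```

<p>When only one value is needed, the direct query
<code>reader.GetAttribute(name)</code> listed above avoids moving the
cursor at all.</p>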

<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the XmlTextReader
API. The main one is the ability to validate the parsed document against its DTD
progressively. This is simply the activation of the associated feature of the
parser used by the reader structure. A few options are available,
defined as the enum xmlParserProperties in the libxml/xmlreader.h header
file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added, those were the ones available at the 2.5.0
    release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>
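
<p>The pieces above can be combined into one routine that sets several
parser properties, checks each return value and reports the validity
status at the end. This is only a sketch: <code>read_with_properties</code>
is a hypothetical helper, and it assumes that the Python SetParserProp()
method returns the same status as the C function (0 on success) and that
the reader's IsValid() method returns 1 for a valid document.</p>

```python
def read_with_properties(reader, properties):
    """Set parser properties on a fresh reader, read the whole document
    and report whether it was found valid.

    `properties` is a list of (property, value) pairs, for instance
    [(libxml2.PARSER_VALIDATE, 1)]. read_with_properties is a
    hypothetical helper name.
    """
    for prop, value in properties:
        # Assumption: 0 indicates success, as for
        # xmlTextReaderSetParserProp() at the C level.
        if reader.SetParserProp(prop, value) != 0:
            raise ValueError("cannot set parser property %r" % (prop,))
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        raise ValueError("parsing failed")
    return reader.IsValid() == 1
```

<p>The properties have to be set before the first call to Read(), which
is why the helper takes a reader that has not been advanced yet.</p>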

<h2><a name="Entities">Entities substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now validate the document being read through the xmlReader
against Relax-NG schemas. The Relax-NG validator cannot always work in a
streaming mode, but only the parts of the schema which cannot be reduced to
regular expressions need to have their subtree expanded for validation. In
practice this means that, unless the schema for the content of the top-level
element is not expressible as a regexp, only chunks of the document need to be
expanded while validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to read, associate it with a Relax-NG schema, either the
    preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader, as for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:</p>
<pre><code>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</code>
</pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>

<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will have all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example adapted from
reader5.py in the sources which extracts and prints the bibliography entry for the
"Dragon" compiler book from the XML 1.0 recommendation:</p>
<pre>import libxml2

f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res=""
while reader.Read() == 1:                 # stop at the end or on error
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

<p>Note, however, that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones. Usually, though, once processed the full subtree is
not useful anymore, and the Next() operation skips it completely and
proceeds to the successor, or returns 0 if the end of the document is reached.</p>
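
<p>The same pattern can be packaged as a function. The sketch below is not
taken from the libxml2 sources: <code>serialize_matching</code> and its
<code>keep</code> predicate are hypothetical names, and the code relies
only on the Read(), Next(), Expand(), NodeType() and Name() methods
discussed above. Since the expanded node does not outlive the next move of
the reader, it is serialized immediately and never stored.</p>

```python
ELEMENT_START = 1  # NodeType() value for the start of an element


def serialize_matching(reader, name, keep):
    """Return the serialization of every `name` element accepted by the
    keep(node) predicate, skipping each subtree once it has been handled.

    The expanded node is only valid until the reader moves again, so it is
    serialized before Next() is called and never stored.
    """
    results = []
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == ELEMENT_START and reader.Name() == name:
            node = reader.Expand()
            if keep(node):
                results.append(node.serialize())
            # Next() skips the subtree and stops on the following node,
            # which is examined by the next loop iteration without Read().
            ret = reader.Next()
        else:
            ret = reader.Read()
    if ret != 0:
        raise ValueError("parsing failed")
    return results
```

<p>For the example above, the predicate could be
<code>lambda node: node.xpathEval("@id = 'Aho'")</code> with
<code>'bibl'</code> as the element name.</p>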

<p><a href="mailto:xml@gnome.org">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>