SAX to DOM Implementation Notes

MSXML 5.0 SDK

Microsoft XML Core Services (MSXML) 5.0 for Microsoft Office - SAX2 Developer's Guide

Be aware of the following considerations when you use an MXXMLWriter object to build a DOMDocument object from SAX events.

Reading and Modifying the DOMDocument Object

When a DOMDocument object is built from SAX events, the document is locked and cannot be modified until the endDocument method is called. The document can, however, be read at any time after the startDocument method is called. The following is an overview of how the document is locked as it is built.

When the startDocument method is called, a write-lock is imposed on the document; existing data in the document is removed; and the lock is downgraded to a read-lock.

The read-lock is held on the document until one of the following conditions is met:

  • The endDocument method is called.
  • Any method on the errorHandler interface is called.
  • The MXXMLWriter object is released and destroyed.
  • There is an error (for example, a call to startElement fails).

Any attempt to modify the document while the read-lock is held raises an error if the call is made from the current thread, or blocks if the call is made from a different thread.
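
The locking rules above can be pictured with a small model. The following Python sketch is not MSXML code: ModelDocument and its methods are hypothetical names, and a threading.Event stands in for the real lock. It assumes only what is stated above: startDocument takes a write-lock, clears existing data, and downgrades to a read-lock; endDocument releases it; a modification fails on the building thread and blocks on any other thread.

```python
import threading


class ModelDocument:
    """Toy model of a document being built from SAX events (not the MSXML API)."""

    def __init__(self):
        self.nodes = ["stale"]          # pre-existing content
        self.lock = None                # None, "write", or "read"
        self._builder = None            # id of the thread driving the build
        self._released = threading.Event()
        self._released.set()            # no build in progress yet

    def start_document(self):
        self._released.clear()
        self._builder = threading.get_ident()
        self.lock = "write"             # 1. impose the write-lock
        self.nodes.clear()              # 2. remove existing data
        self.lock = "read"              # 3. downgrade to a read-lock

    def end_document(self):
        # endDocument is one of the conditions that releases the read-lock.
        self.lock = None
        self._released.set()

    def read(self):
        # Readers are admitted while the read-lock is held.
        return list(self.nodes)

    def modify(self, node):
        if self.lock == "read" and threading.get_ident() == self._builder:
            raise RuntimeError("document is read-locked during the build")
        self._released.wait()           # other threads block until release
        self.nodes.append(node)
```

In this model, calling modify between start_document and end_document on the same thread raises immediately, while a call from a second thread simply waits and completes once end_document runs. The other release conditions (an error handler call, release of the writer, a failed callback) are not modeled.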

Note   For MSXML 5.0, the new parser is always used for a DOMDocument object that is connected to receive the output of SAX events. The new parser does not support DTD validation. For example, if you call startDTD on the SAX lexical handler (ISAXLexicalHandler) without first setting the validateOnParse property of the DOMDocument object to False, the call raises an error.

Supported Methods

When a DOMDocument object is set as writer output, only the following methods are supported.

  • get/set output
  • get/set indent
  • flush()

Allowed Sequence of Handler Callbacks

The following sequence of handler callbacks can be invoked.

startDocument
(comment | processingInstruction)*
(startDTD DTD_CONTENT endDTD)?
(comment | processingInstruction)*
(startElement ELEMENT_CONTENT endElement)
(comment | processingInstruction)*
endDocument

DTD_CONTENT :=
elementDecl
| attributeDecl
| internalEntityDecl
| externalEntityDecl
| unparsedEntityDecl
| notationDecl
| startEntity ELEMENT_CONTENT* endEntity
| processingInstruction
| comment

ELEMENT_CONTENT :=
characters
| ignorableWhitespace
| startCDATA characters* endCDATA
| startElement ELEMENT_CONTENT* endElement
| skippedEntity
| startEntity ELEMENT_CONTENT* endEntity
| processingInstruction
| comment

When building a DOMDocument object, the startDocument method and its corresponding endDocument method must both be called. If the callback sequence is not valid (for example, endElement is called without closing an open CDATA section), errorHandler aborts the parse and builds a parseError object from the provided information.
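
As an illustration of the pairing rules in the grammar above, the following Python sketch checks a list of callback names before they are sent to a writer. It is a simplified model, not part of MSXML: check_sequence is a hypothetical helper, and it verifies only that startDocument and endDocument bracket the sequence and that the start/end pairs nest correctly. It does not enforce the full grammar (for example, the single root element).

```python
# Maps each closing callback to the opening callback it must match.
PAIRS = {
    "endDTD": "startDTD",
    "endElement": "startElement",
    "endCDATA": "startCDATA",
    "endEntity": "startEntity",
}
OPENERS = set(PAIRS.values())


def check_sequence(events):
    """Return None if the paired callbacks nest correctly, else an error string."""
    if not events or events[0] != "startDocument":
        return "startDocument must be the first callback"
    if events[-1] != "endDocument":
        return "endDocument must be the last callback"
    open_scopes = []  # stack of currently open start* callbacks
    for name in events[1:-1]:
        if name in ("startDocument", "endDocument"):
            return f"{name} is only valid at the document boundary"
        if name in OPENERS:
            open_scopes.append(name)
        elif name in PAIRS:
            if not open_scopes or open_scopes[-1] != PAIRS[name]:
                innermost = open_scopes[-1] if open_scopes else "nothing"
                return f"{name} called while {innermost} is open"
            open_scopes.pop()
    if open_scopes:
        return f"{open_scopes[-1]} was never closed"
    return None
```

Passing the sequence startDocument, startElement, startCDATA, characters, endElement, endDocument to this helper reports that endElement arrived while a CDATA section was still open, which is the invalid case described above.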

indent Property

When the indent property is set, each characters event is scanned to determine whether it consists purely of white space. If it does, it is treated just like a white space event from the old parser. That is, when the only event separating other (non-characters) events is a characters event that contains only white space, no text node is built for the white space, but the white space is remembered internally (as hints).

For example:

   startElement("","","a",attrs);
   characters("  ");
   endElement("","","a");

normally builds a DOM that looks like this:

   Element: nodeName="a"
       |
   TextNode "  "

which would be output like this:

   <a>  </a>

If the indent property is set, the Element node does not have a text child and the output looks like this:

   <a>
   </a>

Note   The indenting is due to special hints that are stored internally. The hints are not exposed in any way, except as indentation when the document is saved.
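
The effect of the indent property on the tree can be approximated with Python's standard xml.dom.minidom module. This is a sketch under stated assumptions, not MSXML behavior: build is a hypothetical helper, events are modeled as (name, data) tuples, and the internal indenting hints are not reproduced, so the serialized output of minidom will not match the indented output shown above. The sketch demonstrates only that a characters event containing nothing but white space produces no text node when indent is set.

```python
from xml.dom.minidom import Document

XML_WHITESPACE = " \t\r\n"


def build(events, indent=False):
    """Build a minidom document from (name, data) events, modeling the indent rule."""
    doc = Document()
    current = doc  # node that receives the next child
    for name, data in events:
        if name == "startElement":
            element = doc.createElement(data)
            current.appendChild(element)
            current = element
        elif name == "endElement":
            current = current.parentNode
        elif name == "characters":
            if indent and data.strip(XML_WHITESPACE) == "":
                continue  # whitespace-only text: no text node is built
            current.appendChild(doc.createTextNode(data))
    return doc


events = [("startElement", "a"), ("characters", "  "), ("endElement", "a")]
plain = build(events)                   # <a> keeps its text child
indented = build(events, indent=True)   # <a> has no children
```

Here plain.documentElement has one text child holding the two spaces, while indented.documentElement has none.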

See Also

MXXMLWriter CoClass | output Property | DOMDocument