White Space and the DOM

Microsoft XML Core Services (MSXML) 5.0 for Microsoft Office - DOM Developer's Guide

To achieve maximum performance, the parser strips most white space when XML is loaded with the xmlDoc.load or xmlDoc.loadXML methods (where xmlDoc is an XML DOM document), unless specifically directed otherwise. In place of the stripped characters, the parser sets a flag within each node to note whether one or more spaces, tabs, newlines, or carriage returns follow the node in the text.

As a result, a node in an XML DOM document "knows" that at least one occurrence of white space follows or precedes it, but does not know exactly how much white space there was. This method is efficient, reducing both the size of each XML file and the number of calculations required to redisplay the XML in a browser.
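The strip-and-flag behavior described above can be sketched outside MSXML. The following is a minimal illustration only, using Python's standard xml.dom.minidom as a stand-in (minidom itself always keeps white space). The strip_white_space helper and the flagged list are hypothetical names introduced for this sketch, and it assumes that only text nodes consisting entirely of white space are removed.

```python
from xml.dom import minidom

# Names of elements that were followed by white space (hypothetical flag store).
flagged = []


def strip_white_space(node):
    """Remove whitespace-only text nodes and flag the sibling each one followed."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            previous = child.previousSibling
            if previous is not None:
                # Record only THAT white space followed the node, not how much.
                flagged.append(previous.nodeName)
            node.removeChild(child)
        else:
            strip_white_space(child)


source = "<root>\n\t<a>1</a>\n\t\t<b>2</b>\n</root>"
doc = minidom.parseString(source)
strip_white_space(doc.documentElement)
compact = doc.documentElement.toxml()
print(compact)
print(flagged)
```

In this sketch the flag records that white space followed a and b, but not that a was followed by a newline and two tabs, so the original indentation cannot be rebuilt from the flags.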

However, because the exact white space is discarded, an XML document stored in this manner can lose formatting information. Tabs, in particular, can be lost because the default mode does not distinguish them from any other white space.

XSL Transformations (XSLT) uses the XML Document Object Model (DOM), not the source document, to guide its transformation. Because the white space has already been stripped to process the XML into the DOM, white space characters are lost even before the transformation takes place. Most of the XSLT-related methods for specifying white space in the source data document or style sheets are applied too late to make a difference in formatting.
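The ordering problem can be illustrated with a small sketch. This is not XSLT: the hypothetical transform_to_text function only concatenates text nodes in document order, standing in for any transformation that reads the DOM instead of the source text. The load helper and its preserve_white_space parameter are also assumptions of this sketch, built on Python's xml.dom.minidom.

```python
from xml.dom import minidom


def _strip(node):
    """Drop whitespace-only text nodes, as a stripping parser would at load time."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not child.data.strip():
                node.removeChild(child)
        else:
            _strip(child)


def load(text, preserve_white_space=False):
    """Hypothetical loader: white space is kept or stripped here, before any transform."""
    doc = minidom.parseString(text)
    if not preserve_white_space:
        _strip(doc.documentElement)
    return doc


def transform_to_text(node):
    """Stand-in for a transformation: concatenate text nodes in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(transform_to_text(child) for child in node.childNodes)


source = "<p><w>white</w> <w>space</w></p>"
late = transform_to_text(load(source).documentElement)
early = transform_to_text(load(source, preserve_white_space=True).documentElement)
print(repr(late))
print(repr(early))
```

Whatever the transformation does, it sees only the nodes that survived loading, so the space between the two w elements is available in the second case and gone in the first.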

The preserveWhiteSpace property tells the XML parser whether to retain white space from the source file when it builds the XML DOM. The default is False. To retain white space, set the property to True before loading a file; otherwise, the parser strips white space characters and reduces the document to the smallest possible stream. When preserveWhiteSpace is set to True, the XML document retains all of the characters within the file when converted into a DOM.
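The set-before-load pattern might be modeled as follows. ToyDocument, preserve_white_space, and load_xml are hypothetical names that loosely mirror the property and method discussed here. They are not the MSXML API, and the sketch assumes the flag is consulted once, at load time.

```python
from xml.dom import minidom


class ToyDocument:
    """Hypothetical stand-in for a DOM document with a preserve flag."""

    def __init__(self):
        self.preserve_white_space = False  # mirrors the documented default
        self.dom = None

    def load_xml(self, text):
        # The flag is read here, so it has to be set before this call.
        self.dom = minidom.parseString(text)
        if not self.preserve_white_space:
            self._strip(self.dom.documentElement)

    def _strip(self, node):
        for child in list(node.childNodes):
            if child.nodeType == child.TEXT_NODE:
                if not child.data.strip():
                    node.removeChild(child)
            else:
                self._strip(child)


source = "<root>\n  <a/>\n  <b/>\n</root>"

default_doc = ToyDocument()
default_doc.load_xml(source)

preserving_doc = ToyDocument()
preserving_doc.preserve_white_space = True  # set before loading
preserving_doc.load_xml(source)

default_count = len(default_doc.dom.documentElement.childNodes)
preserving_count = len(preserving_doc.dom.documentElement.childNodes)
stripped_xml = default_doc.dom.documentElement.toxml()
print(default_count, preserving_count)
```

With the default setting the root element keeps only its two element children. With the flag set first, the three whitespace-only text nodes around them are kept as well.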

If you change the preserveWhiteSpace property from True to False and then back to True for a given DOM document, the spaces will not reappear. Setting the property to False actually removes the white space from the DOM, which cannot reconstruct it.
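A toy model of this one-way behavior, under the assumption stated in the text that switching the property to False discards the white space nodes immediately. The ToyDocument class, its preserve_white_space setter, and its xml property are hypothetical and are built on Python's xml.dom.minidom, not on MSXML.

```python
from xml.dom import minidom


def strip_white_space(node):
    """Remove whitespace-only text nodes from the tree, in place."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not child.data.strip():
                node.removeChild(child)
        else:
            strip_white_space(child)


class ToyDocument:
    """Hypothetical document that starts out preserving white space."""

    def __init__(self, text):
        self._preserve = True
        self.dom = minidom.parseString(text)

    @property
    def preserve_white_space(self):
        return self._preserve

    @preserve_white_space.setter
    def preserve_white_space(self, value):
        self._preserve = bool(value)
        if not self._preserve:
            # Assumed behavior: turning the flag off discards the nodes for good.
            strip_white_space(self.dom.documentElement)

    @property
    def xml(self):
        return self.dom.documentElement.toxml()


doc = ToyDocument("<root>\n\t<a>1</a>\n</root>")
before = doc.xml
doc.preserve_white_space = False
doc.preserve_white_space = True  # too late: nothing is left to restore
after = doc.xml
print(repr(before))
print(repr(after))
```

Setting the flag back to True changes only the flag. The removed text nodes are no longer in the tree, so the serialized output stays compact.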

If you are working with XML as a data format streamed to some other process, leave preserveWhiteSpace disabled by setting it to False or by allowing it to default to False. If retaining positional information is important, for example, in conversions to non-XML formats like tab-separated data, set preserveWhiteSpace to True. Be aware that this option increases the number of characters in the DOM and places more demands on the browser.
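As a rough illustration of the tab-separated case, the sketch below flattens each row of a small document to text, once with whitespace-only text nodes kept and once with them skipped. The text_of helper, its preserve_white_space parameter, and the sample data are hypothetical, and Python's xml.dom.minidom again stands in for the MSXML DOM.

```python
from xml.dom import minidom


def text_of(node, preserve_white_space):
    """Concatenate text under node, optionally skipping whitespace-only text nodes."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            if preserve_white_space or child.data.strip():
                parts.append(child.data)
        else:
            parts.append(text_of(child, preserve_white_space))
    return "".join(parts)


# The tab between the two c elements is the only field separator in each row.
source = (
    "<rows>"
    "<row><c>Ada</c>\t<c>Lovelace</c></row>"
    "<row><c>Alan</c>\t<c>Turing</c></row>"
    "</rows>"
)
doc = minidom.parseString(source)
rows = doc.getElementsByTagName("row")

preserved = [text_of(row, True) for row in rows]
stripped = [text_of(row, False) for row in rows]
print(preserved)
print(stripped)
```

When the separating tabs are kept, each row can be written out directly as a tab-separated line. When they are dropped, the field boundaries are lost and the values run together.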