White Space

MSXML 5.0 SDK

Microsoft XML Core Services (MSXML) 5.0 for Microsoft Office - XML Developer's Guide

The World Wide Web Consortium (W3C) XML specification normalizes different line-ending conventions to a single convention but preserves all other white space, except in attribute values. XML also provides a set of tools that documents can use to signal to applications whether white space must be preserved.

White Space and the XML Declaration

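According to the current XML 1.0 standard, white space is not allowed before the XML declaration. If a declaration is present, it must begin at the very first character of the document, as in the following example.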

<?xml version="1.0"?>
 <BOOK>
  <BOOKNAME>XML</BOOKNAME>
 </BOOK>

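If white space appears before the XML declaration, the parser no longer recognizes it as a declaration and treats it as a processing instruction. The information it carries, particularly the encoding, might then not be used by the parser. Because the target name xml is reserved by the XML specification, a conforming parser can also reject such a document as not well formed.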
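As a rough illustration of this rule, an application could check the text of a document for leading white space before handing it to a parser. The function name hasLeadingWhiteSpaceBeforeDeclaration is hypothetical and not part of MSXML. The check is a simplified sketch: it looks only at the characters in front of the first <?xml and does not validate the declaration itself.

```javascript
// Hypothetical helper, not part of MSXML: reports whether an XML
// declaration is preceded by white space (space, tab, CR, or LF).
function hasLeadingWhiteSpaceBeforeDeclaration(text) {
  // One or more white space characters, then "<?xml", then white space,
  // so that targets such as "xml-stylesheet" are not matched.
  return /^[ \t\r\n]+<\?xml[ \t\r\n]/.test(text);
}

var good = '<?xml version="1.0"?>\n<BOOK/>';
var bad = ' <?xml version="1.0"?>\n<BOOK/>';

console.log(hasLeadingWhiteSpaceBeforeDeclaration(good)); // false
console.log(hasLeadingWhiteSpaceBeforeDeclaration(bad));  // true
```

A check like this only flags the problem. Whether to strip the white space or reject the document is left to the application.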

For more information about the XML declaration, see XML Declaration.

White Space in Element Content

XML parsers are required to report all white space that appears in element content within a document. For this reason, the following three documents are different to an XML parser.

<document>
<data>1</data>
<data>2</data>
<data>3</data>
</document>

and:

<document><data>1</data><data>2</data><data>3</data></document>

and:

<document><data>1</data> <data>2</data> <data>3</data></document>

For some applications, the values of the three data points matter more than the pretty-printing. For document-oriented XML applications, white space preservation can be critical.
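To make the difference between the three documents concrete, the following sketch counts the white-space-only text runs that sit between tags. It is a naive illustration under simplifying assumptions, not a real XML parser: the function name countWhiteSpaceOnlyRuns is hypothetical, and the code ignores comments, CDATA sections, and processing instructions.

```javascript
// Naive sketch, not a real XML parser: splits on anything that looks like
// a tag and counts the text runs that consist only of white space.
function countWhiteSpaceOnlyRuns(xml) {
  var runs = xml.split(/<[^>]*>/);
  var count = 0;
  for (var i = 0; i < runs.length; i++) {
    if (/^[ \t\r\n]+$/.test(runs[i])) {
      count++;
    }
  }
  return count;
}

var multiLine = "<document>\n<data>1</data>\n<data>2</data>\n<data>3</data>\n</document>";
var compact = "<document><data>1</data><data>2</data><data>3</data></document>";
var spaced = "<document><data>1</data> <data>2</data> <data>3</data></document>";

console.log(countWhiteSpaceOnlyRuns(multiLine)); // 4
console.log(countWhiteSpaceOnlyRuns(compact));   // 0
console.log(countWhiteSpaceOnlyRuns(spaced));    // 2
```

Each counted run corresponds to white space that a parser must report to the application, which is why the three documents are not equivalent.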

Document authors can use the xml:space attribute to identify portions of documents where white space is considered important. Style sheets can also use the xml:space attribute as a hook to preserve white space in presentation. However, because many XML applications do not understand the xml:space attribute, its use is considered advisory.

The xml:space attribute accepts two values.

default
This value allows the application to handle white space as necessary. Not including an xml:space attribute produces the same result as using the default value.
preserve
This value instructs the application to maintain white space as is, suggesting that it might have meaning.

The value of an xml:space attribute applies to the element that contains the attribute and to all of its descendants, unless a descendant element overrides it with its own xml:space attribute.

For example, the following documents specify the same white space behavior.

<poem xml:space="default">
<author>
<givenName>Alix</givenName>
<familyName>Krakowski</familyName>
</author>
<verse xml:space="preserve">
<line>Roses   are  red,</line>
<line>Violets  are  blue.</line>
<signature xml:space="default">-Alix</signature>
</verse>
</poem>

and:

<poem xml:space="default">
<author xml:space="default">
<givenName xml:space="default">Alix</givenName>
<familyName xml:space="default">Krakowski</familyName>
</author>
<verse xml:space="preserve">
<line xml:space="preserve">Roses   are  red,</line>
<line xml:space="preserve">Violets  are  blue.</line>
<signature xml:space="default">-Alix</signature>
</verse>
</poem>

In both examples, the application is notified that all of the white space in the lines of the poem must be preserved, but that white space in other parts of the document can be handled as necessary.
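The inheritance rule can be sketched as a walk from an element up through its ancestors until an xml:space value is found. The sketch below uses a hypothetical plain-object tree rather than the MSXML DOM, and the names makeNode and effectiveXmlSpace are assumptions made for illustration only.

```javascript
// Hypothetical plain-object tree, not the MSXML DOM: each node has a name,
// an optional xmlSpace value, and a parent reference (null for the root).
function makeNode(name, xmlSpace, parent) {
  return { name: name, xmlSpace: xmlSpace, parent: parent };
}

// Walks up from the node to the root and returns the nearest declared
// xml:space value, or "default" when no ancestor declares one.
function effectiveXmlSpace(node) {
  for (var n = node; n !== null; n = n.parent) {
    if (n.xmlSpace === "preserve" || n.xmlSpace === "default") {
      return n.xmlSpace;
    }
  }
  return "default";
}

// Mirrors the first poem document: only poem, verse, and signature
// carry an explicit xml:space attribute.
var poem = makeNode("poem", "default", null);
var author = makeNode("author", undefined, poem);
var givenName = makeNode("givenName", undefined, author);
var verse = makeNode("verse", "preserve", poem);
var line = makeNode("line", undefined, verse);
var signature = makeNode("signature", "default", verse);

console.log(effectiveXmlSpace(line));      // preserve
console.log(effectiveXmlSpace(signature)); // default
console.log(effectiveXmlSpace(givenName)); // default
```

Applying the same function to a tree built from the second poem document gives the same results, because every explicit attribute there repeats the inherited value.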

Like its language-indicating counterpart, xml:lang, the xml:space attribute must be declared in a document type definition (DTD) if used in a validating environment. The xml namespace prefix does not need to be declared because it is reserved by the XML specification.

By default, Microsoft XML Core Services (MSXML) 5.0 for Microsoft Office does not honor the xml:space attribute. If an application must honor the xml:space attribute, the preserveWhiteSpace property of the DOMDocument object must be set to True prior to parsing.

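// Create the DOM document object.
var xmldoc = new ActiveXObject("Msxml2.DOMDocument.5.0");
// Set preserveWhiteSpace before loading so it applies during parsing.
xmldoc.preserveWhiteSpace = true;
// url is the location of the XML document to load.
xmldoc.load(url);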

MSXML also provides settings that let you delegate application white space handling to the parser. For more information, see White Space and the DOM.

Note   Preserving white space information can significantly increase the size of Document Object Model (DOM) trees because of the overhead involved in preserving white space nodes between elements.

White Space in Attributes

Although XML processors preserve all white space in element content, they normalize it in attribute values. Each tab, carriage return, line feed, and space character is reported as a single space. In certain types of attributes, processors also trim white space that comes before or after the main body of the value and reduce runs of white space within the value to single spaces. (If a DTD is available, this trimming is performed on all attributes that are not of type CDATA.)

For example, an XML document might contain the following:

<whiteSpaceLoss note1="this is a note." note2="this
is
a
note.">

An XML parser reports both attribute values as "this is a note.", converting the line breaks to single spaces.

If there is a DTD for the document, attributes that are declared to be of types other than CDATA have spaces removed from the beginning and end of the attribute value; all white space clusters inside the value are replaced with single spaces. If there is no DTD, the parser assumes that all attributes are of type CDATA.
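The two levels of normalization can be sketched as a single function. This is a simplified illustration under stated assumptions, not the parser's actual implementation: the name normalizeAttributeValue and its isCdata parameter are hypothetical, and the sketch does not handle character or entity references, which a real parser must also process.

```javascript
// Simplified sketch of attribute-value normalization.
// isCdata is true for CDATA attributes, including the no-DTD case.
function normalizeAttributeValue(raw, isCdata) {
  // End-of-line handling first: CRLF and lone CR become LF.
  var value = raw.replace(/\r\n?/g, "\n");
  // Each remaining tab, line feed, or space becomes one space.
  value = value.replace(/[ \t\n]/g, " ");
  if (!isCdata) {
    // Non-CDATA types: trim both ends and collapse runs of spaces.
    value = value.replace(/^ +| +$/g, "").replace(/ {2,}/g, " ");
  }
  return value;
}

console.log(normalizeAttributeValue("this\nis\na\nnote.", true)); // this is a note.
console.log(normalizeAttributeValue("  a \t b  ", true).length);  // 9
console.log(normalizeAttributeValue("  a \t b  ", false));        // a b
```

For a CDATA attribute the length of the value is unchanged, because each white space character is replaced one for one. Only the non-CDATA case shortens the value.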

End of Line Handling

XML processors treat the two-character sequence carriage return-line feed (CRLF) in the same way as a single CR or a single LF character: each is reported to the application as a single LF character. When saving a document, an application can write whichever line-ending convention is appropriate for its platform.
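The end-of-line rule, and the reverse step an application might take when saving, can be sketched as follows. Both function names are hypothetical and the code is an illustration of the rule rather than MSXML's own implementation.

```javascript
// Sketch of the end-of-line rule: CRLF and lone CR both become LF.
function normalizeLineEndings(text) {
  return text.replace(/\r\n?/g, "\n");
}

// Hypothetical helper for saving: normalizes first, then converts each
// LF to the chosen convention (for example "\r\n").
function withLineEnding(text, eol) {
  return normalizeLineEndings(text).replace(/\n/g, eol);
}

console.log(JSON.stringify(normalizeLineEndings("a\r\nb\rc\nd"))); // "a\nb\nc\nd"
console.log(JSON.stringify(withLineEnding("a\nb", "\r\n")));       // "a\r\nb"
```

Normalizing before converting keeps the save step from doubling line endings when the input already contains CRLF sequences.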