Generating Well-Formed HTML Using XSLT

Microsoft XML Core Services (MSXML) 5.0 for Microsoft Office - XSLT Developer's Guide

Well-formed HTML is HTML that conforms to the syntax rules of XML. The same HTML elements are available, but they must be written in the stricter XML syntax. For example, <BR> is not well-formed, but <BR/> is. <H1>...</h1> is not well-formed, but <H1>...</H1> and <h1>...</h1> are. Because an XSLT style sheet is itself an XML document, any HTML within it must be well-formed. The following are some basic rules to follow as you write or convert to well-formed HTML.
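One quick way to see whether a fragment is well-formed is to hand it to any XML parser. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree module rather than MSXML. The helper name is_well_formed and the <P> wrapper around the <BR> samples are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup as a document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Samples based on the tags discussed above.
samples = [
    "<P>line one<BR>line two</P>",   # <BR> is never closed
    "<P>line one<BR/>line two</P>",  # empty-element syntax
    "<H1>Title</h1>",                # end tag differs in case
    "<H1>Title</H1>",                # start and end tags match
]
results = [is_well_formed(s) for s in samples]
for sample, ok in zip(samples, results):
    print(ok, sample)
```

Any conforming XML parser should reject the same two samples, although the wording of the error it reports will differ from parser to parser.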

All tags must be closed

HTML allows the end tags of certain elements, such as <P>, <LI>, <TR>, and <TD>, to be omitted. XML requires every element to be closed explicitly.

HTML:

    <P>This is an HTML paragraph.
    <P>or two.

Well-formed HTML:

    <P>This is an HTML paragraph.</P>
    <P>or two.</P>

Elements that have no content (leaf nodes) must also be closed, by placing a forward slash (/) at the end of the tag: <BR/>, <HR/>, <INPUT/>, and <IMG/>.

HTML:

    <IMG src="sample.gif"
         width="10" height="20">

Well-formed HTML:

    <IMG src="sample.gif"
         width="10" height="20"/>
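When markup is generated by an XML library instead of typed by hand, the trailing slash is normally written for you. The sketch below assumes Python's standard xml.etree.ElementTree module as a stand-in for whatever serializer is in use. It builds the IMG element from the example and serializes it.

```python
import xml.etree.ElementTree as ET

# Build the IMG element from the example above.
img = ET.Element("IMG", {"src": "sample.gif", "width": "10", "height": "20"})

# An element with no children and no text is written in empty-element form.
markup = ET.tostring(img, encoding="unicode")
print(markup)

# The serialized form can be parsed again without error.
reparsed = ET.fromstring(markup)
print(reparsed.attrib)
```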

No overlapping tags

XML does not allow start and end tags to overlap. Instead, it enforces a strict hierarchy within the document: an element opened inside another element must be closed before its parent is closed.

HTML:

    <B>Bold <I>Bold and Italic</B> Italic</I>

Well-formed HTML:

    <B>Bold</B> <I><B>Bold and Italic</B> Italic</I>
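An XML parser reports overlapping tags at the first end tag that does not match the innermost open element. The sketch below, again assuming Python's xml.etree.ElementTree rather than MSXML, returns the location of that first error. The helper name first_error and the <P> wrapper (added so that each sample has a single root) are illustrative.

```python
import xml.etree.ElementTree as ET

overlapping = "<P><B>Bold <I>Bold and Italic</B> Italic</I></P>"
nested = "<P><B>Bold</B> <I><B>Bold and Italic</B> Italic</I></P>"


def first_error(markup):
    """Return (line, column) of the first parse error, or None if well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return exc.position
    return None


print(first_error(overlapping))  # a (line, column) tuple
print(first_error(nested))       # None
```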

Case matters

XML is case-sensitive, so the case of an end tag must match the case of its start tag exactly. Choose a consistent case for start and end tags. The examples in this SDK generally use uppercase for HTML elements.

HTML:

    <B><i>Hello!</I></b>

Well-formed HTML:

    <B><I>Hello!</I></B>
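Case sensitivity affects not only parsing but also any later lookup by element name, which is one more reason to pick a single case and keep it. The following sketch assumes Python's xml.etree.ElementTree for illustration. Element names in XSLT match patterns are compared case-sensitively in the same way.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<B><I>Hello!</I></B>")

upper = root.find("I")  # matches the element as it was written
lower = root.find("i")  # no match: "i" and "I" are different names

print(upper.text if upper is not None else None)
print(lower)

# Mixed-case start and end tags are rejected outright.
try:
    ET.fromstring("<B><i>Hello!</I></b>")
    mixed_ok = True
except ET.ParseError:
    mixed_ok = False
print(mixed_ok)
```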

Quote your attributes

All attribute values must be enclosed in quotation marks, either single or double.

HTML:

    <IMG src=sample.gif
         width=10 height=20 >

Well-formed HTML:

    <IMG src='sample.gif'
         width="10" height="20" />
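If tags are assembled from strings in code, a quoting helper avoids both missing quotation marks and values that themselves contain a quotation mark. The sketch below uses quoteattr from Python's standard xml.sax.saxutils module. The function empty_tag and the ruler.gif sample are hypothetical and exist only for this example.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET


def empty_tag(name, attrs):
    """Build an empty-element tag whose attribute values are all quoted."""
    parts = [name] + [f"{key}={quoteattr(str(value))}" for key, value in attrs.items()]
    return "<" + " ".join(parts) + "/>"


img = empty_tag("IMG", {"src": "sample.gif", "width": 10, "height": 20})
print(img)

# A value that contains a double quote is wrapped in single quotes instead.
ruler = empty_tag("IMG", {"src": "ruler.gif", "alt": '6" ruler'})
print(ruler)

# The result is accepted by an XML parser, and the value survives intact.
alt_text = ET.fromstring(ruler).get("alt")
print(alt_text)
```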

Use a single root

Shortcuts that eliminate the <HTML> element as the single top-level element are not allowed.

HTML:

    <TITLE>Shortcut markup</TITLE>
    <BODY>
    <P>Amazing that this HTML works.</P>
    </BODY>

Well-formed HTML:

    <HTML>
    <HEAD>
    <TITLE>Clean markup</TITLE>
    </HEAD>
    <BODY>
    <P>Not nearly so amazing that
    this well-formed HTML works.</P>
    </BODY>
    </HTML>
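The single-root rule is easy to confirm with a parser: a second top-level element is rejected as content after the end of the document. This sketch assumes Python's xml.etree.ElementTree and uses condensed versions of the two samples above.

```python
import xml.etree.ElementTree as ET

shortcut = (
    "<TITLE>Shortcut markup</TITLE>"
    "<BODY><P>Amazing that this HTML works.</P></BODY>"
)
clean = (
    "<HTML><HEAD><TITLE>Clean markup</TITLE></HEAD>"
    "<BODY><P>Not nearly so amazing.</P></BODY></HTML>"
)

# Two top-level elements: the parser stops at the second one.
try:
    ET.fromstring(shortcut)
    shortcut_ok = True
except ET.ParseError as exc:
    shortcut_ok = False
    print("rejected:", exc)

# One top-level element that contains everything else.
root = ET.fromstring(clean)
top_level = [child.tag for child in root]
print(root.tag, top_level)
```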

Use fewer named entities

XML defines only a minimal set of built-in named entities. These are as follows:

  • &lt; — (<)
  • &gt; — (>)
  • &amp; — (&)
  • &quot; — (")
  • &apos; — (')

Therefore, you should avoid using other named HTML entities. When in doubt, use the numeric character reference for the character you need. For example, for a non-breaking space, use &#160; or &#xA0; instead of &nbsp;. For an em dash, use &#8212; instead of &mdash;. Also note that Internet Explorer treats the numeric character reference &#151; as equivalent to the named entity &mdash;, but MSXML does not resolve &#151; to &#8212;.
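When converting existing HTML, the named entities can be rewritten mechanically. The following is a rough sketch, assuming Python's standard html.entities table as the source of code points. The function name to_numeric_refs is illustrative. It leaves the five XML built-ins and any unrecognized names untouched.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

XML_BUILT_INS = {"lt", "gt", "amp", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(text):
    """Replace named HTML entities with decimal character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_BUILT_INS or name not in name2codepoint:
            return match.group(0)  # leave built-ins and unknown names alone
        return f"&#{name2codepoint[name]};"
    return ENTITY.sub(replace, text)


source = "one&nbsp;two &mdash; three &amp; four"
converted = to_numeric_refs(source)
print(converted)

# The converted text parses; the original fails on the undefined entity.
ET.fromstring("<P>" + converted + "</P>")
try:
    ET.fromstring("<P>" + source + "</P>")
    source_ok = True
except ET.ParseError:
    source_ok = False
print(source_ok)
```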

Escape script blocks

Script blocks in HTML can contain characters that an XML parser treats as markup, namely < and &. In well-formed HTML these must be escaped, either by using the entity references &lt; and &amp; or by enclosing the script block in a CDATA section.

In addition, single-line comments in Microsoft® JScript® (compatible with the ECMA 262 language specification) terminate at the end of the line, so it is important to preserve the white space within script blocks that contain comments. By default, the xml:space attribute value normalizes white space by compressing adjacent white space characters into a single space. This destroys the newline that terminates the JScript comment, so any JScript following the comment is treated as part of the comment and ignored, often resulting in script errors. Enclosing the script in a CDATA section also ensures that the white space is preserved.

The following HTML script block contains both an unparsable character (<) and JScript comments. The well-formed script block uses CDATA to encapsulate the script.

HTML:

    <SCRIPT>
    // checks a number against 7
    function lessThanSeven(n) {
      return n < 7;
    }
    </SCRIPT>

Well-formed HTML:

    <SCRIPT><![CDATA[
    // checks a number against 7
    function lessThanSeven(n) {
      return n < 7;
    }
    ]]></SCRIPT>
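The effect of the CDATA section can be checked with a parser: the script text comes back unchanged, including the < character and the line breaks that end each comment. The sketch below assumes Python's xml.etree.ElementTree instead of MSXML. The function is named lessThanSeven here because hyphens are not valid in JScript identifiers.

```python
import xml.etree.ElementTree as ET

script = """<SCRIPT><![CDATA[
// checks a number against 7
function lessThanSeven(n) {
  return n < 7;
}
]]></SCRIPT>"""

# With CDATA, the parser returns the script body exactly as written.
root = ET.fromstring(script)
body = root.text
print(body)

# Without CDATA, the bare "<" is read as the start of a tag and rejected.
bare = script.replace("<![CDATA[", "").replace("]]>", "")
try:
    ET.fromstring(bare)
    bare_ok = True
except ET.ParseError:
    bare_ok = False
print(bare_ok)
```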

Not all scripts will fail if they are not escaped in this way. However, it is highly recommended that you escape them as a matter of habit. This ensures not only that the script works if it currently contains comments or characters that require escaping, but also that it continues to work if such characters are added in the future.