Extract Data From a Large Document

MSXML 5.0 SDK

Microsoft XML Core Services (MSXML) 5.0 for Microsoft Office - SAX2 Developer's Guide

This example demonstrates how to create an XML data-extraction application in Microsoft® Visual Basic® using MSXML 5.0. The XML Extractor application uses SAX to parse a large XML file and extracts data from the file to generate multiple DOM documents. The DOM documents are then processed to generate HTML output files. The primary purpose of the application is to show how SAX and DOM can be used together to process XML efficiently. The design relies on the respective strengths of each: SAX for reading large input as a stream, and DOM for working with small, complete trees.

For input, we will use a single XML file, invoices.xml. This file consists of many similar XML trees. It has a document root, <invoices>, which can contain one or more instances of the <invoice> element. Each occurrence of <invoice> contains the data to create an invoice for a patient, billing them for their medical expenses.

For output, the goal of the XML Extractor application is to extract the data for each patient invoice as a separate document, which is then used to create the patient's bill. The bill is produced by applying an XSLT style sheet file, invoice.xsl, to that document. The result is an HTML-formatted patient invoice. Each invoice is saved to its own new file, and these files can later be printed and mailed to the patients.

Each invoice is relatively small, so it can easily be processed using only the DOM and XSLT. However, the entire XML document contains numerous invoices and might be too large to load into the DOM. In addition, each bill must be processed separately to generate its own HTML file to print and send.

The following are some of the advantages of the design for the XML Extractor application.

  • It uses SAX to read the large XML document as a stream and builds a DOM tree only for the current <invoice> element. The full document is never held in memory at once, so memory use depends on the size of a single invoice rather than the size of the file.
  • For each <invoice> element, the Extractor generates a small DOM document, which it immediately processes to generate the output. This document is discarded before the next <invoice> element is read, and the process repeats, so memory is reused from one invoice to the next.
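The pattern described in the list above can be sketched independently of MSXML. The following is a minimal illustration in Python, using only the standard xml.sax and xml.dom.minidom modules; it is not the Visual Basic code of the XML Extractor application and does not use the MSXML API. The class name InvoiceExtractor, the on_invoice callback, and the sample data (including the <patient> and <total> elements and the id attribute) are assumptions made for this illustration. The XSLT step is left out, and the callback stands in for whatever processing is applied to each small document.

```python
import io
import xml.sax
from xml.dom import minidom


class InvoiceExtractor(xml.sax.ContentHandler):
    """Builds one small DOM document per <invoice> element reported by SAX."""

    def __init__(self, on_invoice):
        super().__init__()
        self.on_invoice = on_invoice  # hypothetical callback, called once per invoice
        self.doc = None               # DOM document for the invoice in progress
        self.stack = []               # open DOM elements inside the current invoice

    def startElement(self, name, attrs):
        if self.doc is None:
            if name != "invoice":
                return                # outside any invoice, e.g. the <invoices> root
            self.doc = minidom.Document()
            parent = self.doc
        else:
            parent = self.stack[-1]
        node = self.doc.createElement(name)
        for key, value in attrs.items():
            node.setAttribute(key, value)
        parent.appendChild(node)
        self.stack.append(node)

    def characters(self, content):
        # Text between invoices (an empty stack) is ignored.
        if self.stack:
            self.stack[-1].appendChild(self.doc.createTextNode(content))

    def endElement(self, name):
        if not self.stack:
            return                    # closing tag of the document root
        self.stack.pop()
        if not self.stack:            # the <invoice> element has just closed
            self.doc.normalize()      # merge text that SAX delivered in pieces
            self.on_invoice(self.doc)
            self.doc.unlink()         # discard the tree before the next invoice
            self.doc = None


# Made-up sample input; the element and attribute names are assumptions.
SAMPLE = b"""<invoices>
  <invoice id="1"><patient>A. Example</patient><total>120.00</total></invoice>
  <invoice id="2"><patient>B. Example</patient><total>75.50</total></invoice>
</invoices>"""

summaries = []


def summarize(doc):
    """Stand-in for the per-invoice processing step (XSLT in the real sample)."""
    root = doc.documentElement
    patient = root.getElementsByTagName("patient")[0].firstChild.data
    summaries.append((root.getAttribute("id"), patient))


xml.sax.parse(io.BytesIO(SAMPLE), InvoiceExtractor(summarize))
```

In this sketch the callback must finish with each document before it returns, because the tree is released immediately afterward. That restriction is what keeps only one invoice in memory at a time.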

This topic is divided into the following sections.