Data Preparation Area

Creating and Using Data Warehouses

Creating and Using Data Warehouses

Data Preparation Area

Data to be used in the data warehouse must be extracted from the data sources, cleansed and formatted for consistency, and transformed into the data warehouse schema. The data preparation area, sometimes called the data staging area, is a relational database into which data is extracted from the data sources, transformed into common formats, checked for consistency and referential integrity, and made ready for loading into the data warehouse database. The data preparation area and the data warehouse database can be combined in some data warehouse implementations as long as the cleansing and transformation operations do not interfere with the performance or operation of serving the end users of the data warehouse data. Performing the preparation operations in source databases is rarely an option because of the diversity of data sources and the processing load that data preparation can impose on online transaction processing systems. The relational database used for data preparation, regardless of where it is performed, must have powerful data manipulation and transformation capabilities such as those provided by Microsoft® SQL Server™ 2000.

After the initial load of a data warehouse, the data preparation area is used in an ongoing basis to prepare new data for updating the data warehouse. In most data warehouse systems, these ongoing operations are performed on a periodic basis, often scheduled to minimize performance impact on the operational data source systems.

The use of a data preparation area that is separated from the data sources and the data warehouse promotes effective data warehouse management. Attempting to transform data in the data source systems can interfere with online transaction processing (OLTP) performance, and many legacy systems do not have effective or easily implemented transformation capabilities. Reconciliation of inconsistencies in data extracted from various sources can rarely be accomplished until the data is collected in a common database, at which time data integrity errors can more easily be identified and rectified.

The data preparation area should isolate raw data from the data warehouse data to preserve the integrity of the data warehouse and permit it to perform its primary function of preparing information for presentation and supporting access by clients. If the data warehouse database is used for data preparation, care should be taken to avoid introducing errors into the data warehouse data and to minimize the effect of data preparation processing on the performance of the data warehouse. Many data warehouse database operations require sophisticated queries and the processing of large amounts of data; data cleansing can interfere with these operations.

The data preparation area is a relational database that serves as a general work area for the data preparation operations. It will contain tables that relate source data keys to surrogate keys used in the data warehouse, tables of transformation data, and many temporary tables. It will also contain the processes and procedures, such as Data Transformation Services (DTS) packages, that extract data from source data systems.

See Also

Creating and Maintaining Databases Overview

Accessing and Changing Relational Data Overview