Cleansing and Transforming Data

Creating and Using Data Warehouses

Creating and Using Data Warehouses

Cleansing and Transforming Data

You can accomplish many data transformations during the process of extracting data from the source systems. However, there are often additional tasks to complete before you can load data into the data warehouse. For example, you must reconcile inconsistent data from heterogeneous data sources after extraction and complete other formatting and cleansing tasks. You should also wait until after the extraction process to incorporate surrogate keys. Some transformations that you might technically accomplish during the extraction process may interfere with the performance or operation of the online source system; you should defer these tasks until after extraction is complete.

After extraction from the source systems, the data should reside in a data preparation area where the cleansing and transformations can be completed before the data is loaded into the data warehouse. The data preparation area can be a separate database or separate tables in the data warehouse database. During the cleansing and transformation phase, you can execute procedures to validate and verify data consistency, transform data into common formats, and incorporate surrogate keys.

You may need to perform manual operations to reconcile data inconsistencies or to resolve ambiguous text field entries. Each time a manual operation is required, you should try to identify a way to eliminate the manual step in future data transformation operations. In some cases, you may be able to modify the source data systems to eliminate the cause at the source. In other cases, you may be able to establish an automated process that will set aside unresolved data for later manual exception processing so the bulk of the data can be loaded into the data warehouse without delay for manual intervention.

Some typical data transformations include:

  • Combining multiple name fields into one field.

  • Breaking down date fields into separate year, month, and day fields.

  • Mapping data from one representation to another, such as TRUE to 1 and FALSE to 0 or postal codes from numeric to text.

  • Mapping data from multiple representations to a single representation, such as a common format for telephone numbers, or different credit rating codes to a common "Good, Average, Poor" representation.

  • Creating and applying surrogate keys for dimension table records.

Some of the tools available in Microsoft® SQL Server™ 2000 for transforming data are:

  • Transact-SQL queries

  • DTS packages

  • Command line applications

  • ActiveX scripts

See Also

Accessing and Changing Relational Data Overview

DTS Overview

Automating Administrative Tasks