Loading Data into the Data Warehouse Database

Creating and Using Data Warehouses

Creating and Using Data Warehouses

Loading Data into the Data Warehouse Database

After the data has been cleansed and transformed into a structure consistent with the data warehouse requirements, data is ready for loading into the data warehouse. You may make some final transformation during the loading operation, although you should complete any transformations that could identify inconsistencies before the final loading operation.

The initial load of the data warehouse consists of populating the tables in the data warehouse schema and then verifying that the data is ready for use. You can use various methods to load the data warehouse tables, such as:

  • Transact-SQL

  • DTS

  • bcp utility

When you load data into the data warehouse, you are populating the tables that will be used by the presentation applications that make the data available to users. Loading data often involves the transfer of large amounts of data from source operational systems, a data preparation area database, or preparation area tables in the data warehouse database. Such operations can impose significant processing loads on the databases involved and should be accomplished during a period of relatively low system use.

After the data has been loaded into the data warehouse database, verify the referential integrity between dimension and fact tables to ensure that all records relate to appropriate records in other tables. You should verify that every record in a fact table relates to a record in each dimension table that will be used with that fact table. For example, if a fact table of product sales is to be used with dimension tables for customers, products, time, and stores, then for each sale record in the fact table there must be a record in each dimension table that relates to the sale record through correspondence of primary keys. This verification ensures that for every sale, the customer who made the purchase, the product sold, and the time and location of the sale are identified.

Data integrity in the reverse order is not necessary. That is, it is not necessary for every record in a dimension table to relate to a record in the fact table. For example, dimensions in a sales data warehouse typically are shared dimensions, which contain the full sets of customers, products, stores, and so on. A fact table may contain sales records for a specific time period during which some customers did not make any purchases and some products were not sold.

Most queries that retrieve data from the data warehouse use inner joins between the fact and dimension tables. Such queries will ignore facts for which at least one of the joined dimension tables does not contain a matching record, causing retrieved data to be inaccurate and possibly inconsistent among different queries. For example, if a customer record is missing for a particular sales fact record, any query that includes the customer dimension table will ignore the sales fact record, but any query that does not include the customer dimension table will contain the sales fact record. A query that computes the sum of sale amounts by customer will yield a different grand total than a query that computes the sum of sale amounts by product, because the first query ignores the sale for which there is no customer and the second query includes it.

If you use a dimension table containing data that does not apply to all facts, you must include a record in the dimension table that can be used to relate to the remaining facts. For example, in a table of sales promotion records, you can include a generic record that you can use to relate to any sales fact for which there is no applicable sales promotion. Without this generic promotion record any query that joins the promotion table to the sales fact table will not include sales for which there is no corresponding promotion.

To verify referential integrity in a star schema you can use a simple SQL query that counts the rows returned when all appropriate dimension tables are joined to the fact table using inner joins. The number of rows returned by this query should match the number of rows in the fact table. If you are using a snowflake schema, you should also verify referential integrity between dimension tables and the subordinate tables to which they are linked to verify that no records in any table are eliminated by inner joins to subordinate tables. You should perform this verification by starting with the tables at the lowest level of the snowflake dimension and joining them to the tables at the next higher level, continuing until the primary dimension table has been verified. This is an important step because there can be situations in which the dimension may verify correctly against the current fact table, even though some dimension records are missing; these records will be needed when new facts are added.

See Also

Accessing and Changing Relational Data Overview