Statistical Information

Creating and Maintaining Databases

Creating and Maintaining Databases

Statistical Information

Microsoft® SQL Server™ 2000 allows statistical information regarding the distribution of values in a column to be created. This statistical information can be used by the query processor to determine the optimal strategy for evaluating a query. When you create an index, SQL Server automatically stores statistical information regarding the distribution of values in the indexed column(s). The query optimizer in SQL Server uses these statistics to estimate the cost of using the index for a query. Additionally, when the AUTO_CREATE_STATISTICS database option is set to ON (default), SQL Server automatically creates statistics for columns without indexes that are used in a predicate.

As the data in a column changes, index and column statistics can become out-of-date and cause the query optimizer to make less-than-optimal decisions on how to process a query. For example, if you create a table with an indexed column and 1,000 rows of data, all with unique values in the indexed column, the query optimizer considers the indexed column a good way to collect the data for a query. If you update the data in the column so there are many duplicated values, the column is no longer an ideal candidate for use in a query. However, the query optimizer still considers it to be a good candidate based on the index's outdated distribution statistics, which are based on the data before the update.

Note  Out-of-date or missing statistics are indicated as warnings (table name in red text) when the execution plan of a query is graphically displayed using SQL Query Analyzer. For more information, see Graphically Displaying the Execution Plan Using SQL Query Analyzer. Additionally, monitoring the Missing Column Statistics event class using SQL Profiler indicates when statistics are missing. For more information, see Errors and Warnings Event Category.

Therefore, SQL Server automatically updates this statistical information periodically as the data in the tables changes. The sampling is random across data pages, and taken from the table or the smallest nonclustered index on the columns needed by the statistics. After a data page has been read from disk, all the rows on the data page are used to update the statistical information. The frequency at which the statistical information is updated is determined by the volume of data in the column or index and the amount of changing data. For example, the statistics for a table containing 10,000 rows may need updating when 1,000 index values have changed because 1,000 values may represent a significant percentage of the table. However, for a table containing 10 million index entries, 1,000 changing index values is less significant, and so the statistics may not be automatically updated. SQL Server, however, always ensures that a minimum number of rows are sampled; tables that are smaller than 8 megabytes (MB) are always fully scanned to gather statistics.

The cost of this automatic statistical update is minimized by sampling the data, rather than analyzing all of it. Under some circumstances, statistical sampling will not be able to accurately characterize the data in a table. You can control the amount of data that is sampled during manual statistics updates on a table-by-table basis by using the SAMPLE and FULLSCAN clauses of the UPDATE STATISTICS statement. The FULLSCAN clause specifies that all of the data in the table is scanned to gather statistics, whereas the SAMPLE clause can be used to specify either the percentage of rows to sample or the number of rows to sample.

You can also tell SQL Server not to maintain statistics for a given column or index in these ways:

  • Use the sp_autostats system stored procedure.

  • Use the STATISTICS_NORECOMPUTE clause of the CREATE INDEX statement.

  • Use the NORECOMPUTE clause of the UPDATE STATISTICS statement.

  • Use the NORECOMPUTE clause of the CREATE STATISTICS statement.

  • Set the AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS database options to OFF using the ALTER DATABASE statement. For more information, see Setting Database Options.

If you instruct SQL Server not to maintain statistics automatically, you must manually update the statistical information.

Statistics can also be created on all eligible columns in all user tables in the current database in a single statement by using the sp_createstats system stored procedure. Columns not eligible for statistics include nondeterministic or nonprecise computed columns, or columns of image, text, and ntext data types.

The statistics generated for a column can be deleted if you no longer want to retain and maintain them. Statistics created on columns by SQL Server (when the AUTO_CREATE_STATISTICS database option is set to ON) are aged and dropped automatically.

Creating statistics manually allows you to create statistics that contain multiple column densities (average number of duplicates for the combination of columns). For example, a query contains the clause:

WHERE a = 7 and b = 9

Creating manual statistics on both columns together (a, b) can allow SQL Server to make a better estimate for the query because the statistics also contain the average number of distinct values for the combination of columns a and b.

To create statistics on a column

Transact-SQL