Distribution Statistics Requirements for OLE DB Providers
Microsoft® SQL Server™ 2000 defines extensions to the OLE DB specification that allow OLE DB providers to report statistics on numbers of rows and ranges of key values in the data they provide. SQL Server can use this information to increase the performance of distributed queries.
SQL is a nonprocedural language. SQL statements do not specify the steps needed to accomplish the result you want. The statements define the format of the result set and the conditions rows in base tables must meet to be used in building the result set. The database engine must analyze each SQL statement and determine the most efficient way to access the base tables. The part of the database engine that performs this task is called the optimizer.
The results of the optimization process is improved if the optimizer has access to statistics describing the distribution of the values in base table columns referenced in WHERE clause predicates. The distribution statistics used by the optimizer include:
- The number of rows in a table, also called the cardinality of the table.
- The number of distinct values stored in a column, also called the cardinality of the column.
- Information about how the distinct values in a column are distributed across the rows of the table.
To improve the optimization of distributed queries, SQL Server defines extensions to the OLE DB specification that OLE DB providers can use to report distribution statistics on the rowsets, or tables, they expose. While these extensions are defined in the SQL Server documentation, individual OLE DB provider developers must code support for the extensions in their providers if they want to make the information available to SQL Server. If a provider has code that supports the extensions, SQL Server can use the extensions to optimize the performance of distributed queries. If a provider does not support the extensions, SQL Server uses simple estimates of the distribution statistics.
Note The Microsoft OLE DB Provider for SQL Server and the Microsoft OLE DB Provider for Oracle support distribution statistics.
The distribution statistics extensions are built around a unit called a statistic. Each table can have zero or more statistics, and each statistic reports data for one or more columns. A statistic records:
- The cardinality of the values, or the number of unique values, in each individual column covered by the statistic.
- The cardinality of the concatenated values of all the columns covered by the statistic.
- Optionally, a histogram reporting information about different ranges of key values in the first column covered by the statistic. The values reported can include the number of rows in each key range, the number of unique values in each key range, or the number of rows in the table whose key values are less than or equal to the highest key value in the range.
Here is an example table.
ColumnA | ColumnB |
'abc' | 'xyz' |
'abc' | 'xyz' |
'def' | 'xyz' |
'mno' | 'xyz' |
'mno' | 'mmm' |
'tuv' | 'xyz' |
If a statistic covers ColumnA and ColumnB, the cardinality of the combined values of the two columns is 5 because the first two rows have the same value ('abc' + 'xyz') for the combination of ColumnA and ColumnB. The cardinality of ColumnA is 4 and the cardinality of ColumnB is 2. A simple, 4-step histogram on ColumnA could report.
Value range | Percentage of table rows in the range |
'aaa' to 'hzz' | 50% |
'iaa' to 'nzz' | 33% |
'oaa' to 'rzz' | 00% |
'taa' to 'zzz' | 17% |
Different OLE DB data sources record distribution statistics on different combinations of columns, and the set of statistics reported by an OLE DB provider is implementation defined. For example, SQL Server versions 6.5 or earlier build distribution statistics only for columns covered by indexes and have one statistic for each index defined on a table. SQL Server version 7.0 and later builds these statistics:
- One statistic for each index defined on a table.
- One statistic for each CREATE STATISTIC statement.
- One statistic for each statistic generated automatically. For more information, see Statistical Information.
A column has a high degree of selectivity if it is likely to return a small number of rows for a given value specified in a predicate argument. The distribution statistics can be used to estimate the degree of selectivity:
- Columns with high cardinality have more data values, and each data value is likely to match a smaller number of rows than a column with low cardinality.
- If an OLE DB provider provides a histogram reporting how the values are distributed in a column, the SQL Server optimizer can also estimate if the specific value in a predicate argument is in a range that has good or poor selectivity.
Having good distribution statistics for a linked server can also help the optimizer build an efficient execution plan for the local part of a distributed query.
The SQL Server optimizer uses the distribution statistics in an attempt to reduce the amount of data that must be communicated between the OLE DB provider and SQL Server. For example, when performing a distributed join between TableA on the local server and TableB on a linked server, SQL Server can use the distribution statistics to determine which of these processes is most efficient:
- Send the rows from TableA that match non-join predicates to the linked server and have the linked server perform the join.
- Retrieve the rows from TableB that match non-join predicates to the local server and perform the join on the local server.
If an OLE DB provider does not report cardinality information about a column, the SQL Server optimizer estimates a low cardinality. If a provider does not report a distribution histogram for a statistic, the optimizer operates as if the values are evenly distributed in the rows of the table.
SQL Server uses the following extensions from OLE DB providers to report distribution statistics:
- A new data source property, DBPROP_TABLESTATISTICS, indicates if the provider reports distribution statistics.
- A new IDBSchemaRowset, TABLE_STATISTICS, lists the statistics available for a given base table, including column and row cardinality.
- IOpenRowset::OpenRowset is enhanced to accept new arguments identifying a statistic. When a statistic is specified, OpenRowset returns a histogram rowset showing the distribution of values in the first column covered by the statistic specified in StatisticID.
These extensions to OLE DB are included in OLE DB version 2.6. For information about these extensions regarding distribution statistics, see the OLE DB 2.6 specification.
An OLE DB provider can choose to implement a performance enhancement of sampling only a part of the rows in a base table to determine the distribution statistics and histograms. These providers should scale their cardinality and histogram data to reflect the total values for the table before reporting them in the TABLE_STATISTICS and histogram rowsets.
Whether or not an OLE DB provider keeps the data in the TABLE_STATISTICS and the histogram rowset up-to-date with the current contents of the base table is implementation defined.