Introduction to Data Mining Models

Analysis Services

A data mining model is the central object in data mining, one of the new features of Microsoft® SQL Server™ 2000 Analysis Services. A data mining model is a virtual structure that represents the grouping and predictive analysis of relational or multidimensional data. In many respects, the structure of a data mining model resembles the structure of a database table. However, while a database table represents a collection of records, or a record set, a data mining model represents an interpretation of records, referred to as cases, as rules and patterns composed of statistical information. The structure of the data mining model represents the case set that defines the model, while the data stored in the model represents the rules and patterns learned from processing case data.

To understand what makes up cases and case sets, take for example a database designed to track customer orders. The database may contain a table for customer data, a table for order data, and a table for order items, shown here.

Each row in a given table is a record. For each customer record, there may be one or more order records, each with one or more order item records. Because each order can have many order items, a single customer may be associated with many related records. This collection of related records for a single customer is referred to as a case, and the same collection of related records for a group of customers is referred to as a case set. The order item information is treated as attributes of the customer case.

The case set is simply a way of viewing the physical data; in fact, different case sets can be constructed from the same physical data. The customer case set example is based upon the premise that you want to mine order item information with the customer as the focus. The focus could easily be changed to mining data about the customer with the order item as the focus. The physical data would not change, but a separate data mining model could easily be constructed to reflect the change in focus, with the customer information becoming attributes of the order item case.

Because of the innately hierarchical nature of such information, the data mining model stores the representation of a case set as a collection of data mining columns. Instead of a single data item such as a string or integer, a data mining column can contain a group of data mining columns, each of which can in turn contain either a single data item or another group of columns, and so on. In the customer case example, for each customer case, one row describes the customer. This row contains the customer ID and customer information columns, and a column named Order Items. The Order Items column contains a set of rows. Each row describes an order item that relates to the customer specified in the customer row. The following diagram illustrates the structure of such a case set.
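
The nesting described above can be approximated outside Analysis Services with ordinary data structures. The following Python sketch is only an illustration, not how the product stores a case set: the table contents, the field names (customer_id, age, gender, product, quantity), and the build_case_set helper are all hypothetical. It folds three flat tables into one case per customer, with the order items held in a nested column.

```python
# Hypothetical flat tables standing in for the customer, order,
# and order item tables of the example database.
customers = [
    {"customer_id": 1, "age": 34, "gender": "F"},
    {"customer_id": 2, "age": 51, "gender": "M"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 2},
]
order_items = [
    {"order_id": 10, "product": "Router", "quantity": 1},
    {"order_id": 11, "product": "Cable", "quantity": 3},
    {"order_id": 12, "product": "Modem", "quantity": 1},
]


def build_case_set(customers, orders, order_items):
    """Return one case per customer, with order items as a nested column."""
    # Map each order to the customer who placed it.
    order_owner = {o["order_id"]: o["customer_id"] for o in orders}
    # Start every case with the customer's own attributes and an empty
    # nested "order_items" column.
    cases = {c["customer_id"]: {**c, "order_items": []} for c in customers}
    # Attach each order item to the case of the customer who ordered it.
    for item in order_items:
        owner = order_owner[item["order_id"]]
        cases[owner]["order_items"].append(
            {"product": item["product"], "quantity": item["quantity"]}
        )
    return list(cases.values())


case_set = build_case_set(customers, orders, order_items)
for case in case_set:
    print(case)
```

Grouping the same three tables by order item instead of by customer would produce a different case set from the same physical data, which is the change of focus described earlier.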

In this example, some attributes of the customer, such as age and gender, might be used to further classify and predict the behavior of future customers. One of the most important tasks in data mining is to determine the impact of each of these attributes on classification and prediction.

Training a Data Mining Model

To determine the relative importance of each attribute in a data mining model, the model goes through a process known as mining model training. During training, data is supplied to the model for analysis. The data mining algorithms used by the model then examine the training data set in a variety of ways to draw conclusions about the classification and prediction of the data.

For example, a decision tree mining model uses a process known as recursive partitioning to split the data up into partitions, based on the attributes supplied by the case set. Then, it splits up these newly created partitions into more partitions, and so on until no more useful splits can be performed. The algorithm itself determines what defines a useful split; this varies from technique to technique.

During this process of recursive partitioning, information is gathered from the attributes used to determine the split. If the Age column is used, for example, the model first divides the age values into two groups: those equal to or greater than a certain age, and those less than a certain age. By counting the records in the training set that fall into each category, a probability can be established for that category. As the tree grows in depth with each split, more and more probability information can be gathered about the training data. When a decision tree can no longer split a given category usefully, that node of the tree is referred to as a leaf node. The leaf node contains information about the training data that fit that particular path through the decision tree. The information about the training data in the leaf node is referred to as a distribution, and it is saved as part of the data mining model.
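
As a rough illustration of recursive partitioning, the following Python sketch trains a small tree on a single numeric attribute. It is a minimal sketch under stated assumptions and is not the algorithm that Analysis Services uses: the Gini impurity measure, the min_cases stopping rule, the training rows, and the column names (age, buys) are all chosen here for illustration only.

```python
from collections import Counter


def distribution(rows, target):
    """Probability of each target value among the given rows."""
    counts = Counter(r[target] for r in rows)
    return {label: n / len(rows) for label, n in counts.items()}


def gini(rows, target):
    """Gini impurity: 0.0 when every row has the same target value."""
    return 1.0 - sum(p * p for p in distribution(rows, target).values())


def best_split(rows, attributes, target):
    """Return the (attribute, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(rows, target)
    for attr in attributes:
        for threshold in sorted({r[attr] for r in rows}):
            low = [r for r in rows if r[attr] < threshold]
            high = [r for r in rows if r[attr] >= threshold]
            if not low or not high:
                continue
            # Impurity of the two partitions, weighted by their size.
            score = (len(low) * gini(low, target)
                     + len(high) * gini(high, target)) / len(rows)
            if score < best_score - 1e-12:
                best, best_score = (attr, threshold), score
    return best


def train(rows, attributes, target, min_cases=2):
    """Split recursively; stop when no useful split remains."""
    split = None
    if len(rows) >= min_cases:  # assumed stopping rule for this sketch
        split = best_split(rows, attributes, target)
    if split is None:
        # Leaf node: keep only the distribution, not the training rows.
        return {"leaf": True, "cases": len(rows),
                "distribution": distribution(rows, target)}
    attr, threshold = split
    return {
        "leaf": False, "attribute": attr, "threshold": threshold,
        "low": train([r for r in rows if r[attr] < threshold],
                     attributes, target, min_cases),
        "high": train([r for r in rows if r[attr] >= threshold],
                      attributes, target, min_cases),
    }


# Hypothetical training cases.
training = [
    {"age": 22, "buys": "no"}, {"age": 25, "buys": "no"},
    {"age": 31, "buys": "yes"}, {"age": 38, "buys": "yes"},
    {"age": 45, "buys": "yes"}, {"age": 52, "buys": "no"},
]
tree = train(training, ["age"], "buys", min_cases=5)
print(tree)
```

With these six rows and min_cases=5, the root splits at age 31. The branch for ages below 31 becomes a leaf in which every case is "no", and the branch for ages 31 and above becomes a leaf of four cases whose distribution is 0.75 "yes" and 0.25 "no".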

So, based on the training data set provided, the decision tree mining model establishes certain probabilities about the attributes in the customer case set. Applying those probabilities to other customer data, you can make predictions about customer behavior based upon the distribution information, or content, of the data mining model.

For more information about the data mining algorithms used, see Data Mining Algorithms.

Two objects are used to represent the structure of a case set in the Decision Support Objects (DSO) library. The MiningModel object holds the information about the data mining algorithms, queries and so on needed to describe and analyze the case set, as well as a collection of data mining column objects. In addition to containing information about its data type, each data mining column object holds attributes that describe its use within the data mining model, such as its relation to other data mining columns, whether it is used as a predictable column, whether it holds other columns, whether it is used as input for the data mining process, and so on. These data mining columns are represented by a collection of Column objects in the MiningModel object.

The data mining model is an abstract object; that is, the training data used to construct the case set is not saved. Rather, the abstraction of the model itself is saved, along with the results of the training data analysis, so that the same data mining model can be used with other data fitting its case set to provide predictive analysis.

For more information about the MiningModel object, see clsMiningModel.

For more information about the Column object, see clsColumn.

Integration with OLAP and Relational Data Sources

Data mining models can be trained using data from either an OLAP cube or a relational database. For relational databases, the only requirement is that the data source can be accessed through an OLE DB provider. After a mining model has been created and trained, a connection to the original data source for the model is not required. For example, consider the following scenario:

A large telephone company plans to roll out high-speed Internet access in a new market area. From experience in other market areas, the company has determined that persons who purchase high-speed Internet access fit a certain profile. The data that describes this profile is stored in a centrally managed relational database. A mining model is created that includes all of the elements (that is, characteristics) of this profile as columns. This model is then trained using the information from the previously existing market areas. This model can then be distributed to the new market areas for batch processing of the customers in that market. Additionally, the same model can be incorporated into the new service call center for the company, where the high-speed Internet service can be marketed to new customers who match that specific profile. In either situation, the original data from the previously existing markets is not needed to make a prediction of the Internet needs of the customer. The model contains within itself all of the information that is needed to make a prediction.
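
The idea that a trained model carries everything needed for prediction can be sketched in a few lines of Python. This is an illustration only and does not reflect how Analysis Services stores or distributes mining models: the JSON format, the monthly_phone_bill attribute, the threshold, the probabilities, and the predict helper are all hypothetical.

```python
import json

# Hypothetical trained model: only the split rule and the leaf
# distributions are kept. No training rows are stored.
trained_model = {
    "attribute": "monthly_phone_bill",
    "threshold": 80,
    "low": {"distribution": {"buys": 0.1, "declines": 0.9}},
    "high": {"distribution": {"buys": 0.7, "declines": 0.3}},
}


def predict(model, case):
    """Follow the splits to a leaf and return (most likely value, probability)."""
    node = model
    while "distribution" not in node:
        at_or_above = case[node["attribute"]] >= node["threshold"]
        node = node["high"] if at_or_above else node["low"]
    dist = node["distribution"]
    label = max(dist, key=dist.get)
    return label, dist[label]


# Central site: serialize the model for distribution.
payload = json.dumps(trained_model)

# New market area: load the model and score local customers.
# The data from the earlier markets is never consulted.
local_model = json.loads(payload)
new_customers = [
    {"name": "Customer A", "monthly_phone_bill": 45},
    {"name": "Customer B", "monthly_phone_bill": 120},
]
predictions = [predict(local_model, c) for c in new_customers]
for customer, (label, probability) in zip(new_customers, predictions):
    print(customer["name"], label, probability)
```

The same predict call could serve both uses in the scenario: a batch run over all customers in a new market, or a single lookup made while a call center agent is speaking with a customer.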