Training a Local Data Mining Model

Analysis Services Programming

Analysis Services Programming

Training a Local Data Mining Model

In data mining, training is the process that inserts the data into the model that will be used as the basis for making predictions. The INSERT INTO statement is used to accomplish this task. The syntax of the statement depends on the kind of object on which the model is to be based. Microsoft® SQL Server™ 2000 Analysis Services supports two different kinds of base objects for data mining models: OLAP cubes and relational tables.

The process of training a mining model can be broken down into two parts. First, the columns that define the model's structure are populated with content from the data source. Second, the content is analyzed using the algorithm specified in the CREATE MINING MODEL statement. The results of this analysis are stored in the mining model as a collection of nodes. These nodes can be browsed using the MINING_MODEL_CONTENT schema rowset or by executing a content query against the mining model.

Training an OLAP Data Mining Model

For OLAP mining models, the general form for the INSERT INTO statement is:

INSERT INTO <model Name>

No column names or other source data is needed to train the model. This is because the structure of the mining model is based on a cube and is therefore known in advance. No other steps are necessary to complete the processing of the model.

Training a Relational Mining Model

Training a data mining model based on a table in a relational database is slightly more complicated than processing an OLAP mining model. When training a relational data mining model, the columns to be populated must be specified explicitly along with their data source. This is because the INSERT INTO command in the relational data mining model does not have the same information available as an OLAP mining model. The general form of the command for training relational mining models is as follows:

INSERT INTO <model name> (<Column Names>) <Data>

To understand the process of training a relational mining model, consider the example of a model with the columns Name, Age, and Hair Color. The following statement can be used to populate this model:

INSERT INTO [MyModel]
  // Define the list of columns to be populated
  (
    [Name], [Age], [Hair Color]
  )
  // Use the OPENROWSET command to pass a SELECT query to an SQL OLE DB provider
  OPENROWSET 
  ( 
    'SQLOLEDB', 'Initial Catalog=FoodMart 2000', 
    'Select [Name], [Age], [Hair Color] FROM [Customers]'
  )
The OPENROWSET Statement

Analysis Services does not support the use of direct SQL SELECT queries to retrieve data rowsets for training data mining models. Instead, it supports the OPENROWSET statement, which enables applications to specify an external query in place of actual data or an SQL SELECT statement. The syntax of this command is as follows:

OPENROWSET ( '<Provider Name>', '<Connection String>', '<Query Syntax>' )

The <Provider Name> token must correspond to an OLE DB compliant data source, such as 'SQLOLEDB' or 'MSOLAP'. The <Connection String> token must correspond to a valid connection string for the data source, minus the provider property. Finally, the <Query Syntax> token should correspond to a valid query in the supported language of the provider that will return the desires rowset. In this example, the provider used is the SQL Server 2000 OLE DB provider. The Connection String property specifies that the FoodMart 2000 database is to be used as the default database for the query. Finally, the query itself is defined as a standard Transact-SQL query that returns three columns from a table called Customers.

Training Models That Include Nested Columns

The SHAPE command must be used to populate the columns in a nested table. The general format for this command is as follows:

SHAPE { <Rowset Query> }
APPEND
(
   { <Rowset Query> }
   RELATE <Parent Key Column> TO <Child Key Column>
)
AS <Nested Column Name>

The following example demonstrates populating a nested table by using the shape provider:

INSERT INTO [Age Prediction]
  (  /* Define the columns of the case table */
    [Customer Id], [Gender], [Age],
    /* Define the columns of the nested table */
    [Product Purchases](SKIP, [Product Name], [Quantity], [Product Type])
  )
SHAPE 
  {
    OPENROWSET ('SQLOLEDB','INITIAL CATALOG=FoodMart 2000;',
    'SELECT [Customer Id], [Gender], [Age] FROM Customers ORDER BY [Customer ID]')
   } 
  APPEND 
   (
    {
      OPENROWSET ('SQLOLEDB','INITIAL CATALOG=FoodMart 2000;',
      SELECT [CustID], [Product Name], [Quantity], [Product Type] FROM Sales ORDER BY [CustID])
    } 
     RELATE [Customer Id] To [CustID]
  ) 
   AS [Product Purchases]

In this example, a column in the case table called [Product Purchases] is populated by a nested table. The names of the columns in this nested table are defined inside the parentheses of the fourth line of the INSERT INTO statement. The SHAPE command is then used to define the columns that will be used to populate the case table. The SHAPE clause defines columns that will be used to populate the nested table as columns that are contained within the APPEND clause. The relationship between the case table and the nested table is then defined by using the RELATE clause. The result of the SHAPE command is then aliased to be the same as that of the original column that contained the nested tale.

When using the shape command it is important to use the ORDER BY clause to enforce the order of columns in the query. Failure to use this clause may cause some or all data to be ignored in your nested tables.

See Also

Data Mining Models

Data Mining Columns