Microsoft Decision Trees

Analysis Services


The Microsoft® Decision Trees algorithm is based upon the notion of classification. The algorithm builds a tree that predicts the value of one column based upon the remaining columns in the training set. Each node in the tree therefore represents a particular case of a column. The algorithm decides where to place each node, and a node at a different depth than its siblings may represent a case of a different column. For instance, consider the following training table.

Shares files   Uses scanner   Infected before   Risk
Yes            Yes            No                High
Yes            No             No                High
No             No             Yes               Medium
Yes            Yes            Yes               Low
Yes            Yes            No                High
No             Yes            No                Low
Yes            No             Yes               High

For this training data, the following decision tree may be produced.

Shares files?
  Yes -> Infected before?
           No  -> High
           Yes -> Low / High (mixed)
  No  -> Uses scanner?
           Yes -> Low
           No  -> Medium

Notice that for users who share files, the most important factor (that is, the most significant training column) for determining their risk of computer virus infection is Infected Before. For users who do not share files, the most important factor is Uses Scanner. This demonstrates one of the key concepts behind the decision tree algorithm: a column may be used at more than one location in the tree, and its importance in the prediction may change from one location to another.
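The exact split-scoring used by the Microsoft Decision Trees provider is internal to Analysis Services; the sketch below substitutes plain information gain (an assumption, not the provider's actual criterion) to show how a different column can win the split at different nodes of the same tree, reproducing the structure described above:

```python
from collections import Counter
from math import log2

# Training cases from the table above: (shares, scanner, infected, risk)
CASES = [
    ("Yes", "Yes", "No",  "High"),
    ("Yes", "No",  "No",  "High"),
    ("No",  "No",  "Yes", "Medium"),
    ("Yes", "Yes", "Yes", "Low"),
    ("Yes", "Yes", "No",  "High"),
    ("No",  "Yes", "No",  "Low"),
    ("Yes", "No",  "Yes", "High"),
]
COLUMNS = ["Shares files", "Uses scanner", "Infected before"]

def entropy(rows):
    """Shannon entropy of the Risk column over a set of cases."""
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, col, base):
    """Information gain of splitting these cases on the given column."""
    remainder = 0.0
    for value in {r[col] for r in rows}:
        part = [r for r in rows if r[col] == value]
        remainder += len(part) / len(rows) * entropy(part)
    return base - remainder

def best_split(rows):
    """Index of the column whose split yields the highest information gain."""
    base = entropy(rows)
    return max(range(len(COLUMNS)), key=lambda c: gain(rows, c, base))

# Root split, then the best split within each branch:
print(COLUMNS[best_split(CASES)])                                   # Shares files
print(COLUMNS[best_split([r for r in CASES if r[0] == "Yes"])])     # Infected before
print(COLUMNS[best_split([r for r in CASES if r[0] == "No"])])      # Uses scanner
```

With this measure, Shares Files wins at the root, but Infected Before is the stronger predictor among file sharers while Uses Scanner is stronger among non-sharers, matching the tree above.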

Mining Parameters

The Microsoft Decision Trees algorithm provider currently supports two mining parameters, which can be used to change the behavior of the algorithm when a model is created with the CREATE MINING MODEL statement. The parameters are defined in the MINING_SERVICE_PARAMETERS schema rowset; a description of each parameter is provided in the following table.

Parameter Description
COMPLEXITY_PENALTY A floating-point number between 0 and 1. Used to inhibit the growth of the decision tree; the value is subtracted from 1 and used as a factor in determining the likelihood of a split. The deeper a branch of the decision tree, the less likely a split becomes; the complexity penalty influences that likelihood. A low complexity penalty increases the likelihood of a split, while a high complexity penalty decreases it. The effect of this mining parameter depends on the mining model itself; some experimentation and observation may be required to tune the data mining model accurately.

The default value is based on the number of attributes for a given model:

  • For 1 to 9 attributes, the value is 0.5.

  • For 10 to 99 attributes, the value is 0.9.

  • For 100 or more attributes, the value is 0.99.

MINIMUM_LEAF_CASES A non-negative integer with a range of 0 to 2,147,483,647. Determines the minimum number of leaf cases required to generate a split in the decision tree. A low value causes more splits in the decision tree, but can increase the likelihood of overfitting. A high value reduces the number of splits in the decision tree, but can inhibit the growth of the decision tree. The default value is 10.
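The provider's internal scoring is not documented; the following sketch only shows the direction in which each parameter pushes the split decision. The function name, the discount formula, and the 0.1 threshold are all illustrative assumptions, not the provider's actual math:

```python
def should_split(n_leaf_cases: int, split_score: float, depth: int,
                 complexity_penalty: float = 0.5,
                 minimum_leaf_cases: int = 10) -> bool:
    """Illustrative gate on tree growth, not the provider's actual scoring.

    MINIMUM_LEAF_CASES: a node with fewer cases than this is never split.
    COMPLEXITY_PENALTY: (1 - penalty) shrinks the effective score as the
    tree gets deeper, so deeper splits need stronger evidence.
    """
    if n_leaf_cases < minimum_leaf_cases:
        return False
    # Hypothetical scoring: discount the raw score by (1 - penalty) ** depth
    # and split only if it still clears an (illustrative) threshold.
    return split_score * (1.0 - complexity_penalty) ** depth > 0.1

# With the defaults, the seven-case training table above is never split,
# because 7 < MINIMUM_LEAF_CASES (10):
print(should_split(n_leaf_cases=7, split_score=0.58, depth=0))    # False
# A larger node with the same evidence splits near the root...
print(should_split(n_leaf_cases=500, split_score=0.58, depth=0))  # True
# ...but deep in the tree the complexity penalty suppresses the split:
print(should_split(n_leaf_cases=500, split_score=0.58, depth=4))  # False
```

This is why lowering MINIMUM_LEAF_CASES or COMPLEXITY_PENALTY produces deeper, bushier trees, at the cost of a greater risk of overfitting.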

See Also

MINING_SERVICE_PARAMETERS

CREATE MINING MODEL Statement