The ML classes discussed in this section implement Classification and Regression Tree algorithms described in [Breiman84].
The class CvDTree represents a single decision tree that may be used alone or as a base class in tree ensembles (see Boosting and Random Trees).
A decision tree is a binary tree (a tree in which each non-leaf node has two child nodes). It can be used either for classification or for regression. For classification, each tree leaf is marked with a class label; multiple leaves may have the same label. For regression, a constant is assigned to each tree leaf, so the approximation function is piecewise constant.
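To reach a leaf node and obtain a response for the input feature vector, the prediction procedure starts at the root node. From each non-leaf node, the procedure goes to the left (selects the left child node as the next observed node) or to the right, based on the value of a certain variable whose index is stored in the observed node. Two kinds of variables are possible: ordered variables, whose value is compared with a threshold stored in the node (the procedure goes to the left if the value is less than the threshold and to the right otherwise), and categorical variables, whose value is tested for membership in a subset of values stored in the node (the procedure goes to the left if the value is in the subset and to the right otherwise).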
So, in each node, a pair of entities (variable_index, decision_rule (threshold/subset)) is used. This pair is called a split (a split on the variable variable_index). Once a leaf node is reached, the value assigned to this node is used as the output of the prediction procedure.
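The traversal described above can be illustrated with a minimal, self-contained sketch. The Node type and the functions predict, make_leaf, and make_example_tree are hypothetical names introduced for illustration only; they are not the library's CvDTreeNode or CvDTree API, and the categorical subset is stored as a std::set rather than a bit array for readability.

```cpp
#include <memory>
#include <set>
#include <utility>
#include <vector>

// Hypothetical node type for illustration only; not the library's CvDTreeNode.
struct Node {
    double value = 0.0;        // leaf payload: class label or regression value
    int variable_index = -1;   // variable tested in a non-leaf node
    bool categorical = false;  // true: subset test, false: threshold test
    double threshold = 0.0;    // used when the variable is ordered
    std::set<int> subset;      // used when the variable is categorical
    std::unique_ptr<Node> left, right;

    bool is_leaf() const { return !left && !right; }
};

// Walks from the root to a leaf and returns the value stored in that leaf.
double predict(const Node& root, const std::vector<double>& sample) {
    const Node* node = &root;
    while (!node->is_leaf()) {
        const double v = sample[node->variable_index];
        const bool go_left = node->categorical
            ? node->subset.count(static_cast<int>(v)) > 0  // value in subset
            : v < node->threshold;                         // value below threshold
        node = go_left ? node->left.get() : node->right.get();
    }
    return node->value;
}

std::unique_ptr<Node> make_leaf(double value) {
    std::unique_ptr<Node> n(new Node());
    n->value = value;
    return n;
}

// Builds a small example tree:
//   root:  x[0] < 2.5        ? leaf(10) : inner
//   inner: x[1] in {1, 3}    ? leaf(20) : leaf(30)
std::unique_ptr<Node> make_example_tree() {
    std::unique_ptr<Node> inner(new Node());
    inner->variable_index = 1;
    inner->categorical = true;
    inner->subset = {1, 3};
    inner->left = make_leaf(20);
    inner->right = make_leaf(30);

    std::unique_ptr<Node> root(new Node());
    root->variable_index = 0;
    root->threshold = 2.5;
    root->left = make_leaf(10);
    root->right = std::move(inner);
    return root;
}
```

With this example tree, a sample whose first variable is below 2.5 ends in the leaf holding 10, and any other sample is routed by the categorical test on its second variable.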
Sometimes, certain features of the input vector are missing (for example, in the darkness it is difficult to determine the object color), and the prediction procedure may get stuck at a certain node (in the mentioned example, if the node is split by color). To avoid such situations, decision trees use so-called surrogate splits. That is, in addition to the best “primary” split, every tree node may also be split on one or more other variables with nearly the same results.
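The tree is built recursively, starting from the root node. All training data (feature vectors and responses) is used to split the root node. In each node, the optimum decision rule (the best “primary” split) is found based on some criteria. In the ML classes described here, the Gini “purity” criterion is used for classification, and the sum of squared errors is used for regression. Then, if necessary, the surrogate splits are found. They resemble the results of the primary split on the training data. All the data is divided between the left and the right child node using the primary and the surrogate splits (as is done in the prediction procedure). Then, the procedure recursively splits both the left and the right nodes. At each node, the recursive procedure may stop (that is, stop splitting the node further) in one of the following cases: the depth of the constructed tree branch has reached the specified maximum value; the number of training samples in the node is less than the specified threshold, so it is not statistically representative to split the node further; all the samples in the node belong to the same class or, in case of regression, the variation is too small; or the best found split does not give any noticeable improvement compared to a random choice.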
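To make the split search for classification more concrete, here is a minimal sketch that scans one ordered variable and keeps the threshold with the lowest weighted Gini impurity. The names gini, BestSplit, and best_ordered_split are hypothetical, placing the threshold midway between neighbouring values is an assumption of this sketch, and the library's actual search may differ in its details. Regression (sum of squared errors) is not covered here.

```cpp
#include <algorithm>
#include <limits>
#include <map>
#include <utility>
#include <vector>

// Gini impurity of a set described by per-class counts: 1 - sum_k p_k^2.
double gini(const std::map<int, int>& counts, int total) {
    if (total == 0) return 0.0;
    double sum_sq = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / total;
        sum_sq += p * p;
    }
    return 1.0 - sum_sq;
}

struct BestSplit {
    bool found = false;
    double threshold = 0.0;
    double impurity = std::numeric_limits<double>::infinity();
};

// samples: (variable value, class label) pairs for one ordered variable.
// Tries every boundary between distinct sorted values and keeps the one
// with the lowest sample-weighted Gini impurity of the two partitions.
BestSplit best_ordered_split(std::vector<std::pair<double, int>> samples) {
    std::sort(samples.begin(), samples.end());
    std::map<int, int> left, right;
    for (const auto& s : samples) ++right[s.second];

    BestSplit best;
    const int n = static_cast<int>(samples.size());
    for (int i = 0; i + 1 < n; ++i) {
        // Move sample i from the right partition to the left one.
        ++left[samples[i].second];
        --right[samples[i].second];
        if (samples[i].first == samples[i + 1].first) continue;  // no boundary

        const int n_left = i + 1;
        const int n_right = n - n_left;
        const double weighted =
            (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n;
        if (weighted < best.impurity) {
            best.found = true;
            best.impurity = weighted;
            // Assumption: place the threshold midway between the two values.
            best.threshold = 0.5 * (samples[i].first + samples[i + 1].first);
        }
    }
    return best;
}
```

For four samples with values 1, 2, 3, 4 and labels 0, 0, 1, 1, the scan selects the boundary between 2 and 3, where both partitions are pure.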
When the tree is built, it may be pruned using a cross-validation procedure, if necessary. That is, some branches of the tree that may lead to the model overfitting are cut off. Normally, this procedure is only applied to standalone decision trees. Usually tree ensembles build trees that are small enough and use their own protection schemes against overfitting.
Besides prediction, which is the obvious use of decision trees, a tree can also be used for various data analyses. One of the key properties of the constructed decision tree algorithms is the ability to compute the importance (relative decisive power) of each variable. For example, in a spam filter that uses a set of words occurring in the message as a feature vector, the variable importance rating can be used to determine the most “spam-indicating” words and thus help keep the dictionary size reasonable.
Importance of each variable is computed over all the splits on this variable in the tree, primary and surrogate ones. Thus, to compute variable importance correctly, the surrogate splits must be enabled in the training parameters, even if there is no missing data.
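One way to picture this computation is to accumulate the quality of every split, primary or surrogate, under the variable it tests. The sketch below does that over a flat list of splits. SplitRecord and variable_importance are hypothetical names, and rescaling the result so that the importances sum to 1 is an assumption made for readability rather than a statement about the library's output.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical flattened record of one split; for illustration only.
struct SplitRecord {
    int variable_index;  // variable the split tests
    double quality;      // positive split quality
    bool surrogate;      // false for the primary split of a node
};

// Sums split quality per variable over primary and surrogate splits alike,
// then rescales so that the importances add up to 1 (an assumption here).
std::vector<double> variable_importance(const std::vector<SplitRecord>& splits,
                                        std::size_t variable_count) {
    std::vector<double> importance(variable_count, 0.0);
    double total = 0.0;
    for (const SplitRecord& s : splits) {
        // Surrogate splits contribute exactly like primary ones.
        importance[s.variable_index] += s.quality;
        total += s.quality;
    }
    if (total > 0.0) {
        for (double& v : importance) v /= total;
    }
    return importance;
}
```

If surrogate splits were left out of the input list, a variable that only ever appears as a surrogate would receive zero importance, which is the reason the text asks for surrogates to be enabled.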
The CvDTreeSplit structure represents a possible decision tree node split. It has the following public members:
var_idx: Index of the variable on which the split is created.
inversed: If it is nonzero, the inverse split rule is used; that is, the left and right branches are exchanged in the rule expressions below.
quality: The split quality, a positive number. It is used to choose the best primary split, then to choose and sort the surrogate splits. After the tree is constructed, it is also used to compute variable importance.
next: Pointer to the next split in the node's list of splits.
subset: Bit array indicating the value subset in case of a split on a categorical variable. The rule is:
    if var_value in subset
      then next_node <- left
      else next_node <- right
ord.c: The threshold value in case of a split on an ordered variable. The rule is:
    if var_value < ord.c
      then next_node <- left
      else next_node <- right
ord.split_point: Used internally by the training algorithm.
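The two rules above, together with the inverse flag, can be sketched as a single direction test. The Split type below is a hypothetical stand-in modelled on the members described in this section, not the library's structure, and packing 32 categories per word of the bit array is an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical split record; illustration only.
struct Split {
    bool categorical = false;
    bool inversed = false;
    float threshold = 0.0f;             // ordered variable: compared with the value
    std::vector<std::uint32_t> subset;  // categorical variable: one bit per category
};

// Tests whether bit `category` is set in the bit array.
// Assumption: 32 categories are packed into each 32-bit word.
bool in_subset(const std::vector<std::uint32_t>& subset, int category) {
    if (category < 0) return false;
    const std::size_t word = static_cast<std::size_t>(category) >> 5;
    if (word >= subset.size()) return false;
    return ((subset[word] >> (category & 31)) & 1u) != 0u;
}

// Returns true when the sample should continue to the left child.
bool goes_left(const Split& split, float var_value) {
    const bool left = split.categorical
        ? in_subset(split.subset, static_cast<int>(var_value))
        : var_value < split.threshold;
    // An inversed split exchanges the left and right branches.
    return split.inversed ? !left : left;
}
```

Storing the subset as a bit array keeps the membership test to one shift and one mask, regardless of how many categories the subset contains.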
The CvDTreeNode structure represents a node in a decision tree. It has the following public members:
class_idx: Class index normalized to the 0..class_count-1 range and assigned to the node. It is used internally in classification trees and tree ensembles.
Tn: Tree index in an ordered sequence of pruned trees. The indices are used during and after the pruning procedure. The root node has the maximum value Tn of the whole tree, child nodes have Tn less than or equal to the parent’s Tn, and nodes with Tn ≤ CvDTree::pruned_tree_idx are not used at the prediction stage (the corresponding branches are considered cut off), even if they have not been physically deleted from the tree at the pruning stage.
value: Value at the node: a class label in case of classification or an estimated function value in case of regression.
parent: Pointer to the parent node.
left: Pointer to the left child node.
right: Pointer to the right child node.
split: Pointer to the first (primary) split in the node's list of splits.
sample_count: The number of samples that fall into the node at the training stage. It is used to resolve difficult cases, when the variable for the primary split is missing and all the variables for the surrogate splits are missing too. In this case, the sample is directed to the left if left->sample_count > right->sample_count and to the right otherwise.
depth: Depth of the node. The root node depth is 0, and a child node's depth is its parent’s depth + 1.
Numerous other fields of CvDTreeNode are used internally at the training stage.
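The handling of missing values, with the primary split tried first, then the surrogates, and finally the sample-count comparison, can be sketched as follows. Split, Node, Direction, and choose_direction are hypothetical, simplified names (ordered variables only, child counts stored directly in the node); the sketch illustrates the fallback order described in this section and is not the library's implementation.

```cpp
#include <vector>

// Hypothetical, simplified types for illustration (ordered variables only).
struct Split {
    int var_idx;
    float threshold;
};

struct Node {
    std::vector<Split> splits;  // splits[0] is primary, the rest are surrogates
    int left_sample_count;      // training samples that went to the left child
    int right_sample_count;     // training samples that went to the right child
};

enum class Direction { Left, Right };

// Picks a branch for a sample; missing[i] marks variable i as unavailable.
Direction choose_direction(const Node& node,
                           const std::vector<float>& sample,
                           const std::vector<bool>& missing) {
    // Try the primary split first, then each surrogate in order.
    for (const Split& s : node.splits) {
        if (!missing[s.var_idx]) {
            return sample[s.var_idx] < s.threshold ? Direction::Left
                                                   : Direction::Right;
        }
    }
    // Every split variable is missing: follow the child that received
    // more training samples, going right on a tie.
    return node.left_sample_count > node.right_sample_count ? Direction::Left
                                                            : Direction::Right;
}
```

Because the surrogates are tried in their stored order, sorting them by how closely they mimic the primary split means the best available substitute is used first.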
The CvDTreeParams structure contains all the decision tree training parameters. You can initialize it with the default constructor and then override any parameters directly before training, or you can fully initialize it using the advanced variant of the constructor.
The constructors.
The default constructor initializes all the parameters with the default values tuned for the standalone classification tree:
CvDTreeParams() : max_categories(10), max_depth(INT_MAX), min_sample_count(10),
    cv_folds(10), use_surrogates(true), use_1se_rule(true),
    truncate_pruned_tree(true), regression_accuracy(0.01f), priors(0)
{}
Decision tree training data and shared data for tree ensembles. The structure is mostly used internally for storing both standalone trees and tree ensembles efficiently. Basically, it contains the following types of information:
There are two ways of using this structure. In simple cases (for example, a standalone tree or a ready-to-use “black box” tree ensemble, like Random Trees or Boosting), there is no need to care or even to know about the structure. You just construct the needed statistical model, train it, and use it. The CvDTreeTrainData structure is constructed and used internally. However, for custom tree algorithms or other sophisticated cases, the structure may be constructed and used explicitly. The scheme is the following:
The class implements a decision tree as described in the beginning of this section.
Trains a decision tree.
There are four train methods in CvDTree:
The function is parallelized with the TBB library.
Returns the leaf node of a decision tree corresponding to the input vector.
The method traverses the decision tree and returns the reached leaf node as output. The prediction result, either the class label or the estimated function value, may be retrieved as the value field of the CvDTreeNode structure, for example: dtree->predict(sample,mask)->value.
Returns the error of the decision tree.
The method calculates the error of the decision tree. For classification, it is the percentage of incorrectly classified samples; for regression, it is the mean of squared errors over the samples.
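The two error measures named here can be written down directly. The functions below are a hypothetical, standalone sketch that works on plain vectors of predictions and true responses; they are not the library's method, and the function names are introduced for illustration only.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Percentage of samples whose predicted label differs from the true label.
double classification_error_percent(const std::vector<int>& predicted,
                                    const std::vector<int>& actual) {
    if (predicted.empty() || predicted.size() != actual.size())
        throw std::invalid_argument("inputs must be non-empty and equally sized");
    std::size_t wrong = 0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        if (predicted[i] != actual[i]) ++wrong;
    }
    return 100.0 * static_cast<double>(wrong) /
           static_cast<double>(predicted.size());
}

// Mean of squared differences between predicted and true responses.
double mean_squared_error(const std::vector<double>& predicted,
                          const std::vector<double>& actual) {
    if (predicted.empty() || predicted.size() != actual.size())
        throw std::invalid_argument("inputs must be non-empty and equally sized");
    double sum = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        const double d = predicted[i] - actual[i];
        sum += d * d;
    }
    return sum / static_cast<double>(predicted.size());
}
```

Note that the classification measure is a percentage in the 0..100 range, while the regression measure is expressed in squared units of the response.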
Returns the variable importance array.
Returns the root of the decision tree.
Returns the CvDTree::pruned_tree_idx parameter.
The parameter CvDTree::pruned_tree_idx is used to prune a decision tree. See the CvDTreeNode::Tn parameter.
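How the two values might interact at prediction time can be sketched as follows, under the assumption that a node whose Tn does not exceed pruned_tree_idx is treated as a leaf, so its subtree is skipped even though it is still stored. The Node layout (children as indices into a vector, ordered variables only) and the function predict_pruned are hypothetical and serve only as an illustration.

```cpp
#include <vector>

// Hypothetical node layout; children are indices into a vector of nodes.
struct Node {
    int Tn;           // index in the ordered sequence of pruned trees
    int left, right;  // child indices, -1 for a physical leaf
    int var_idx;      // ordered variable tested in this node
    float threshold;
    double value;     // value used if prediction stops at this node
};

// Descends from the root (nodes[0]) until it reaches a physical leaf or
// a node that is cut off under the assumed rule Tn <= pruned_tree_idx.
double predict_pruned(const std::vector<Node>& nodes,
                      const std::vector<float>& sample,
                      int pruned_tree_idx) {
    int i = 0;
    while (nodes[i].left >= 0 && nodes[i].Tn > pruned_tree_idx) {
        const Node& n = nodes[i];
        i = sample[n.var_idx] < n.threshold ? n.left : n.right;
    }
    return nodes[i].value;
}
```

Raising pruned_tree_idx in this sketch shortens the effective tree without deleting any nodes, so the same stored tree can be evaluated at several pruning levels.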
Returns the training data used by the decision tree.
Example: building a tree for classifying mushrooms. See the mushroom.cpp sample that demonstrates how to build and use the decision tree.
[Breiman84] Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Wadsworth.