# Bei's Study Notes

Back to Intro

## Decision Tree

It is a classic and natural model of learning, closely related to the notion of "divide and conquer". A decision tree is a tree structure in which each non-leaf node holds a binary predicate that tests the data and delegates the computation to one of its subtrees based on the outcome of this predicate. When a leaf node is reached, its result is returned.

## Univariate Trees

In a univariate tree, the test at each internal node uses only one of the input dimensions. Tree learning algorithms are greedy: at each step, starting at the root with the complete training data, we look for the best split, then continue splitting recursively with the corresponding subsets until no further split is needed, at which point a leaf node is created and labeled.

### Classification Trees

The goodness of a split is quantified by an impurity measure. A split is pure if, after the split, all the instances choosing a branch belong to the same class. Given node $m$, let $N_m$ be the number of training instances reaching node $m$. For the root, it is $N$. Let $N_m^i$ of them belong to class $C_i$, with $\sum_i N_m^i = N_m$. The probability of class $C_i$ given that an instance reaches node $m$ is:

$$p_m^i = \frac{N_m^i}{N_m}$$

Node $m$ is pure if $p_m^i$ for all $i$ are either 0 or 1. One possible measure is entropy:

$$\mathcal{I}_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$$

where $0 \log 0 \equiv 0$ (indeed, $\lim_{p \to 0} p \log p = 0$). In general, a function $\phi(p, 1-p)$ is a valid measurement if:

1. $\phi(p, 1-p)$ is increasing in $p$ on $[0, 1/2]$ and decreasing in $p$ on $[1/2, 1]$.
2. $\phi(1/2, 1/2) \ge \phi(p, 1-p)$ for any $p \in [0, 1]$, or in other words, the uniform distribution has maximum entropy.
3. $\phi(0, 1) = \phi(1, 0) = 0$, or in other words, deterministic events have zero entropy.

Examples are:

1. Entropy

   $$\phi(p, 1-p) = -p \log_2 p - (1-p) \log_2 (1-p)$$

2. Gini index

   $$\phi(p, 1-p) = 2p(1-p)$$

3. Misclassification error

   $$\phi(p, 1-p) = 1 - \max(p, 1-p)$$

Research has shown that there is no significant difference between these three measures.
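As a quick sanity check, the three measures can be computed for a two-class problem as functions of $p$ (a minimal sketch; the function names are my own):

```python
import math

# Three two-class impurity measures phi(p, 1 - p); all are maximal
# at p = 1/2 and vanish at p = 0 and p = 1.
def entropy(p):
    if p in (0.0, 1.0):  # convention: 0 * log 0 = 0
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def misclassification_error(p):
    return 1 - max(p, 1 - p)

for p in (0.0, 0.25, 0.5, 1.0):
    print(f"{p}: {entropy(p):.3f} {gini(p):.3f} {misclassification_error(p):.3f}")
```

All three rank a balanced node as most impure and a pure node as zero, which is why swapping one for another rarely changes the learned tree much.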

If node $m$ is not pure, then it should be split to increase purity. We look for the split that minimizes the impurity after the split. This choice is only locally optimal, and we have no guarantee of finding the purest decision tree overall.

Let $N_{mj}$ be the number of training instances that take branch $j$ after the split, and let $N_{mj}^i$ of them belong to class $C_i$; then

$$p_{mj}^i = \frac{N_{mj}^i}{N_{mj}}$$

The total impurity after the split is

$$\mathcal{I}'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$$

Therefore, the new entropy is the entropy of the subdivided cases minus the entropy of the subdivision itself: $\mathcal{I}'_m = H(\text{class}, \text{branch}) - H(\text{branch})$.
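The weighted impurity after a split can be sketched as follows (helper names `entropy_of` and `split_entropy` are my own, chosen to mirror the pseudocode later in these notes):

```python
from collections import Counter
import math

def entropy_of(labels):
    """Entropy I_m of the class distribution among `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(subdivisions, n_total):
    """Total impurity I'_m: the impurity of each branch weighted by the
    fraction N_mj / N_m of instances taking that branch."""
    return sum(len(branch) / n_total * entropy_of(branch)
               for branch in subdivisions)

labels = ["A", "A", "B", "B"]
print(entropy_of(labels))                          # 1.0 (maximally impure)
print(split_entropy([["A", "A"], ["B", "B"]], 4))  # 0.0 (pure branches)
```

A split into pure branches drives the weighted impurity to zero, which is exactly what the greedy search rewards.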

We thus obtain an algorithm to generate a decision tree:

```python
theta_i    = 0.1  # hyperparameter on complexity: the maximum impurity of a leaf
attributes = []   # attribute[i](x) reads the i-th attribute of sample x

def generateTree(samples):
    if entropy(samples) < theta_i:
        return Leaf(mode(samples))

    split, subdivisions = splitAttribute(samples)
    node = Internal(split)
    for subdivision in subdivisions:
        node.children.append(generateTree(subdivision))
    return node

def splitAttribute(samples):
    minEntropy = math.inf
    result = None
    for attribute in attributes:
        if isinstance(attribute, DiscreteAttribute):
            subdivisions = ...  # subdivide samples by the attribute's values
            e = splitEntropy(subdivisions, len(samples))
            if e < minEntropy:
                minEntropy = e
                result = (attribute, subdivisions)
        else:  # attribute is continuous
            splits = ...  # find all candidate thresholds
            for split in splits:
                subdivisions = ...  # subdivide samples by the threshold
                e = splitEntropy(subdivisions, len(samples))
                if e < minEntropy:
                    minEntropy = e
                    result = (attribute, subdivisions)
    return result
```

This is the basis of Classification and Regression Trees (CART), ID3, and C4.5. This method is biased towards discrete attributes with many values, whose splits tend to have lower entropy.

When there is noise, the tree may easily overfit. Therefore a hyperparameter is added to prevent over-splitting.

Sometimes the posterior probabilities of each class are stored in the leaf instead of only the most likely class.
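As a sketch of this idea (the class name `PosteriorLeaf` is my own), a leaf can estimate the posterior from the training instances that reach it:

```python
from collections import Counter

class PosteriorLeaf:
    """A leaf storing the full class posterior P(C_i | leaf) estimated
    from the training labels that reached it, not just the mode."""
    def __init__(self, labels):
        n = len(labels)
        self.posterior = {c: k / n for c, k in Counter(labels).items()}

    def predict(self):
        # Most likely class; ties broken arbitrarily.
        return max(self.posterior, key=self.posterior.get)

leaf = PosteriorLeaf(["spam", "spam", "ham"])
print(leaf.posterior)  # {'spam': 0.666..., 'ham': 0.333...}
print(leaf.predict())  # spam
```

Keeping the full posterior lets downstream code apply its own decision thresholds or misclassification costs.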

### Regression Trees

A regression tree is constructed in the same manner, except that the impurity measure is replaced with a regression loss.

Let $r^t$ be the supervision of sample $x^t$.

Let $X_m$ be the set of training instances that reach node $m$:

$$b_m(x) = \begin{cases} 1 & \text{if } x \in X_m \\ 0 & \text{otherwise} \end{cases}$$

Let $g_m$ be the mean (or median, if too noisy) supervision of the instances reaching node $m$:

$$g_m = \frac{\sum_t b_m(x^t)\, r^t}{\sum_t b_m(x^t)}$$

The need for a split is measured by the mean square error from the estimated value:

$$E_m = \frac{1}{N_m} \sum_{x^t \in X_m} \left(r^t - g_m\right)^2$$

If $E_m$ is acceptable, i.e. $E_m < \theta_r$, we create a leaf node and store $g_m$ on the node. Otherwise we split. Let $X_{mj}$ be the subdivision of $X_m$ taking branch $j$, with $g_{mj}$ defined accordingly:

$$g_{mj} = \frac{1}{N_{mj}} \sum_{x^t \in X_{mj}} r^t$$

The new loss is

$$E'_m = \frac{1}{N_m} \sum_{j} \sum_{x^t \in X_{mj}} \left(r^t - g_{mj}\right)^2$$
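A small worked example of comparing the error before and after a candidate two-way split (the helper `mse` is my own):

```python
def mse(values, estimate):
    """Mean square error of predicting `estimate` for every value."""
    return sum((r - estimate) ** 2 for r in values) / len(values)

# Supervision values reaching node m, and a candidate two-way split.
r_m = [1.0, 1.2, 3.0, 3.4]
g_m = sum(r_m) / len(r_m)   # leaf estimate g_m: the mean
e_m = mse(r_m, g_m)         # error E_m before the split

branches = [[1.0, 1.2], [3.0, 3.4]]
# E'_m: per-branch error around each branch mean, weighted by branch size.
e_after = sum(len(b) / len(r_m) * mse(b, sum(b) / len(b)) for b in branches)
print(e_m, e_after)  # the split reduces the error considerably
```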

In a more advanced model, $g_m$ could in fact be a function of $x$, for example linear:

$$g_m(x) = w_m^T x + w_{m0}$$

and the tree will store $w_m$ and $w_{m0}$ instead.
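A minimal sketch of fitting such a linear leaf by ordinary least squares, for one-dimensional inputs (the function name is my own):

```python
# Fit g_m(x) = w * x + w0 by ordinary least squares over the
# instances reaching the leaf (1-D inputs for brevity).
def fit_linear_leaf(xs, rs):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rs) / n
    w = (sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
         / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_r - w * mean_x
    return w, w0

w, w0 = fit_linear_leaf([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
print(w, w0)  # recovers r = 2x + 1: w = 2.0, w0 = 1.0
```

Such "model trees" can fit piecewise-linear functions with far fewer leaves than constant-leaf trees.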

### Pruning

Frequently, the stopping condition for splitting is based on the number of instances reaching the node. The idea is that any decision based on too few instances causes high variance. Stopping tree construction early is called prepruning the tree.

Another possibility to get simpler trees is postpruning, which in practice works better than prepruning.

My thought: this is because postpruning can use global information to adjust the model.

In postpruning, we hold out a part of the training set as a "pruning set". We then grow the tree with the rest of the training set until all leaves are pure and there is no training error. Finally, we use the pruning set to find subtrees that cause overfitting, and prune them by replacing a subtree with a leaf node whenever the leaf's performance on the pruning set is comparable to the subtree's.
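The procedure above can be sketched as reduced-error pruning; the `Node` class and `accuracy` helper here are minimal stand-ins of my own:

```python
from statistics import mode

class Node:
    def __init__(self, split=None, children=None, label=None, labels=()):
        self.split, self.children = split, children or []
        self.label = label
        self.labels = list(labels)  # training labels reaching this node

    def is_leaf(self):
        return not self.children

    def predict(self, x):
        if self.is_leaf():
            return self.label
        return self.children[self.split(x)].predict(x)

def accuracy(tree, dataset):
    return sum(tree.predict(x) == y for x, y in dataset) / len(dataset)

def prune(node, pruning_set):
    """Bottom-up: replace a subtree with a leaf whenever the leaf does
    at least as well as the subtree on the held-out pruning set."""
    if node.is_leaf():
        return node
    node.children = [prune(c, pruning_set) for c in node.children]
    leaf = Node(label=mode(node.labels), labels=node.labels)
    if accuracy(leaf, pruning_set) >= accuracy(node, pruning_set):
        return leaf
    return node

# A tiny overfit tree: the right branch memorizes a noisy "B" label.
tree = Node(split=lambda x: int(x > 2),
            children=[Node(label="A", labels=["A", "A"]),
                      Node(label="B", labels=["B"])],
            labels=["A", "A", "B"])
pruning_set = [(1, "A"), (3, "A")]
pruned = prune(tree, pruning_set)
print(pruned.is_leaf(), pruned.label)  # True A
```

On the pruning set the collapsed leaf outperforms the split, so the subtree is replaced.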