Machine Learning Study Note - Ensemble learning
Last updated: 2017-10-07 18:10:32 PDT.
The No Free Lunch Theorem states that there is no single learning algorithm that, in every domain, always induces the most accurate learner. If we have multiple learners at hand and combine them properly, the accuracy can be improved.
- How do we generate base-learners that complement each other?
- How do we combine the outputs of base-learners for maximum accuracy?
Our discussion in this chapter will answer these two related questions. We will see that model combination is not a trick that always increases accuracy.
Model Combination Schemes
Multiexpert combination, where base-learners work in parallel. Two cases:
a. Global approach, also called learner fusion: all base-learners give an output and all the outputs are used.
b. Local approach, also called learner selection: for example, in mixture of experts, there is a gating model that looks at the input and chooses one or very few base-learners.
Multistage combination uses a serial approach: the next base-learner is trained/tested only on the instances about which the previous learners are not confident. An example is cascading.
The simplest way to combine multiple classifiers is by voting, which is a linear combination of the learners:

$y_i = \sum_{j=1}^{L} w_j d_{ji}$, where $w_j \ge 0$ and $\sum_{j=1}^{L} w_j = 1$
This is also known as ensembles and linear opinion pools. Linear combination is not the only possibility. Other rules include:
- Sum
- Weighted sum
- Median
- Minimum
- Maximum
- Product
If the outputs are not posterior probabilities, these rules require that outputs be normalized to the same scale.
Simple voting is a special case where all voters have equal weight. In classification, this is called plurality voting, where the class having the maximum number of votes is the winner.
Weighted sum schemes can be seen as approximations under a Bayesian framework, with weights approximating prior model probabilities:

$P(C_i|x) = \sum_{\text{all models } \mathcal{M}_j} P(C_i|x, \mathcal{M}_j) P(\mathcal{M}_j)$

This is Bayesian model combination. In classification we have $d_{ji} = P(C_i|x, \mathcal{M}_j)$, $w_j = P(\mathcal{M}_j)$, and it becomes $y_i = \sum_j w_j d_{ji}$.
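A minimal numpy sketch of these combination rules. The outputs and weights below are made-up numbers for illustration, not from any real model:

```python
import numpy as np

# Hypothetical posteriors: three base classifiers, K = 3 classes, one input.
d = np.array([
    [0.7, 0.2, 0.1],   # output of learner 1
    [0.5, 0.3, 0.2],   # output of learner 2
    [0.1, 0.6, 0.3],   # output of learner 3
])
w = np.array([0.5, 0.3, 0.2])   # w_j >= 0, sum to 1

# Weighted sum: y_i = sum_j w_j * d_ji
y = w @ d
predicted_class = int(np.argmax(y))

# Plurality voting is the special case w_j = 1/L:
# each learner casts one vote for its argmax class.
votes = np.argmax(d, axis=1)
plurality = int(np.bincount(votes).argmax())
```

Because each row of `d` already sums to 1 (posterior probabilities), the combined `y` also sums to 1, so no extra normalization is needed here.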
Bagging (bootstrap aggregating) is a voting method whereby base-learners are made different by training them over slightly different training sets. Normally, we draw samples with replacement, so some instances will be drawn more than once. A learning algorithm is an unstable algorithm if small changes in the training set cause a large difference in the generated learner. Bagging generates slightly different training sets to train base-learners using an unstable learning procedure, and then, during testing, takes an average. For robustness, in the case of regression, one can take the median instead.
Decision trees and MLPs are unstable; nearest neighbor is stable, but condensed nearest neighbor is unstable. If the training set is large, we may want to generate training sets of smaller sizes.
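A bagging sketch for the regression case, on made-up 1-D data. The least-squares line fit here merely stands in for a base learner (a real application would use an unstable learner such as a tree); the point is the bootstrap resampling and the averaged prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): y = 2x + noise
N = 200
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(0, 0.3, N)

def fit_line(x, y):
    """Least-squares slope and intercept; stand-in for a base learner."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Bagging: L bootstrap samples drawn with replacement, one learner each.
L = 25
models = []
for _ in range(L):
    idx = rng.integers(0, N, N)   # N instances sampled with replacement
    models.append(fit_line(x[idx], y[idx]))

def predict(x0):
    preds = np.array([a * x0 + b for a, b in models])
    return preds.mean()           # use np.median(preds) for robustness
```

Each bootstrap sample leaves out roughly a third of the instances and repeats others, which is what perturbs the training set from learner to learner.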
In bagging, we generate complementary base-learners by diversifying the training data. In boosting, we actively try to generate complementary base-learners by training the next learner on the mistakes of the previous learners.
The original boosting algorithm was proposed in 1990 by Schapire. Given 3 weak learners, we can train a strong learner. A weak learner has error probability less than 1/2 (better than random guessing on two classes). A strong learner has arbitrarily small error probability. It goes like this:
- (Given 3 weak learners $d_1$, $d_2$, $d_3$.)
- Split the training set into 3 subsets $X_1$, $X_2$, $X_3$.
- Train $d_1$ on $X_1$.
- Feed $X_2$ to the trained $d_1$.
- Train $d_2$ with all the instances misclassified by $d_1$ in the previous step and some correctly classified instances.
- Feed $X_3$ to $d_1$ and $d_2$.
- Train $d_3$ with the instances on which $d_1$ and $d_2$ disagree.
During testing, we feed the instance to $d_1$ and $d_2$. If they agree, that is the response; otherwise the response of $d_3$ is taken.
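The steps above can be sketched end to end. Everything here is a made-up illustration: the data (labels $+1$ inside the tails of $[-1,1]$), the decision-stump weak learner, and the 25% rate of "some correct instances" are all assumptions, not part of Schapire's specification:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_stump(x, y):
    """Weak learner: best single-threshold classifier on 1-D data, labels in {-1, +1}."""
    best = None
    for thr in np.unique(x):
        for sign in (1, -1):
            err = np.mean(np.where(x >= thr, sign, -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x0: np.where(x0 >= thr, sign, -sign)

# Hypothetical data: +1 on both tails, so one stump cannot be perfect.
x = rng.uniform(-1, 1, 300)
y = np.where(np.abs(x) > 0.5, 1, -1)

# Split the training set into three subsets X1, X2, X3.
x1, y1 = x[:100], y[:100]
x2, y2 = x[100:200], y[100:200]
x3, y3 = x[200:], y[200:]

d1 = train_stump(x1, y1)                 # train d1 on X1

# Train d2 on X2's instances misclassified by d1, plus some correct ones.
wrong = d1(x2) != y2
keep = wrong | (rng.random(100) < 0.25)  # "some" correct instances (assumed 25%)
d2 = train_stump(x2[keep], y2[keep])

# Train d3 on the X3 instances where d1 and d2 disagree.
dis = d1(x3) != d2(x3)
d3 = train_stump(x3[dis], y3[dis])

def predict(x0):
    p1, p2 = d1(x0), d2(x0)
    return np.where(p1 == p2, p1, d3(x0))  # agree -> that label, else d3
```

On this toy data each stump alone errs on one tail, but the three-learner combination recovers both tails.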
In 1996, Freund and Schapire proposed a variant named AdaBoost (adaptive boosting). It uses the same training set over and over, so the set need not be large, but the classifiers should be simple so that they do not overfit. The following is a variant of AdaBoost called AdaBoost.M1:
- Initialize the probabilities $p_1^t = 1/N$ for all $t = 1, \ldots, N$.
- For each base-learner $j = 1, \ldots, L$:
  - Randomly draw $X_j$ from $X$ with probabilities $p_j^t$.
  - Train $d_j$ using $X_j$.
  - For each $(x^t, r^t)$, calculate $y_j^t = d_j(x^t)$.
  - Calculate the error rate: $\epsilon_j = \sum_t p_j^t \, 1(y_j^t \ne r^t)$.
  - If $\epsilon_j > 1/2$, the weak-learner assumption is violated and we stop.
  - Set $\beta_j = \epsilon_j / (1 - \epsilon_j)$.
  - For each $(x^t, r^t)$, reduce the probability of correctly classified samples: if $y_j^t = r^t$, set $p_{j+1}^t = \beta_j p_j^t$; otherwise $p_{j+1}^t = p_j^t$.
  - Normalize the $p_{j+1}^t$ so they sum to 1.

During testing, the prediction is a weighted vote, $y_i = \sum_j \left(\log \frac{1}{\beta_j}\right) d_{ji}(x)$, so more accurate base-learners (smaller $\beta_j$) receive larger weights.
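The AdaBoost.M1 loop above can be sketched with the same kind of decision-stump weak learner. The band-shaped data, the stump, and the choice of $L = 20$ rounds are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_stump(x, y):
    """Weak learner: best single-threshold classifier on 1-D data, labels in {-1, +1}."""
    best = None
    for thr in np.unique(x):
        for sign in (1, -1):
            err = np.mean(np.where(x >= thr, sign, -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x0: np.where(x0 >= thr, sign, -sign)

# Hypothetical data: +1 inside a band, not separable by a single stump.
x = rng.uniform(-1, 1, 400)
y = np.where(np.abs(x) < 0.4, 1, -1)

N, L = len(x), 20
p = np.full(N, 1.0 / N)              # p_1^t = 1/N
learners, betas = [], []
for j in range(L):
    idx = rng.choice(N, N, p=p)      # draw X_j from X with probabilities p_j^t
    d = train_stump(x[idx], y[idx])  # train d_j on X_j
    y_pred = d(x)
    eps = p[y_pred != y].sum()       # weighted error rate on the full sample
    if eps > 0.5:                    # weak-learner assumption violated
        break
    beta = eps / (1 - eps)
    learners.append(d)
    betas.append(beta)
    p = np.where(y_pred == y, beta * p, p)  # shrink correctly classified
    p /= p.sum()                     # normalize

def predict(x0):
    # Weighted vote with weights log(1/beta_j).
    score = sum(np.log(1 / b) * d(x0) for d, b in zip(learners, betas))
    return np.where(score >= 0, 1, -1)
```

Because correctly classified instances have their probability multiplied by $\beta_j < 1$, the next round's sample concentrates on the instances the ensemble still gets wrong.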
The success of AdaBoost is due to its property of increasing the margin. If the margin increases, the training instances are better separated and error is less likely. This makes AdaBoost's aim similar to that of support vector machines.