Boosting

Tags: CS 229, Trees

The big idea

Small trees and small models in general are easy to train. However, they have a high bias. How do we get a stronger classifier?

We could add more features, depth, etc. Or, we could use boosting.

Boosting is the process of combining primitive models (weak learners) into an ensemble model.

More specifically, we start with some primitive model $f_1$, train it, and then update the dataset to mark where $f_1$ did the worst. Then, we train another model $f_2$ (usually of the same type) and repeat.

Boosting, formalized

We have a bunch of primitive models $f_1, \dots, f_n$ that each output $-1$ if they predict one class, or $1$ if they predict the other class. We will define the master model as

$$F(x) = \text{sign}\left(\sum_i w_i f_i(x)\right)$$

Weighing

There are two sets of weights we need to care about. First, we care about the weights of the contributing models: $w_i$. Second, we care about the weights of the evolving dataset: $\alpha_i$. If each model were trained on the same dataset, then we would just have a bunch of the same models! The key insight here is that we let the performance of the previous model CHANGE how we weigh our next dataset.

Weighing: the $\alpha_i$

Each datapoint $x_i, y_i$ is weighted by $\alpha_i$. When training, data point $i$ counts as $\alpha_i$ data points. Intuitively, this marks the "hard" spots for the next model to focus on.
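To make "counts as $\alpha_i$ data points" concrete, here's a minimal sketch (assuming scikit-learn decision stumps; the data and weights are made up): training with `sample_weight` gives the same stump as literally duplicating the up-weighted point.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy dataset; the third point gets weight 3, everything else weight 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
alpha = np.array([1.0, 1.0, 3.0, 1.0])

# Weighted fit: point 2 "counts as" 3 data points
stump_weighted = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=alpha)

# Equivalent fit where point 2 is literally repeated 3 times
X_dup = np.vstack([X, X[[2, 2]]])
y_dup = np.concatenate([y, y[[2, 2]]])
stump_dup = DecisionTreeClassifier(max_depth=1).fit(X_dup, y_dup)

print(stump_weighted.predict(X))  # same predictions either way
print(stump_dup.predict(X))
```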

Weighing: the $w_i$

We will talk more about this, but essentially the $\alpha_i$ answers "how important is this point?" and the $w_i$ answers "how important is this model's contribution?"

The Adaboost algorithm

The Adaboost algorithm is a very slick way of provably optimizing for the different weights. Here's the overview:

For $t = 1, \dots, T$:

  1. Learn $f_t$ with data weights $\alpha_i$
  2. Compute coefficient $w_t$
  3. Recompute weights $\alpha_i$ based on how the model performs now

To predict, compute

$$y = \text{sign}\left(\sum_t w_t f_t(x)\right)$$
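To tie the loop and the prediction rule together, here's a minimal sketch in Python, assuming decision stumps from scikit-learn as the weak learners $f_t$ (the function names and the choice of $T$ are illustrative, not from the notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=30):
    """AdaBoost with decision stumps. Labels y must be in {-1, +1}."""
    n = len(y)
    alpha = np.full(n, 1.0 / n)          # data weights alpha_i, start uniform
    models, w = [], []

    for t in range(T):
        # 1. Learn f_t on the alpha-weighted dataset
        f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=alpha)
        pred = f_t.predict(X)

        # 2. Compute the model weight w_t from the weighted error
        eps = np.sum(alpha * (pred != y)) / np.sum(alpha)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against a perfect (or useless) stump
        w_t = 0.5 * np.log((1 - eps) / eps)

        # 3. Recompute alpha_i: boost misclassified points, decay correct ones
        alpha = alpha * np.exp(-w_t * y * pred)
        alpha = alpha / alpha.sum()            # normalize (see the Normalization section)

        models.append(f_t)
        w.append(w_t)
    return models, np.array(w)

def adaboost_predict(models, w, X):
    """Master model: sign of the weighted vote."""
    scores = sum(w_t * f_t.predict(X) for f_t, w_t in zip(models, w))
    return np.sign(scores)
```

A usage example might look like `models, w = adaboost_train(X_train, y_train)` followed by `y_hat = adaboost_predict(models, w, X_test)`, with labels encoded as $\pm 1$.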

Computing contribution weights $w$

Let's focus on binary classifiers for the setups below.

There's a very intuitive way of calculating $w_t$: if the model had a higher error, then we should care about it less. The error we consider should be the weighted error under the $\alpha_i$ weights, because that tells us whether our weighted objective is being reached. We can summarize these points with the following equation:

$$w_t = \frac{1}{2}\ln\left(\frac{1 - \text{weighted error}(f_t)}{\text{weighted error}(f_t)}\right)$$
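To get a feel for the formula (with illustrative numbers): a model with weighted error $0.1$ gets $w_t = \frac{1}{2}\ln(0.9/0.1) \approx 1.10$; a coin-flip model with weighted error $0.5$ gets $w_t = 0$, so it contributes nothing; and a model with weighted error $0.9$ gets $w_t \approx -1.10$, so its vote is effectively flipped.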

Computing dataset weights $\alpha$

This is also pretty intuitive. If the model predicted a point correctly, we should decay the importance of that point by the model's confidence. In contrast, if the model predicted incorrectly, we should boost the importance of that point by the model's confidence.

Intuitively, if you're right and you're confident that you're right, you don't need to see that point again. However, if you're wrong and you were confident about it, then you really need to see that point again.

So:

$$\alpha_i := \begin{cases} \alpha_i \, e^{-w_t} & \text{if } f_t(x_i) = y_i \\ \alpha_i \, e^{w_t} & \text{if } f_t(x_i) \neq y_i \end{cases}$$

or, written compactly, $\alpha_i := \alpha_i \, e^{-w_t y_i f_t(x_i)}$.

As a quick refresher, the subscript on $\alpha$ corresponds to the data point, while the subscript on $w$ corresponds to the model.
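For a concrete sense of scale (illustrative numbers again): if $w_t \approx 1.10$, a correctly classified point's weight shrinks by a factor of $e^{-1.10} \approx 0.33$, while a misclassified point's weight grows by a factor of $e^{1.10} \approx 3$.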

Normalization

If we make mistakes on $x_i$ often, its $\alpha_i$ can get very large, and if we are consistently correct, then its $\alpha_i$ can get very small. This can cause numerical instability, so we normalize the weights:

$$\alpha_i := \frac{\alpha_i}{\sum_{j=1}^n \alpha_j}$$

Adaboost, visualized

(Figures omitted: the learned decision boundary early in training, and after 30 iterations.)

Adaboost: optimization theory

There is only one condition: at every $t$, we must find a weak learner with a weighted error of less than 0.5. This is usually easy, but not always possible. As an extreme example, if your positive examples and negative examples were EXACTLY the same points, you would not be able to achieve this.

Establishing the upper bound

We can show that the 0-1 training error is upper-bounded by the exponential loss:

$$\frac{1}{n}\sum_{i=1}^n \mathbb{1}\left[y_i \neq \text{sign}(\text{score}(x_i))\right] \leq \frac{1}{n}\sum_{i=1}^n e^{-y_i \, \text{score}(x_i)}$$

where $\text{score}(x) = \sum_t w_t f_t(x)$.

For a hot second, let's forget about the summation. On the left hand side, the loss is 1 exactly when the prediction and the label are not equal, which means that $y_i$ and $\text{score}(x_i)$ did not have the same sign. If we plot $y_i \cdot \text{score}(x_i)$ vs. the loss, the left hand side is a step function.

Now, let's look at the right hand side. That's just an exponential!

The exponential sits at or above that step function everywhere: it is at least 1 whenever the signs disagree, and nonnegative otherwise. So this is why we have the upper bound on the loss.
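A quick numerical sanity check of this bound (a sketch with made-up labels and scores, not from the notes): the average exponential loss always sits above the average 0-1 loss.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=1000)        # made-up labels
score = rng.normal(size=1000)             # made-up ensemble scores, sum_t w_t * f_t(x)

zero_one = np.mean(np.sign(score) != y)   # left hand side: 0-1 training error
exp_loss = np.mean(np.exp(-y * score))    # right hand side: exponential surrogate

print(zero_one, exp_loss)                 # the surrogate is always the larger number
```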

We call this a surrogate loss, and it's a neat trick whenever you're dealing with non-differentiable losses: bound them with a differentiable worst case, and show that the worst case converges!

How do we do that??

Convergence guarantees

review this

We can actually rewrite things as follows:

$$\frac{1}{n}\sum_{i=1}^n e^{-y_i \, \text{score}(x_i)} = \prod_{t=1}^T Z_t$$

where $Z_t$ is the normalizer of the data weights at step $t$.

Here, we see a change from summation to product, as well as a change in what we are operating across: $Z_t$ corresponds to a model, not a data point.

Now, as long as $Z_t < 1$, the more models you add (increasing $T$), the closer the upper bound on training error gets to zero. As such, as long as you keep this condition, you are guaranteed to converge!

If we minimize $\prod_t Z_t$, we minimize our training error. We can achieve this greedily by minimizing each individual $Z_t$. If you do the derivation out, you note that

$$Z_t = (1 - \epsilon_t)\, e^{-w_t} + \epsilon_t \, e^{w_t}$$

As such, your best bet is to minimize $Z_t$ by changing $w_t$. As it turns out, the optimal solution is

$$w_t = \frac{1}{2}\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

where $\epsilon_t$ is the weighted error of $f_t$. This explains the Adaboost weight update step!

In general, plugging the optimal $w_t$ back in gives $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$, so the training error is at most $\prod_t Z_t \leq \exp\left(-2\sum_{t=1}^T \left(\tfrac{1}{2} - \epsilon_t\right)^2\right)$. If each classifier is at least slightly better than random, then we have $\epsilon_t < 0.5$, which means that the exponent will always be negative and increasingly negative with increasing $T$.
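As a worked example with made-up numbers: if every weak learner manages a weighted error of $\epsilon_t = 0.3$, then $Z_t = 2\sqrt{0.3 \cdot 0.7} \approx 0.917$, so after $T = 50$ rounds the training error is at most $0.917^{50} \approx 0.013$.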

Boosting vs decision trees

At heart, boosting is like adding nodes to a decision tree, but you have some control over how each node contributes. The additional degrees of freedom make the boosting algorithm more robust to overfitting.

Ensemble models are useful in computer vision, and many deployed ML systems use boosting because it performs so well!

Tuning hyperparameters

Other species

Gradient boosting: like Adaboost, but generalized to arbitrary (differentiable) loss functions

Random forests: pick random subsets of the data, learn a decision tree for each subset, and average the predictions. Simpler than boosting and easier to parallelize (each tree doesn't depend on the previous trees' predictions). However, it typically has higher error than boosting for the same number of trees.
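Both of these species are available off the shelf; here's a minimal sketch using scikit-learn's implementations on toy data (the hyperparameter values are arbitrary, just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Toy data just for illustration
X, y = make_classification(n_samples=500, random_state=0)

# Boosting: trees are fit sequentially, each one correcting the previous ones
gb = GradientBoostingClassifier(n_estimators=100).fit(X, y)

# Random forest: trees are fit independently on bootstrap samples, then averaged
rf = RandomForestClassifier(n_estimators=100).fit(X, y)

print(gb.score(X, y), rf.score(X, y))
```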