ExactBoost

Work done in collaboration with Carolina Piazza, Thiago Ramos, João Vitor Romano, Roberto Imbuzeiro and Paulo Orenstein.

In machine learning we are usually concerned with optimizing reasonably neat, differentiable, decomposable losses: mean squared error, binary cross-entropy, etc. But this isn’t enough! There are important metrics we’d ultimately like to optimize, very frequent in practice, that are neither differentiable nor decomposable.

In this work, we focus on three such losses for binary classification: AUC (the area under the ROC curve), KS and P@$k$ (Precision ‘at’ $k$):

$$
\begin{aligned}
\widehat{\mathrm{AUC}}(S, y) &= 1 - \frac{1}{n_1} \sum_{y_i = 1} \frac{1}{n_0} \sum_{y_j = 0} \mathbf{1}\{S_j < S_i\} \\
\widehat{\mathrm{KS}}(S, y) &= 1 - \max_{t \in \mathbb{R}} \left( \frac{1}{n_0} \sum_{\substack{j=1 \\ y_j = 0}}^{n} \mathbf{1}\{S_j \leq t\} - \frac{1}{n_1} \sum_{\substack{i=1 \\ y_i = 1}}^{n} \mathbf{1}\{S_i \leq t\} \right) \\
\widehat{\mathrm{P@k}}(S, y) &= 1 - \frac{1}{k} \sum_{i=1}^n y_i \, \mathbf{1}\{i \in \mathcal{M}_k\}
\end{aligned}
$$

where $n_1$ and $n_0$ are the numbers of positive and negative observations, $S$ is the vector of scores, $y$ the vector of labels, and $\mathcal{M}_k$ is the set of indices achieving the highest $k$ scores.

They are non-decomposable, and non-differentiable in any useful sense: being combinatorial, they are piecewise constant in the scores, so wherever a derivative exists it is zero and carries no information for gradient-based optimization.
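To make the definitions above concrete, here is a minimal NumPy sketch of the three empirical losses, assuming `S` is a vector of real-valued scores and `y` the corresponding vector of 0/1 labels. This is just a naive reference implementation for illustration, not the optimized code used by ExactBoost:

```python
import numpy as np

def auc_loss(S, y):
    """Empirical AUC loss: one minus the fraction of (positive, negative)
    pairs in which the positive observation gets the higher score."""
    pos, neg = S[y == 1], S[y == 0]
    return 1.0 - (neg[None, :] < pos[:, None]).mean()

def ks_loss(S, y):
    """Empirical KS loss: one minus the largest gap between the empirical
    CDFs of the negative and positive scores."""
    pos, neg = S[y == 1], S[y == 0]
    thresholds = np.unique(S)  # the max over t is attained at observed scores
    cdf_neg = (neg[:, None] <= thresholds[None, :]).mean(axis=0)
    cdf_pos = (pos[:, None] <= thresholds[None, :]).mean(axis=0)
    return 1.0 - np.max(cdf_neg - cdf_pos)

def p_at_k_loss(S, y, k):
    """Empirical P@k loss: one minus the fraction of positives among
    the k highest-scored observations."""
    top_k = np.argsort(-S)[:k]
    return 1.0 - y[top_k].sum() / k
```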

The usual way to deal with this is to optimize a surrogate instead: another loss that is differentiable, decomposable and everything else we may desire, while still bearing some resemblance to our original loss. The hope is that, by optimizing this surrogate, we’ll indirectly be optimizing the loss we actually care about.
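As a hypothetical illustration (not something ExactBoost relies on), a common surrogate for the AUC loss replaces the hard indicator $\mathbf{1}\{S_j < S_i\}$ with a sigmoid of the score difference, which makes the objective differentiable:

```python
import numpy as np

def auc_sigmoid_surrogate(S, y, tau=1.0):
    """Smooth pairwise surrogate for the AUC loss: the indicator
    1{S_j < S_i} is relaxed to a sigmoid of (S_i - S_j) / tau, so the
    objective becomes differentiable; tau controls how sharp the
    approximation is. Illustrative only, not the ExactBoost objective."""
    pos, neg = S[y == 1], S[y == 0]
    diffs = pos[:, None] - neg[None, :]  # S_i - S_j over all (pos, neg) pairs
    return 1.0 - (1.0 / (1.0 + np.exp(-diffs / tau))).mean()
```

The catch is the usual one: the sharper the sigmoid, the closer the surrogate is to the exact loss but the harder it becomes to optimize, and a model that minimizes the surrogate well need not minimize the original metric well.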

ExactBoost takes a different approach: we directly optimize the exact loss of interest, despite its complexity. We show that this has practical advantages in producing quality models.

We also extend margin theory — an area that focuses on decomposable losses — to the non-decomposable setting, and use this extension to provably bound the generalization error of ExactBoost for our three losses.

ExactBoost was evaluated on 30 different datasets under two settings: as a standalone estimator (i.e., receiving the unlabeled data and yielding predicted labels) and as an ensembler (i.e., receiving other models’ predictions and yielding improved predicted labels). It proved to be a competitive estimator, on par with state-of-the-art models such as XGBoost and Random Forests; moreover, it turned out to be an exceptional ensembler, often outperforming alternative models by a significant amount.

Evaluation of ExactBoost as an ensembler, compared against alternative well-established models (5-fold cross validation). The “Exact Benchmark” column corresponds to RankBoost (for AUC), DMKS (for KS) and TopPush (for P@k). For each dataset (each row), the best performing model is highlighted. ExactBoost is generally the best performer.

By iteratively optimizing the target loss via a greedy, interval-arithmetic-based algorithm, ExactBoost handles these tricky losses effectively and efficiently: it has an overall runtime complexity of $O(pn \log n)$ (where the input data has $n$ observations and $p$ features) and memory complexity of $O(n)$.
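To give a sense of where costs like this come from, here is a hypothetical sketch (not ExactBoost’s actual interval-arithmetic update): it exactly evaluates the AUC loss of every threshold stump on a single feature with one sort and a linear sweep, i.e. $O(n \log n)$ per feature, and hence $O(pn \log n)$ when repeated over $p$ features.

```python
import numpy as np

def best_auc_stump(x, y):
    """Hypothetical sketch, not ExactBoost's update rule: exactly evaluate
    the AUC loss of every threshold stump S = 1{x > t} on one feature
    using a single sort plus a linear sweep, O(n log n) in total.
    Assumes the feature values in x are distinct, for simplicity.

    With binary stump scores, a (positive, negative) pair is correctly
    ordered exactly when the negative lies at or below the threshold and
    the positive lies above it, so the loss at each candidate threshold
    is a simple product of running counts."""
    n1, n0 = int((y == 1).sum()), int((y == 0).sum())
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]

    best_loss, best_t = 1.0, -np.inf
    neg_below = 0  # negatives with x <= t so far
    pos_below = 0  # positives with x <= t so far
    for i in range(len(x_sorted)):
        neg_below += int(y_sorted[i] == 0)
        pos_below += int(y_sorted[i] == 1)
        # threshold placed just after x_sorted[i]
        correct_pairs = neg_below * (n1 - pos_below)
        loss = 1.0 - correct_pairs / (n0 * n1)
        if loss < best_loss:
            best_loss, best_t = loss, x_sorted[i]
    return best_t, best_loss
```

ExactBoost’s actual updates are more involved than this, but the sketch shows how an exact, combinatorial loss can still be evaluated over many candidate updates at this kind of cost.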

An efficient implementation, written in C++ with Python bindings, is available at dccsillag/exactboost. Another implementation, written in Rust, is in progress.