Why CatBoost Works So Well: The Engineering Behind the Magic | Towards Data Science

April 10, 2025

9

is a cornerstone technique for modeling tabular data due to its speed and simplicity. It delivers great results without any fuss. When you look around you’ll see multiple options like LightGBM, XGBoost, etc. Catboost is one such variant. In this post, we will take a detailed look at this model, explore its inner workings, and understand what makes it a great choice for real-world tasks.

Target Statistic

Target Encoding Example: the average value of the target variable for a category is used to replace each category. Image by author

Target Encoding Example: the average value of the target variable for a category is used to replace each category

One of the important contributions of the CatBoost paper is a new method of calculating the Target Statistic. What is a Target Statistic? If you have worked with categorical variables before, you’d know that the most rudimentary way to deal with categorical variables is to use one-hot encoding. From experience, you’d also know that this introduces a can of problems like sparsity, curse of dimensionality, memory issues, etc. Especially for categorical variables with high cardinality.

Greedy Target Statistic

To avoid one-hot encoding, we calculate the Target Statistic instead for the categorical variables. This means we calculate the mean of the target variable at each unique value of the categorical variable. So if a categorical variable takes the values — A, B, C then we will calculate the average value of \(\text{y}\) over all these values and replace these values with the average of \(\text{y}\) at each unique value.

That sounds good, right? It does but this approach comes with its problems — namely Target Leakage. To understand this, let’s take an extreme example. Extreme examples are often the easiest way to eke out issues in the approach. Consider the below dataset:

Categorical Column	Target Column
A	0
B	1
C	0
D	1
E	0

Greedy Target Statistic: Compute the mean target value for each unique category

Now let’s write the equation for calculating the Target Statistic:
\[\hat{x}^i_k = \frac{
\sum_{j=1}^{n} 1_{{x^i_j = x^i_k}} \cdot y_j + a p
}{
\sum_{j=1}^{n} 1_{{x^i_j = x^i_k}} + a
}\]

Here \(x^i_j\) is the value of the i-th categorical feature for the j-th sample. So for the k-th sample, we iterate over all samples of \(x^i\), select the ones having the value \(x^i_k\), and take the average value of \(y\) over those samples. Instead of taking a direct average, we take a smoothened average which is what the \(a\) and \(p\) terms are for. The \(a\) parameter is the smoothening parameter and \(p\) is the global mean of \(y\).

If we calculate the Target Statistic using the formula above, we get:

Categorical Column	Target Column	Target Statistic
A	0	\(\frac{ap}{1+a}\)
B	1	\(\frac{1+ap}{1+a}\)
C	0	\(\frac{ap}{1+a}\)
D	1	\(\frac{1+ap}{1+a}\)
E	0	\(\frac{ap}{1+a}\)

Calculation of Greedy Target Statistic with Smoothening

Now if I use this Target Statistic column as my training data, I will get a perfect split at \( threshold = \frac{0.5+ap}{1+a}\). Anything above this value will be classified as 1 and anything below will be classified as 0. I have a perfect classification at this point, so I get 100% accuracy on my training data.

Let’s take a look at the test data. Here, since we are assuming that the feature has all unique values, the Target Statistic becomes—
\[TS = \frac{0+ap}{0+a} = p\]
If \(threshold\) is greater than \(p\), all test data predictions will be \(0\). Conversely, if \(threshold\) is less than \(p\), all test data predictions will be \(1\) leading to poor performance on the test set.

Although we rarely see datasets where values of a categorical variable are all unique, we do see cases of high cardinality. This extreme example shows the pitfalls of using Greedy Target Statistic as an encoding approach.

Leave One Out Target Statistic

So the Greedy TS didn’t work out quite well for us. Let’s try another method— the Leave One Out Target Statistic method. At first glance, this looks promising. But, as it turns out, this too has its problems. Let’s see how with another extreme example. This time let’s assume that our categorical variable \(x^i\) has only one unique value, i.e., all values are the same. Consider the below data:

Categorical Column	Target Column
A	0
A	1
A	0
A	1

Example data for an extreme case where a categorical feature has just one unique value

If calculate the leave one out target statistic, we get:

Categorical Column	Target Column	Target Statistic
A	0	\(\frac{n^+ -y_k + ap}{n+a}\)
A	1	\(\frac{n^+ -y_k + ap}{n+a}\)
A	0	\(\frac{n^+ -y_k + ap}{n+a}\)
A	1	\(\frac{n^+ -y_k + ap}{n+a}\)

Calculation of Leave One Out Target Statistic with Smoothening

Here:
\(n\) is the total samples in the data (in our case this 4)
\(n^+\) is the number of positive samples in the data (in our case this 2)
\(y_k\) is the value of the target column in that row
Substituting the above, we get:

Categorical Column	Target Column	Target Statistic
A	0	\(\frac{2 + ap}{4+a}\)
A	1	\(\frac{1 + ap}{4+a}\)
A	0	\(\frac{2 + ap}{4+a}\)
A	1	\(\frac{1 + ap}{4+a}\)

Substituing values of n and n<sup>+</sup>

Now, if I use this Target Statistic column as my training data, I will get a perfect split at \( threshold = \frac{1.5+ap}{4+a}\). Anything above this value will be classified as 0 and anything below will be classified as 1. I have a perfect classification at this point, so I again get 100% accuracy on my training data.

You see the problem, right? My categorical variable which doesn’t have more than a unique value is producing different values for Target Statistic which will perform great on the training data but will fail miserably on the test data.

Ordered Target Statistic

Illustration of ordered learning: CatBoost processes data in a randomly permuted order and predicts each sample using only the earlier samples (Image by Author) — Illustration of ordered learning: CatBoost processes data in a randomly permuted order and predicts each sample using only the earlier samples. Image by author

CatBoost introduces a technique called Ordered Target Statistic to address the issues discussed above. This is the core principle of CatBoost’s handling of categorical variables.

This method, inspired by online learning, uses only past data to make predictions. CatBoost generates a random permutation (random ordering) of the training data(\(\sigma\)). To compute the Target Statistic for a sample at row \(k\), CatBoost uses samples from row \(1\) to \(k-1\). For the test data, it uses the entire train data to compute the statistic.

Additionally, CatBoost generates a new permutation for each tree, rather than reusing the same permutation each time. This reduces the variance that can arise in the early samples.

Ordered Boosting

Another important innovation introduced by the CatBoost paper is its use of Ordered Boosting. It builds on similar principles as ordered target statistics, where CatBoost randomly permutes the training data at the start of each tree and makes predictions sequentially.

In traditional boosting methods, when training tree \(t\), the model uses predictions from the previous tree \(t−1\) for all training samples, including the one it is currently predicting. This can lead to target leakage, as the model may indirectly use the label of the current sample during training.

To address this issue, CatBoost uses Ordered Boosting where, for a given sample, it only uses predictions from previous rows in the training data to calculate gradients and build trees. For each row \(i\) in the permutation, CatBoost calculates the output value of a leaf using only the samples before \(i\). The model uses this value to get the prediction for row \(i\). Thus, the model predicts each row without looking at its label.

CatBoost trains each tree using a new random permutation to average the variance in early samples in one permutation.
Let’s say we have 5 data points: A, B, C, D, E. CatBoost creates a random permutation of these points. Suppose the permutation is: σ = [C, A, E, B, D]

Step	Data Used to Train	Data Point Being Predicted	Notes
1	—	C	No previous data → use prior
2	C	A	Model trained on C only
3	C, A	E	Model trained on C, A
4	C, A, E	B	Model trained on C, A, E
5	C, A, E, B	D	Model trained on C, A, E, B

Table highlighting how CatBoost uses random permutation to perform training

This avoids using the actual label of the current row to get the prediction thus preventing leakage.

Building a Tree

Each time CatBoost builds a tree, it creates a random permutation of the training data. It calculates the ordered target statistic for all the categorical variables with more than two unique values. For a binary categorical variable, it maps the values to zeros and ones.

CatBoost processes data as if the data is arriving sequentially. It begins with an initial prediction of zero for all instances, meaning the residuals are initially equivalent to the target values.

As training proceeds, CatBoost updates the leaf output for each sample using the residuals of the previous samples that fall into the same leaf. By not using the current sample’s label for prediction, CatBoost effectively prevents data leakage.

Split Candidates

Histogram showing how continuous features can be divided into bins—CatBoost evaluates splits using these binned values instead of raw continuous values — CatBoost bins continuous features to reduce the search space for optimal splits. Each bin edge and split point represents a potential decision threshold. Image by author

At the core of a decision tree lies the task of selecting the optimal feature and threshold for splitting a node. This involves evaluating multiple feature-threshold combinations and selecting the one that gives the best reduction in loss. CatBoost does something similar. It discretizes the continuous variable into bins to simplify the search for the optimal combination. It evaluates each of these feature-bin combinations to determine the best split

CatBoost uses Oblivious Trees, a key difference compared to other trees, where it uses the same split across all nodes at the same depth.

Oblivious Trees

Unlike standard decision trees, where different nodes can split on different conditions (feature-threshold), Oblivious Trees split across the same conditions across all nodes at the same depth of a tree. At a given depth, all samples are evaluated at the same feature-threshold combination. This symmetry has several implications:

Speed and simplicity: since the same condition is applied across all nodes at the same depth, the trees produced are simpler and faster to train
Regularization: Since all trees are forced to apply the same condition across the tree at the same depth, there is a regularization effect on the predictions
Parallelization: the uniformity of the split condition, makes it easier to parallelize the tree creation and usage of GPU to accelerate training

Conclusion

CatBoost stands out by directly tackling a long-standing challenge: how to handle categorical variables effectively without causing target leakage. Through innovations like Ordered Target Statistics, Ordered Boosting, and the use of Oblivious Trees, it efficiently balances robustness and accuracy.

If you found this deep dive helpful, you might enjoy another deep dive on the differences between Stochastic Gradient Classifer and Logistic Regression

Share this:

Like this:

Introduction

Our Mission

What We Do

Why Choose Us?

Our Team

Get Involved

Share this:

Like this:

Why CatBoost Works So Well: The Engineering Behind the Magic | Towards Data Science

Target Statistic

Greedy Target Statistic

Leave One Out Target Statistic

Ordered Target Statistic

Ordered Boosting

Building a Tree

Split Candidates

Oblivious Trees

Conclusion

Further Reading

Share this:

Like this:

LEAVE A REPLY Cancel reply

Most Popular

Recent Comments

POPULAR CATEGORY

ABOUT US

FOLLOW US