
Boosting algorithms have become some of the most powerful algorithms for training on structured (tabular) data. The three most famous boosting implementations, which have provided various recipes for winning ML competitions, are:

  • CatBoost
  • XGBoost
  • LightGBM

In this article, we will primarily focus on CatBoost, how it fares against other algorithms and when you should choose it over others.


Overview of gradient boosting

To understand boosting, we must first understand ensemble learning, a set of techniques that combine the predictions from multiple models (weak learners) to get better predictive performance. Its strategy is simply strength in unity: efficient combinations of weak learners can generate more accurate and robust models. The three main classes of ensemble learning methods are:

  • Bagging: This technique builds different models in parallel using random subsets of data and deterministically aggregates the predictions of all predictors.
  • Boosting: This technique is iterative, sequential, and adaptive as each predictor fixes its predecessor’s error.
  • Stacking: It is a meta-learning technique that involves combining predictions from multiple machine learning algorithms, like bagging and boosting.

In 1988, Michael Kearns, in his paper Thoughts on Hypothesis Boosting, posed the question of whether a relatively poor hypothesis can be converted into a very good one. In essence, the question is whether a weak learner can be modified to become better. Since then, there have been multiple successful applications of the technique to develop some powerful boosting algorithms.

The most popular boosting algorithms: Catboost, XGBoost, LightGBM | Source: Author

The three algorithms in scope (CatBoost, XGBoost, and LightGBM) are all variants of gradient boosting algorithms. A good understanding of gradient boosting will be beneficial as we progress. Gradient boosting algorithms can be a Regressor (predicting continuous target variables) or a Classifier (predicting categorical target variables).

This technique trains learners by minimizing a differentiable loss function with a gradient descent optimization process, in contrast to tweaking the weights of the training instances as Adaptive Boosting (AdaBoost) does. Hence there is an equal distribution of weights to all the learners. Gradient boosting uses decision trees connected in series as weak learners. Due to its sequential architecture, it is a stage-wise additive model, where decision trees are added one at a time, and existing decision trees are not changed.

Gradient boosting is primarily used to reduce the bias error of the model. Based on the bias-variance tradeoff, it is a greedy algorithm that can overfit a training dataset quickly. However, this overfitting can be controlled by shrinkage, tree constraint, regularization, and stochastic gradient boosting.
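To make the stage-wise idea concrete, here is a minimal sketch of gradient boosting for squared loss on a single feature. It is an illustration only, not how CatBoost, XGBoost, or LightGBM are implemented: the weak learner is a depth-1 tree (a stump), and all function names and the toy data are made up for this example.

```python
# Minimal gradient boosting for squared loss with depth-1 trees (stumps).
# Illustrative only: real libraries add regularization, subsampling,
# and far more efficient split finding.

def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Stage-wise additive model: each stump fits the current residuals."""
    base = sum(ys) / len(ys)              # initial constant prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared loss, the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)              # earlier stumps are never modified
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps


def predict(model, x, learning_rate=0.3):
    base, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
model = gradient_boost(xs, ys)

mean_y = sum(ys) / len(ys)
baseline_mse = sum((mean_y - y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(baseline_mse, boosted_mse)
```

The learning rate here plays the role of shrinkage: each stump only contributes a fraction of its fit, which is one of the overfitting controls mentioned above.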

Overview of CatBoost

CatBoost is an open-source machine learning (gradient boosting) algorithm, with its name coined from “Category” and “Boosting.” It was developed by Yandex (Russian Google 😁) in 2017. According to Yandex, CatBoost has been applied to a wide range of areas such as recommendation systems, search ranking, self-driving cars, forecasting, and virtual assistants. It is the successor of MatrixNet, which was widely used within Yandex products.

Key features of CatBoost

Let’s take a look at some of the key features that make CatBoost better than its counterparts:

  1. Symmetric trees: CatBoost builds symmetric (balanced) trees, unlike XGBoost and LightGBM. In every step, leaves from the previous tree are split using the same condition. The feature-split pair that accounts for the lowest loss is selected and used for all the level’s nodes. This balanced tree architecture aids in efficient CPU implementation, decreases prediction time, enables fast model appliers, and controls overfitting as the structure serves as regularization.
Asymmetric tree vs symmetric tree | Source: Author
  2. Ordered boosting: Classic boosting algorithms are prone to overfitting on small or noisy datasets due to a problem known as prediction shift. When calculating the gradient estimate for a data instance, these algorithms use the same data instances that the model was built with, so the estimate is never based on unseen data. CatBoost, on the other hand, uses ordered boosting, a permutation-driven approach that trains the model on one subset of the data while calculating residuals on another subset, thus preventing target leakage and overfitting.
  3. Native feature support: CatBoost supports all kinds of features, be they numeric, categorical, or text, which saves the time and effort of preprocessing.
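To see why symmetric trees from the first item make prediction fast, consider this small sketch. Because every node on a level shares the same (feature, threshold) condition, finding a leaf reduces to packing one comparison bit per level into an integer index. The tree below is hypothetical and the function name is ours; it is not CatBoost's internal code.

```python
# A symmetric (oblivious) tree of depth d uses one (feature, threshold) pair
# per level, so a leaf index is just d comparison bits packed into an integer.

def oblivious_tree_predict(splits, leaf_values, row):
    """splits: list of (feature_index, threshold), one per level.
    leaf_values: list of 2 ** len(splits) leaf outputs."""
    leaf_index = 0
    for feature_index, threshold in splits:
        bit = 1 if row[feature_index] > threshold else 0
        leaf_index = (leaf_index << 1) | bit
    return leaf_values[leaf_index]


# Hypothetical depth-2 tree: level 0 tests feature 0, level 1 tests feature 1.
splits = [(0, 5.0), (1, 0.5)]
leaf_values = [-1.0, -0.2, 0.3, 1.5]

print(oblivious_tree_predict(splits, leaf_values, [7.0, 0.1]))  # 0.3
```

There is no branching on tree structure at all, only a fixed sequence of comparisons, which is what makes this layout friendly to vectorized CPU and GPU evaluation.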

Numerical features

CatBoost handles numeric features like other tree-based algorithms, i.e. by selecting the best possible split based on the information gain.

Numerical features | Source: Author

Categorical features

Decision trees split categorical features based on classes rather than a threshold in continuous variables. The split criterion is intuitive as the classes are divided into sub-nodes.

Categorical features | Source: Author

Categorical features get more complex when they have high cardinality, as with ‘id’ features. Every machine learning algorithm requires input and output variables in numerical form, and CatBoost provides several native strategies to handle categorical variables:

  • One-hot encoding: By default, CatBoost represents all binary (two-category) features with one-hot encoding. This strategy can be extended to features with up to N categories by setting the training parameter one_hot_max_size = N. Because you specify the categorical features and CatBoost performs the one-hot encoding itself, it can yield better and faster results.
  • Statistics based on category: CatBoost applies target encoding with random permutation to handle categorical features. This strategy can be very efficient for high cardinality columns as it creates just a new feature to account for the category encoding. The addition of random permutation to the encoding strategy is to prevent overfitting due to data leakage and feature bias. You can read about this in detail here.
  • Greedy search for combination: CatBoost also automatically combines categorical features, most times two or three. To keep possible combinations limited, CatBoost does not enumerate through all the combinations but rather some of the best, using statistics like category frequency. So, for each tree split, CatBoost adds all categorical features (and their combinations) already used for previous splits in the current tree with all categorical features in the dataset.
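The "statistics based on category" strategy above can be sketched in a few lines. This is a simplified illustration under our own assumptions (a single permutation, a binary target, a fixed prior of 0.5, and made-up data); CatBoost's actual implementation is more elaborate. Each row is encoded using only the rows that come before it in a random permutation, so a row's own target never leaks into its encoding.

```python
import random


def ordered_target_statistics(categories, targets, prior=0.5, seed=0):
    """Encode each row using only rows that precede it in a random permutation."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        # (targets seen so far + prior) / (rows seen so far + 1)
        encoded[i] = (sums.get(cat, 0.0) + prior) / (counts.get(cat, 0) + 1)
        # Only now does row i's own target become visible to later rows.
        sums[cat] = sums.get(cat, 0.0) + targets[i]
        counts[cat] = counts.get(cat, 0) + 1
    return encoded


# Hypothetical data: origin city and whether the flight was delayed.
cities = ["NYC", "LA", "NYC", "NYC", "LA", "SF"]
delayed = [1, 0, 1, 0, 1, 0]
encoded = ordered_target_statistics(cities, delayed)
print(encoded)
```

Note that a category seen for the first time falls back to the prior, which is why the single "SF" row is encoded as 0.5 regardless of its own target.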

Text features

CatBoost also handles text features (containing regular text) by providing inherent text preprocessing using Bag-of-Words (BoW), Naive Bayes, and BM25 (for multiclass) to extract words from text data, create dictionaries (letters, words, n-grams), and transform them into numeric features. This text transformation is fast, customizable, production-ready, and can be used with other libraries too, including neural networks.

  4. Ranking: Ranking techniques are applied mainly in search engines to solve search relevancy problems. Ranking can be broadly done under three objective functions: Pointwise, Pairwise, and Listwise. At a high level, the difference between these three objective functions is the number of instances under consideration at the time of training your model.

CatBoost has a ranking mode, CatBoostRanker, just like the XGBoost and LightGBM rankers; however, it provides many more powerful variations than XGBoost and LightGBM. The variations are:

  • Ranking (YetiRank, YetiRankPairwise)
  • Pairwise (PairLogit, PairLogitPairwise)
  • Ranking + Classification (QueryCrossEntropy)
  • Ranking + Regression (QueryRMSE)
  • Select top 1 candidate (QuerySoftMax)
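To illustrate the difference between pointwise and pairwise objectives mentioned above, here is a small sketch for a single query. The pairwise loss is a generic logistic loss over document pairs, similar in spirit to pair-based objectives such as PairLogit, but it is our own simplified version and not CatBoost's implementation. The function names and scores are made up.

```python
import math


def pointwise_squared_loss(scores, labels):
    """Pointwise: each document is compared against its own label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_logit_loss(scores, labels):
    """Pairwise: penalize every pair in which the more relevant document
    does not receive a clearly higher score than the less relevant one."""
    loss, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                loss += math.log(1.0 + math.exp(-(scores[i] - scores[j])))
                n_pairs += 1
    return loss / n_pairs


# One query with three documents; labels are relevance grades.
labels = [2, 1, 0]
good_scores = [3.0, 1.0, -1.0]   # same order as the labels
bad_scores = [-1.0, 1.0, 3.0]    # reversed order

print(pairwise_logit_loss(good_scores, labels))
print(pairwise_logit_loss(bad_scores, labels))
```

The pairwise loss only cares about the relative order of documents within a query, not about matching the label values themselves, which is why it is a more natural fit for ranking.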

CatBoost also provides ranking benchmarks comparing CatBoost, XGBoost, and LightGBM with different ranking variations, which include:

  • CatBoost: RMSE, QueryRMSE, PairLogit, PairLogitPairwise, YetiRank, YetiRankPairwise
  • XGBoost: reg:linear, xgb-lmart-ndcg, xgb-pairwise
  • LightGBM: lgb-rmse, lgb-pairwise

The benchmark evaluation used four top ranking datasets:

  1. Million queries dataset from TREC 2008, MQ2008, (train and test folds).
  2. Microsoft LETOR dataset (WEB-10K), MSLR (First set, train, and test folds).
  3. Yahoo LETOR dataset (C14), Yahoo (First set, set1.train.txt and set1.test.txt files).
  4. Yandex LETOR dataset, Yandex (features.txt.gz and featuresTest.txt.gz files).

The results were as follows using the mean NDCG metric for performance evaluation:

It can be seen that CatBoost outperforms LightGBM and XGBoost in all cases. More details of the ranking mode variations and their respective performance metrics can be found in the CatBoost documentation here. These techniques can be run on both CPU and GPU.
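For readers unfamiliar with the NDCG metric used in these benchmarks, the following sketch computes it for one query. It assumes the common exponential-gain formulation (gain of 2^relevance − 1 with a log2 position discount); benchmark implementations may differ in details such as the gain function or a top-k cutoff. The mean NDCG is then simply the average over all queries.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(scores, relevances):
    """NDCG for one query: DCG of the model's ordering / DCG of the ideal ordering."""
    ranked = [rel for _, rel in sorted(zip(scores, relevances),
                                       key=lambda pair: pair[0], reverse=True)]
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0


def mean_ndcg(queries):
    """queries: list of (scores, relevances) tuples, one per query."""
    return sum(ndcg(s, r) for s, r in queries) / len(queries)


perfect = ndcg([3.0, 2.0, 1.0], [2, 1, 0])    # model order matches relevance
reversed_ = ndcg([1.0, 2.0, 3.0], [2, 1, 0])  # most relevant document ranked last
print(perfect, reversed_)
```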

  5. Speed: CatBoost provides scalability by supporting multi-server distributed GPUs (enabling multiple hosts for accelerated learning) and accommodating older GPUs. It has published CPU and GPU training speed benchmarks on large datasets like Epsilon and Higgs. Its prediction time came out faster than that of XGBoost and LightGBM, which is extremely important for low-latency environments.
Dataset Epsilon (400K samples, 2000 features). Parameters: 128 bins, 64 leafs, 400 iterations | Source: Author
Dataset Higgs (4M samples, 28 features). Parameters: 128 bins, 64 leafs, 400 iterations | Source: Author
Prediction time on CPU and GPU respectively on the Epsilon dataset | Source: Author

 

  6. Model analysis: CatBoost provides inherent model analysis tools to help understand, diagnose, and refine machine learning models with the help of efficient statistics and visualization. Some of them are:

Feature importance

CatBoost has some intelligent techniques for finding the best features for a given model:

  • PredictionValuesChange: This shows how much, on average, the prediction changes when the feature value changes. The bigger the average change in prediction caused by a feature, the higher its importance. Feature importance values are normalized so that they are non-negative and the importances of all features sum to 100. It is easy to compute but can lead to misleading results for ranking problems.
Feature Importance based on PredictionValuesChange | Source: Author
  • LossFunctionChange: This is a computationally heavy technique that gets feature importance by taking the difference between the loss function of a model that includes a given feature and that of the model without the feature. The higher the difference, the more important the feature.
Feature Importance based on LossFunctionChange | Source: Author
  • InternalFeatureImportance: This technique calculates values for each input feature and for various feature combinations, using the split values in the nodes on the paths to the symmetric tree’s leaves.
Pairwise feature importance | Source: Author
  • SHAP: CatBoost uses SHAP (SHapley Additive exPlanations) to break a prediction value into contributions from each feature. It calculates feature importance by measuring the impact of a feature on a single prediction value compared to the baseline prediction. This technique provides visual explanations of features that make the most impact on your model’s decision-making. SHAP can be applied in two ways:

Per data instance

First prediction explanation (Waterfall plot) | Source: Author

The above visualization shows the features pushing the model output from the base value (the average model output over the training dataset) to the model output. The red features are the ones pushing the prediction higher, while the blue features push the prediction lower. This concept can be visualized using the force plot.

First prediction explanation (Force plot) | Source: Author
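The "base value plus per-feature contributions equals the model output" idea can be checked with a brute-force Shapley value computation on a tiny model. This is only a sketch: it enumerates every feature subset, which is feasible for three features but not for real models, and SHAP libraries use much faster tree-specific algorithms. The toy model, instance, and background rows are made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, background):
    """Exact Shapley values by enumerating all feature subsets."""
    n = len(instance)

    def value(subset):
        # Features in `subset` take the instance's values; the remaining
        # features are averaged over the background rows.
        total = 0.0
        for row in background:
            mixed = [instance[i] if i in subset else row[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        contributions.append(phi)
    return value(set()), contributions


def toy_model(x):
    return 2 * x[0] + x[1] * x[2]


background = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
instance = [3.0, 1.0, 4.0]

base_value, contributions = shapley_values(toy_model, instance, background)
print(base_value, contributions, toy_model(instance))
```

Positive contributions correspond to the red features pushing the prediction higher in the plots above, and negative ones to the blue features pushing it lower.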

Whole dataset

SHAP provides plotting capabilities to highlight the most important features of a model. The plot sorts features by the sum of SHAP value magnitudes over all data instances and uses SHAP values to highlight the impact distribution of each feature on the model output.

Summarized effects of all the features | Source: Author

Feature analysis chart

This is another unique feature that CatBoost has integrated into its recent version. This functionality provides calculated and plotted feature-specific statistics and visualizes how CatBoost is splitting the data for each feature. More specifically, the statistics are:

  • Mean target value for each bin (bins groups continuous feature) or category (supported currently for only One-Hot Encoded features).
  • Mean prediction value for each bin
  • Number of data instances (object) in each bin
  • Predictions for various feature values
Statistics of feature | Source: Author
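The first three statistics in the list above are straightforward to compute by hand, which helps when interpreting the chart. The sketch below bins one numeric feature by a set of borders and reports the object count, mean target, and mean prediction per bin. The function name, borders, and data are hypothetical; this is not CatBoost's own routine.

```python
def feature_statistics(values, targets, predictions, borders):
    """Per-bin object count, mean target, and mean prediction for one numeric feature."""
    n_bins = len(borders) + 1
    counts = [0] * n_bins
    target_sums = [0.0] * n_bins
    prediction_sums = [0.0] * n_bins
    for value, target, prediction in zip(values, targets, predictions):
        # The bin index is the number of borders that lie below the value.
        bin_index = sum(value > border for border in borders)
        counts[bin_index] += 1
        target_sums[bin_index] += target
        prediction_sums[bin_index] += prediction
    return [
        {"count": c,
         "mean_target": t / c if c else None,
         "mean_prediction": p / c if c else None}
        for c, t, p in zip(counts, target_sums, prediction_sums)
    ]


# Hypothetical feature values, binary targets, and model predictions.
distance = [1, 2, 6, 7, 11]
delayed = [0, 1, 1, 1, 0]
predicted = [0.2, 0.6, 0.8, 0.9, 0.3]

stats = feature_statistics(distance, delayed, predicted, borders=[5, 10])
for row in stats:
    print(row)
```

A large gap between the mean target and the mean prediction in a bin is a hint that the model fits that range of the feature poorly.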

CatBoost parameters

CatBoost shares common training parameters with XGBoost and LightGBM but provides a much more flexible interface for parameter tuning. The following table provides a quick comparison of parameters offered by the three boosting algorithms.

Function | CatBoost | XGBoost | LightGBM
---------|----------|---------|---------
Parameters controlling overfitting | learning_rate, depth, l2_leaf_reg | learning_rate, max_depth, min_child_weight | learning_rate, max_depth, num_leaves, min_data_in_leaf
Parameters for handling categorical values | cat_features, one_hot_max_size | (none listed) | categorical_feature
Parameters for controlling speed | rsm, iterations | colsample_bytree, subsample, n_estimators | feature_fraction, bagging_fraction, num_iterations

Also, as evident from the following image, CatBoost’s default parameters provide an excellent baseline model, noticeably better than those of the other boosting algorithms.

Log loss values (lower is better) for Classification mode. The percentage is metric difference measured against tuned CatBoost results | Source: Author

You can read all about CatBoost’s parameters here. These parameters control overfitting, categorical features, and speed.

Other useful features

  • Overfitting detector: CatBoost’s algorithm structure inhibits gradient boosting biases and overfitting. In addition, CatBoost has an overfitting detector that can stop training earlier than the training parameters dictate if overfitting occurs. CatBoost implements overfitting detection using two strategies:
    • Iter: Consider the model overfitted and stop training after the specified number of iterations since the iteration with the optimal metric value. This strategy corresponds to the early_stopping_rounds parameter found in other gradient boosting libraries such as LightGBM and XGBoost.
    • IncToDec: Ignore the overfitting detector when the threshold is reached and continue learning for the specified number of iterations after the iteration with the optimal metric value.
    The overfitting detector is activated by setting “od_type” in the parameters, which helps produce more generalized models.
  • Missing value support: CatBoost provides three inherent strategies for processing missing values:
    • “Forbidden”: Missing values are not supported, and their presence is interpreted as an error.
    • “Min”: Missing values are processed as the minimum value (less than all other values) for the feature under observation.
    • “Max”: Missing values are processed as the maximum value (greater than all other values) for the feature under observation.
    CatBoost supports these strategies for numerical features only, and the default mode is Min.
  • CatBoost viewer: In addition to the CatBoost model analysis tool, CatBoost has a standalone executable application for plotting charts with different training statistics in a browser.
  • Cross-validation: CatBoost lets you perform cross-validation on the given dataset. In cross-validation mode, the training data is split into folds for learning and evaluation.
  • Community support: CatBoost has a vast and growing open-source community that provides a lot of tutorials on theories and applications.
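The logic of the Iter overfitting detector from the list above can be sketched without any library. Given the validation loss after each iteration, training stops once a chosen number of iterations has passed without improving on the best loss, and the best iteration is the one kept. The function name, the `wait` argument, and the loss values are ours, for illustration only.

```python
def iter_overfitting_detector(validation_losses, wait):
    """Return (best_iteration, stopped_at) for per-iteration validation losses.

    Training stops once `wait` iterations have passed since the best loss.
    """
    best_iteration, best_loss = 0, float("inf")
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_iteration, best_loss = iteration, loss
        elif iteration - best_iteration >= wait:
            return best_iteration, iteration
    return best_iteration, len(validation_losses) - 1


# Hypothetical validation curve: improves, then drifts upward for a while.
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.54]

print(iter_overfitting_detector(losses, wait=3))   # (2, 5)
print(iter_overfitting_detector(losses, wait=10))  # (6, 6)
```

The example also shows the tradeoff in choosing the waiting period: with a short wait, training stops at iteration 5 and never reaches the later, slightly better loss at iteration 6.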

CatBoost vs XGBoost and LightGBM: hands-on comparison of performance and speed

The previous sections covered some of CatBoost’s features that serve as strong criteria for choosing CatBoost over LightGBM and XGBoost. This section is hands-on: we compare performance and speed using a flight delay prediction problem.

Dataset and environment

The dataset contains on-time performance data of domestic flights operated by large air carriers in 2015, provided by the U.S. Department of Transportation (DOT), and can be found on Kaggle. This comparative analysis explores and models flight delay with the available independent features using CatBoost, LightGBM, and XGBoost. A subset (25%) of this data was used for modeling, and the generated models will be evaluated using the ROC AUC score. The analysis will cover default and tuned settings while measuring training time, prediction time, and parameter tuning time.

For ease of comparison, we will be using Neptune, a metadata store for MLOps, built for projects that may involve a lot of experiments. Specifically, we will be using Neptune to log each run’s metadata and to compare the results on a dashboard.

So, without further ado, let’s get started!

First, we have to install the required libraries.

!pip install catboost
!pip install xgboost
!pip install lightgbm
!pip install neptune-client

Import the installed libraries.

import lightgbm as lgb
import xgboost as xgb
import catboost as cb

import timeit
import pandas as pd
import numpy as np
import neptune.new as neptune

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score

Set up the Neptune client to log the project’s metadata appropriately. You can read more about it here.

import neptune.new as neptune

run = neptune.init(project='',
                  api_token='')

The data preprocessing and wrangling operations can be found in the reference notebook. We will be using 30% of the data as the test set.

# Separate the features from the target column
X = data_df.drop(columns=['ARRIVAL_DELAY'])
y = data_df['ARRIVAL_DELAY']

# Hold out 30% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021, test_size=0.30)

Models

Next, let’s define the metric evaluation function and model execution function. The metric evaluation function logs the ROC AUC score.

def metrics(run, y_pred_test):
   score = roc_auc_score(y_test, y_pred_test)
   run['ROC AUC score'] = score
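As a reminder of what the ROC AUC score measures, here is a small from-scratch sketch (the function name and numbers are ours). AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, so it is most informative when computed from predicted probabilities or scores; passing hard 0/1 class labels collapses the ranking and can understate the model's quality.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels and predicted probabilities of the positive class.
labels = [0, 0, 1, 1, 0, 1]
probabilities = [0.1, 0.4, 0.45, 0.8, 0.3, 0.9]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = roc_auc(labels, probabilities)
auc_from_hard_labels = roc_auc(labels, hard_labels)
print(auc_from_probabilities, auc_from_hard_labels)
```

In this toy case the probabilities rank every positive above every negative, yet thresholding them at 0.5 first lowers the measured AUC.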

Now on to the model execution function which accepts four main arguments:

  • model: The respective machine learning models generated i.e. the LightGBM, XGBoost and CatBoost
  • description: The description of the model execution instance
  • key: The key specifies the model training setup, especially the categorical feature parameters to be implemented
  • cat_features: the categorical features, given as column names (for LightGBM) or column indices (for CatBoost)

The function calculates and logs the metadata including description, training time, prediction time, and ROC AUC score.

def run_model(run, model, description, key, cat_features=None):
    """Fit `model`, then log its description, timings, and ROC AUC to `run`."""
    run["Description"] = description

    # Library-specific keyword arguments for fit()
    if key == 'LGB':
        fit_kwargs = {'categorical_feature': cat_features or 'auto'}
    elif key == 'CAT':
        # Note: using the test set as eval_set lets it influence model selection
        fit_kwargs = {'eval_set': (X_test, y_test),
                      'cat_features': cat_features,
                      'use_best_model': True}
    else:  # XGBoost
        fit_kwargs = {}

    # Training time
    start = timeit.default_timer()
    model.fit(X_train, y_train, **fit_kwargs)
    run['Training time'] = timeit.default_timer() - start

    # Prediction time; ROC AUC needs scores, so use the positive-class probability
    start = timeit.default_timer()
    y_pred_test = model.predict_proba(X_test)[:, 1]
    run['Prediction time'] = timeit.default_timer() - start

    metrics(run, y_pred_test)

Let’s run the function with the respective models in two settings:

1. CatBoost vs XGBoost vs LightGBM: default hyperparameters

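# `run` is the Neptune run created earlier; initializing a new run per model
# keeps each model's logged results separate on the dashboard.

# LightGBM without and with native categorical handling (cat_cols holds column names)
model_lgb_def = lgb.LGBMClassifier()
run_model(run, model_lgb_def, 'Default LightGBM without categorical support', key='LGB')

model_lgb_cat_def = lgb.LGBMClassifier()
run_model(run, model_lgb_cat_def, 'Default LightGBM with categorical support', key='LGB', cat_features=cat_cols)

# XGBoost
model_xgb_def = xgb.XGBClassifier()
run_model(run, model_xgb_def, 'Default XGBoost', key='XGB')

# CatBoost without and with native categorical handling (column indices)
model_cat_def = cb.CatBoostClassifier()
run_model(run, model_cat_def, 'Default CatBoost without categorical support', key='CAT')

model_cat_cat_def = cb.CatBoostClassifier()
cat_features_index = [3, 4, 5]
run_model(run, model_cat_cat_def, 'Default CatBoost with categorical support', 'CAT', cat_features_index)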

Comparative analysis based on the default setting of the LightGBM, XGBoost, and CatBoost algorithms can be viewed on your Neptune dashboard.

Default setting comparative analysis | Source: Neptune

Results: default setting

As evident from the dashboard:

  • CatBoost had the fastest prediction time without categorical support, but its prediction time increased substantially with categorical support.
  • CatBoost also had the best score for the AUC metric on the test data (the higher the AUC score, the better the model is at distinguishing between the classes).
  • XGBoost had the lowest ROC AUC score with default settings and a longer training time than LightGBM. However, its prediction time was fast (the second-fastest among the default setting runs).
  • LightGBM outperformed every other model in training time.

2. CatBoost vs XGBoost vs LightGBM: tuned hyperparameters

Following are the tuned hyperparameters that we will be using in this run. The selected parameters are quite similar between the three algorithms:

  • ‘max_depth’ and ‘depth’ control the depth of the tree models.
  • ‘learning_rate’ scales the contribution of each new tree and determines how fast the model learns.
  • ‘n_estimators’ and ‘iterations’ set the number of trees (boosting rounds).
  • CatBoost’s ‘l2_leaf_reg’ is the L2 regularization coefficient, which discourages learning a more complex or flexible model to prevent overfitting.
  • LightGBM’s ‘num_leaves’ is the maximum number of leaves per tree, and XGBoost’s ‘min_child_weight’ is the minimum sum of instance weights (hessian) required in a child node.

These parameters were tuned to control overfitting and learning speed.

LightGBM | XGBoost | CatBoost
---------|---------|---------
max_depth: 7 | max_depth: 5 | depth: 10
learning_rate: 0.08 | min_child_weight: 6 | learning_rate: 0.5
num_leaves: 100 | n_estimators: 1000 | l2_leaf_reg: 5
n_estimators: 1000 | learning_rate: 0.08 | iterations: 1000

The hyperparameter tuning section can be found in the reference notebook.
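The reference notebook is not reproduced here, but the general shape of an exhaustive grid search (the approach suggested by the GridSearchCV import above) can be sketched in plain Python. Everything in this block is an assumption for illustration: the parameter grid is made up, and `toy_evaluate` is a stand-in for "train a model with these parameters and return its cross-validated ROC AUC".

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    # Hypothetical score surface that peaks at depth=6, learning_rate=0.1.
    return 1.0 - abs(params["depth"] - 6) * 0.01 - abs(params["learning_rate"] - 0.1)


param_grid = {
    "depth": [4, 6, 8, 10],
    "learning_rate": [0.03, 0.1, 0.3],
}

best_params, best_score = grid_search(param_grid, toy_evaluate)
print(best_params, best_score)
```

The number of model fits grows multiplicatively with each added parameter (here 4 × 3 = 12 combinations, times the number of cross-validation folds in a real search), which is why tuning time is worth measuring alongside training time.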

Now let’s run these models with the aforementioned tuned settings.

# LightGBM
params = {"max_depth": 7, "learning_rate": 0.08, "num_leaves": 100, "n_estimators": 1000}

model_lgb_tun = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metric='auc', **params)
run_model(run, model_lgb_tun, 'Tuned LightGBM without categorical support', 'LGB')

model_lgb_cat_tun = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metric='auc', **params)
run_model(run, model_lgb_cat_tun, 'Tuned LightGBM with categorical support', 'LGB', cat_cols)

# XGBoost (learning_rate of 0.08, as listed in the table of tuned values)
params = {"max_depth": 5, "learning_rate": 0.08, "min_child_weight": 6, "n_estimators": 1000}

model_xgb_tun = xgb.XGBClassifier(**params)
run_model(run, model_xgb_tun, 'Tuned XGBoost', 'XGB')

# CatBoost
params = {"depth": 10, "learning_rate": 0.5, "iterations": 1000, "l2_leaf_reg": 5}

model_cat_tun = cb.CatBoostClassifier(**params)
run_model(run, model_cat_tun, 'Tuned CatBoost without categorical support', key='CAT')

model_cat_cat_tun = cb.CatBoostClassifier(**params)
cat_features_index = [3, 4, 5]
run_model(run, model_cat_cat_tun, 'Tuned CatBoost with categorical support', 'CAT', cat_features_index)

Again, the comparative analysis based on the tuned settings can be viewed on your Neptune dashboard.

Tuned setting comparative analysis | Source: Neptune

Results: tuned setting

As evident from the dashboard:

  • CatBoost still retained the fastest prediction time and the best performance score with categorical feature support. However, CatBoost’s internal handling of categorical data gave it the slowest training time.
  • Despite the hyperparameter tuning, the difference between the default and tuned results is small, which highlights the fact that CatBoost’s default settings yield a great result.
  • XGBoost’s performance increased with tuned settings. However, it produced only the fourth-best ROC AUC score, and its training time and prediction time got worse.
  • LightGBM still had the fastest training time as well as the fastest parameter tuning time. However, CatBoost is a great choice if you are willing to trade faster training time for better performance.

Conclusion

CatBoost’s algorithmic design might be similar to the “older” generation of GBDT models; however, it has some key attributes such as:

  • ranking objective function
  • native categorical features preprocessing
  • model analysis
  • fastest prediction time

CatBoost also provides significant performance potential: it performs remarkably well with default parameters and can improve further when tuned. This article aimed to help you decide when to choose CatBoost over LightGBM or XGBoost by discussing these crucial features and the advantages they offer. I hope you now have a good idea about this, and the next time you are faced with such a choice, you will be able to make an informed decision.

That’s all for now!

