A Comparative Study of CatBoost and Double Random Forest
for Multi-class Classification
Dublin Core
Title
A Comparative Study of CatBoost and Double Random Forest
for Multi-class Classification
Subject
catboost, double random forest, multi-class classification, lime
Description
Multi-class classification poses challenges beyond those of binary classification, mainly because the interactions
between explanatory and response variables become increasingly complex. Ensemble-based methods such as boosting and random
forest (RF) have proven effective for classification problems. We study multi-class classification
using CatBoost, a method built on gradient boosting, and double random forest (DRF), an extension of RF suited
to cases where the resulting RF model underfits. The analysis uses simulated and empirical data. In the
simulation study, we generate data according to the distance between classes: high, medium, and low. The empirical data are
industrial classification codes, namely KBLI. At a high distance, CatBoost and DRF both solve the multi-class classification
problem correctly, with a balanced accuracy score of 100%. At a medium distance, CatBoost and DRF achieve balanced
accuracy scores of 99.25% and 97.54%, respectively, compared with 32.37% and 23.97% at a low distance. In the empirical study,
CatBoost outperforms DRF by 4.27%. All differences are statistically significant according to t-tests.
We also use LIME to explain individual CatBoost predictions and to identify the words that contribute most to the
prediction of an example class.
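The description reports results as balanced accuracy scores. As a minimal sketch, assuming the common definition of balanced accuracy as the mean of per-class recall (the record does not state which definition the authors used), the metric can be computed as below. The function name, labels, and toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true (assumed definition)."""
    total = defaultdict(int)    # number of true samples per class
    correct = defaultdict(int)  # number of correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy data: class "a" dominates, class "b" is always missed.
y_true = ["a"] * 6 + ["b", "c"]
y_pred = ["a"] * 6 + ["a", "c"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
print(f"accuracy={plain:.3f} balanced_accuracy={score:.3f}")
```

Under this definition every class carries equal weight, so a class that is always missed lowers the score even when it holds few samples, which is why the metric suits multi-class problems with unequal class sizes.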
Creator
Annisarahmi Nur Aini Aldania, Agus Mohamad Soleh, Khairil Anwar Notodiputro
Publisher
IPB University
Date
03-02-2023
Contributor
Fajar bagus W
Format
PDF
Language
Indonesia
Type
Text
Files
Collection
Citation
Annisarahmi Nur Aini Aldania, Agus Mohamad Soleh, Khairil Anwar Notodiputro, “A Comparative Study of CatBoost and Double Random Forest for Multi-class Classification,” Repository Horizon University Indonesia, accessed June 6, 2025, https://repository.horizon.ac.id/items/show/9354.