Publication:
Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets

Publication date
2020-07-01
Publisher
Elsevier
Abstract
Hellinger Distance (HD) is a splitting metric that has been shown to perform excellently on imbalanced classification problems for methods based on bagging of trees, while also performing well on balanced problems. Given that Random Forests (RF) use bagging as one of the two fundamental techniques for creating diversity in the ensemble, HD can be expected to be effective for this ensemble method as well. The main aim of this article is to carry out an extensive investigation of important aspects of the use of HD in RF, including the handling of multi-class problems, hyper-parameter optimization, metric comparison, probability estimation, and metric combination. In particular, HD is compared to other commonly used splitting metrics (Gini and Gain Ratio) in several contexts: balanced/imbalanced and two-class/multi-class. Two aspects of classification problems are assessed: classification itself and probability estimation. HD is defined for two-class problems, but it can be extended to multi-class problems in several ways, and this article studies the performance of the available options. Finally, even though HD can be used as an alternative to other splitting metrics, there is no reason to limit RF to just one of them; the final study of this article therefore determines whether selecting the splitting metric by cross-validation on the training data can improve results further. Results show HD to be a robust measure for RF, with some weakness on balanced multi-class datasets (especially for probability estimation). Combining metrics yields more robust performance. However, experiments with HD on text datasets show Gini to be more suitable than HD for this kind of problem.
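For reference, a minimal sketch of the two-class Hellinger distance split criterion as commonly formulated in the Hellinger distance decision tree literature; the notation here is illustrative and the article's own formulation may differ. For a candidate split of a node into partitions $j = 1, \dots, p$, with $X_{+}$ and $X_{-}$ the positive and negative examples reaching the node and $X_{+,j}$, $X_{-,j}$ those falling into partition $j$:

$$
d_H = \sqrt{\sum_{j=1}^{p} \left( \sqrt{\frac{|X_{+,j}|}{|X_{+}|}} - \sqrt{\frac{|X_{-,j}|}{|X_{-}|}} \right)^{2}}
$$

The split maximizing $d_H$ is chosen. Because the criterion compares the class-conditional distributions induced by the split rather than the class priors, it is largely insensitive to class skew, which motivates its use on imbalanced data.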
Keywords
Hellinger distance, Imbalanced problems, Random forests
Bibliographic citation
Ricardo Aler, José M. Valls, Henrik Boström (2020). Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets. Expert Systems with Applications, 149, 113264.