Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets

dc.affiliation.dpto: UC3M. Departamento de Informática
dc.affiliation.grupoinv: UC3M. Grupo de Investigación: Computación Evolutiva y Redes Neuronales (EVANNAI)
dc.contributor.author: Aler, Ricardo
dc.contributor.author: Valls, José M.
dc.contributor.author: Boström, Henrik
dc.contributor.funder: Ministerio de Economía y Competitividad (España)
dc.description.abstract: Hellinger Distance (HD) is a splitting metric that has shown excellent performance on imbalanced classification problems for methods based on Bagging of trees, while also performing well on balanced problems. Given that Random Forests (RF) use Bagging as one of two fundamental techniques to create diversity in the ensemble, HD could be expected to be effective for this ensemble method as well. The main aim of this article is to carry out an extensive investigation of important aspects of the use of HD in RF, including the handling of multi-class problems, hyper-parameter optimization, metric comparison, probability estimation, and metric combination. In particular, HD is compared to other commonly used splitting metrics (Gini and Gain Ratio) in several contexts: balanced/imbalanced and two-class/multi-class. Two aspects of classification problems are assessed: classification itself and probability estimation. HD is defined for two-class problems, but it can be extended to multi-class problems in several ways, and this article studies the performance of the available options. Finally, even though HD can be used as an alternative to other splitting metrics, there is no reason to limit RF to just one of them. Therefore, the final study of this article determines whether selecting the splitting metric via cross-validation on the training data can further improve results. Results show HD to be a robust measure for RF, with some weakness on balanced multi-class datasets (especially for probability estimation). Combining metrics yields a more robust performance. However, experiments with HD on text datasets show Gini to be more suitable than HD for problems of this kind.
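For reference, the two-class Hellinger distance splitting criterion described in the abstract can be sketched as follows. This is a minimal illustration based on the standard Hellinger-distance decision-tree formulation for a binary split; the function name and signature are hypothetical, not taken from the article's implementation.

```python
import math

def hellinger_split(n_pos_left, n_neg_left, n_pos, n_neg):
    """Two-class Hellinger distance of a candidate binary split.

    n_pos, n_neg: class counts at the node being split.
    n_pos_left, n_neg_left: counts routed to the left branch; the
    right branch receives the remainder.  Larger values indicate a
    better separation of the two class distributions.
    """
    n_pos_right = n_pos - n_pos_left
    n_neg_right = n_neg - n_neg_left
    # Sum, over both branches, of the squared difference between the
    # square roots of the class-conditional branch rates.
    d = (math.sqrt(n_pos_left / n_pos) - math.sqrt(n_neg_left / n_neg)) ** 2 \
      + (math.sqrt(n_pos_right / n_pos) - math.sqrt(n_neg_right / n_neg)) ** 2
    return math.sqrt(d)
```

Note that the criterion depends only on the class-conditional branch rates, not on the class priors, which is why it is insensitive to class imbalance; a perfectly separating split attains the maximum value of sqrt(2), while a split that routes each class in equal proportions scores 0.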
dc.description.sponsorship: The first two authors have been funded by the Spanish Ministry of Science under project ENE2014-56126-C2-2-R. The first author was also funded by Fundación Caja Madrid Movilidad 2011-2012.
dc.identifier.bibliographicCitation: Ricardo Aler, José M. Valls, Henrik Boström. (2020). Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets. Expert Systems with Applications, 149, 113264.
dc.identifier.publicationtitle: EXPERT SYSTEMS WITH APPLICATIONS
dc.relation.projectID: Gobierno de España. ENE2014-56126-C2-2-R
dc.rights: © 2020 Elsevier Ltd. All rights reserved.
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 Spain (CC BY-NC-ND 3.0 ES)
dc.rights.accessRights: open access
dc.subject.other: Hellinger distance
dc.subject.other: Imbalanced problems
dc.subject.other: Random forests
dc.title: Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets
dc.type: research article