Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets

Aler, Ricardo; Valls, José M.; Boström, Henrik

dc.affiliation.dpto	UC3M. Departamento de Informática	es
dc.affiliation.grupoinv	UC3M. Grupo de Investigación: Computación Evolutiva y Redes Neuronales (EVANNAI)	es
dc.contributor.author	Aler, Ricardo
dc.contributor.author	Valls, José M.
dc.contributor.author	Boström, Henrik
dc.contributor.funder	Ministerio de Economía y Competitividad (España)	es
dc.date.accessioned	2020-08-05T08:27:00Z
dc.date.available	2022-07-01T23:00:06Z
dc.date.issued	2020-07-01
dc.description.abstract	Hellinger Distance (HD) is a splitting metric that has been shown to perform very well on imbalanced classification problems for methods based on Bagging of trees, while also performing well on balanced problems. Given that Random Forests (RF) use Bagging as one of two fundamental techniques to create diversity in the ensemble, HD could be expected to be effective for this ensemble method as well. The main aim of this article is to carry out an extensive investigation of important aspects of the use of HD in RF, including the handling of multi-class problems, hyper-parameter optimization, metric comparison, probability estimation, and metric combination. In particular, HD is compared to other commonly used splitting metrics (Gini and Gain Ratio) in several contexts: balanced/imbalanced and two-class/multi-class. Two aspects of classification problems are assessed: classification itself and probability estimation. HD is defined for two-class problems, but it can be extended to multi-class problems in several ways, and this article studies the performance of the available options. Finally, even though HD can be used as an alternative to other splitting metrics, there is no reason to limit RF to just one of them. Therefore, the final study of this article determines whether selecting the splitting metric by cross-validation on the training data can improve results further. Results show HD to be a robust measure for RF, with some weakness on balanced multi-class datasets (especially for probability estimation). Combining metrics yields more robust performance. However, experiments with HD on text datasets show Gini to be more suitable than HD for this kind of problem.	en
dc.description.sponsorship	The first two authors have been funded by the Spanish Ministry of Science under project ENE2014-56126-C2-2-R. The first author was also funded by Fundación Caja Madrid Movilidad 2011-2012.	en
dc.identifier.bibliographicCitation	Ricardo Aler, José M. Valls, Henrik Boström. (2020). Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets. Expert Systems with Applications. 149 (113264).	en
dc.identifier.doi	https://doi.org/10.1016/j.eswa.2020.113264
dc.identifier.issn	0957-4174
dc.identifier.publicationfirstpage	1
dc.identifier.publicationissue	113264
dc.identifier.publicationlastpage	44
dc.identifier.publicationtitle	EXPERT SYSTEMS WITH APPLICATIONS	en
dc.identifier.publicationvolume	149
dc.identifier.uri	https://hdl.handle.net/10016/30751
dc.identifier.uxxi	AR/0000025857
dc.language.iso	eng	en
dc.publisher	Elsevier	en
dc.relation.projectID	Gobierno de España. ENE2014-56126-C2-2-R	es
dc.rights	© 2020 Elsevier Ltd. All rights reserved.	en
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Spain	*
dc.rights.accessRights	open access	es
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	*
dc.subject.eciencia	Informática	es
dc.subject.other	Hellinger distance	en
dc.subject.other	Imbalanced problems	en
dc.subject.other	Random forests	en
dc.title	Study of Hellinger Distance as a splitting metric for Random Forests in balanced and imbalanced classification datasets	en
dc.type	research article	*
dc.type.hasVersion	AM	*
dspace.entity.type	Publication
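
The abstract describes HD as a splitting metric defined for two-class problems. As a minimal sketch of what such a metric computes, the function below scores one candidate binary split using the Hellinger distance as it is commonly defined in the Hellinger distance decision tree literature. This record does not give the formula, so the exact formulation used in the article may differ. The function name `hellinger_split_score` and its four count arguments are hypothetical, chosen only for illustration.

```python
import math


def hellinger_split_score(left_pos, left_neg, right_pos, right_neg):
    """Score a binary split of a two-class node (assumed formulation).

    Each argument is the number of positive or negative training examples
    sent to the left or right branch. The score compares how the positive
    class and the negative class are distributed across the branches, and
    it does not use the class priors directly.
    """
    total_pos = left_pos + right_pos
    total_neg = left_neg + right_neg
    # With one class absent from the node, no split can separate anything.
    if total_pos == 0 or total_neg == 0:
        return 0.0

    squared_sum = 0.0
    for pos, neg in ((left_pos, left_neg), (right_pos, right_neg)):
        # Fraction of each class that falls into this branch.
        pos_fraction = pos / total_pos
        neg_fraction = neg / total_neg
        squared_sum += (math.sqrt(pos_fraction) - math.sqrt(neg_fraction)) ** 2
    return math.sqrt(squared_sum)


# A split that isolates 10 positives from 1000 negatives.
perfect = hellinger_split_score(10, 0, 0, 1000)
# A split that sends half of each class to each branch.
uninformative = hellinger_split_score(5, 500, 5, 500)
```

Under this formulation, a split that separates the two classes perfectly scores the square root of 2 regardless of how rare the positive class is, and a split that distributes both classes identically across the branches scores 0. A tree builder would pick the candidate split with the highest score.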