On mathematical optimization for clustering categories in contingency tables

Carrizosa, Emilio; Guerrero Lozano, Vanesa; Romero Morales, Dolores


dc.affiliation.dpto	UC3M. Departamento de Estadística	es
dc.contributor.author	Carrizosa, Emilio
dc.contributor.author	Guerrero Lozano, Vanesa
dc.contributor.author	Romero Morales, Dolores
dc.contributor.funder	European Commission	en
dc.contributor.funder	Agencia Estatal de Investigación (España)	es
dc.date.accessioned	2022-07-13T10:17:30Z
dc.date.available	2022-07-13T10:17:30Z
dc.date.issued	2022-06-28
dc.description.abstract	Many applications in data analysis study whether two categorical variables are independent using a function of the entries of their contingency table. Often, the categories of the variables, associated with the rows and columns of the table, are grouped, yielding a less granular representation of the categorical variables. Grouping serves to attain reasonable sample sizes in the cells of the table and, more importantly, to incorporate expert knowledge on the allowable groupings. However, it is known that the conclusions on independence depend, in general, on the chosen granularity, as in Simpson's paradox. In this paper we propose a methodology that, for a given contingency table and a fixed granularity, finds a clustered table with the highest χ² statistic. Repeating this procedure for different values of the granularity, we can either identify an extreme grouping, namely the largest granularity for which the statistical dependence is still detected, or conclude that no such grouping exists and that the two variables are dependent regardless of the size of the clustered table. For this problem, we propose two mathematical optimization formulations, one based on assignment and one on set partitioning. Our approach is flexible enough to include constraints on the desirable structure of the clusters, such as must-link or cannot-link constraints on the categories that can, or cannot, be merged together, and to ensure reasonable sample sizes in the cells of the clustered table, from which trustworthy statistical conclusions can be derived. We illustrate the usefulness of our methodology using a dataset from a medical study.	en
dc.description.sponsorship	Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214), FQM-329, P18-FR-2369 and US-1381178 (Junta de Andalucía, with FEDER Funds), PID2019-110886RB-I00 and PID2019-104901RB-I00 (funded by MCIN/AEI/10.13039/501100011033). This support is gratefully acknowledged.	en
dc.identifier.bibliographicCitation	Carrizosa, E., Guerrero, V., & Romero Morales, D. (2022). On mathematical optimization for clustering categories in contingency tables. Advances in Data Analysis and Classification.	en
dc.identifier.doi	https://doi.org/10.1007/s11634-022-00508-4
dc.identifier.issn	1862-5347
dc.identifier.publicationfirstpage	1	es
dc.identifier.publicationlastpage	23	es
dc.identifier.publicationtitle	Advances in Data Analysis and Classification	en
dc.identifier.uri	https://hdl.handle.net/10016/35451
dc.identifier.uxxi	AR/0000030950
dc.language.iso	eng	es
dc.publisher	Springer	es
dc.relation.projectID	info:eu-repo/grantAgreement/EC/H2020/822214	es
dc.relation.projectID	Gobierno de España. PID2019-110886RB-I00	es
dc.relation.projectID	Gobierno de España. PID2019-104901RB-I00	es
dc.relation.projectID	AT-2022	es
dc.rights	© The Author(s) 2022	es
dc.rights	Attribution 3.0 Spain	*
dc.rights.accessRights	open access	en
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/es/	*
dc.subject.eciencia	Estadística	es
dc.subject.other	Contingency Tables	en
dc.subject.other	Mathematical optimization	es
dc.subject.other	Relational constraints	en
dc.subject.other	Clustering	en
dc.title	On mathematical optimization for clustering categories in contingency tables	en
dc.type	research article	*
dc.type.hasVersion	VoR	*
dspace.entity.type	Publication