RT Journal Article T1 On mathematical optimization for clustering categories in contingency tables A1 Carrizosa, Emilio A1 Guerrero Lozano, Vanesa A1 Romero Morales, Dolores AB Many applications in data analysis study whether two categorical variables are independent using a function of the entries of their contingency table. Often, the categories of the variables, associated with the rows and columns of the table, are grouped, yielding a less granular representation of the categorical variables. The purpose of this is to attain reasonable sample sizes in the cells of the table and, more importantly, to incorporate expert knowledge on the allowable groupings. However, it is known that the conclusions on independence depend, in general, on the chosen granularity, as in the Simpson paradox. In this paper we propose a methodology to, for a given contingency table and a fixed granularity, find a clustered table with the highest χ² statistic. Repeating this procedure for different values of the granularity, we can either identify an extreme grouping, namely the largest granularity for which the statistical dependence is still detected, or conclude that it does not exist and that the two variables are dependent regardless of the size of the clustered table. For this problem, we propose an assignment mathematical formulation and a set partitioning one. Our approach is flexible enough to include constraints on the desirable structure of the clusters, such as must-link or cannot-link constraints on the categories that can, or cannot, be merged together, and ensure reasonable sample sizes in the cells of the clustered table from which trustful statistical conclusions can be derived. We illustrate the usefulness of our methodology using a dataset of a medical study. PB Springer SN 1862-5347 YR 2022 FD 2022-06-28 LK https://hdl.handle.net/10016/35451 UL https://hdl.handle.net/10016/35451 LA eng NO Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214), FQM-329, P18-FR-2369 and US-1381178 (Junta de Andalucía, with FEDER Funds), PID2019-110886RB-I00 and PID2019-104901RB-I00 (funded by MCIN/AEI/10.13039/501100011033). This support is gratefully acknowledged. DS e-Archivo RD 1 sept. 2024