Citation:
Carrizosa, E., Guerrero, V., & Romero Morales, D. (2022). On mathematical optimization for clustering categories in contingency tables. Advances in Data Analysis and Classification.
Funder:
European Commission; Agencia Estatal de Investigación (España)
Sponsor:
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214), FQM-329, P18-FR-2369 and US-1381178 (Junta de Andalucía, with FEDER Funds), PID2019-110886RB-I00 and PID2019-104901RB-I00 (funded by MCIN/AEI/10.13039/501100011033). This support is gratefully acknowledged.
Project:
info:eu-repo/grantAgreement/EC/H2020/822214 Gobierno de España. PID2019-110886RB-I00 Gobierno de España. PID2019-104901RB-I00
Many applications in data analysis study whether two categorical variables are independent
using a function of the entries of their contingency table. Often, the categories
of the variables, associated with the rows and columns of the table, are grouped, yielding
a less granular representation of the categorical variables. The purpose of this is
to attain reasonable sample sizes in the cells of the table and, more importantly, to
incorporate expert knowledge on the allowable groupings. However, it is known that
the conclusions on independence depend, in general, on the chosen granularity, as in
Simpson's paradox. In this paper we propose a methodology that, for a given contingency
table and a fixed granularity, finds a clustered table with the highest χ² statistic.
Repeating this procedure for different values of the granularity, we can either identify
an extreme grouping, namely the largest granularity for which the statistical dependence
is still detected, or conclude that it does not exist and that the two variables are
dependent regardless of the size of the clustered table. For this problem, we propose
an assignment mathematical formulation and a set partitioning one. Our approach is
flexible enough to include constraints on the desirable structure of the clusters, such as
must-link or cannot-link constraints on the categories that can, or cannot, be merged
together, and ensure reasonable sample sizes in the cells of the clustered table from
which trustworthy statistical conclusions can be derived. We illustrate the usefulness of
our methodology using a dataset of a medical study.
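
To make the core idea concrete, the following is a minimal sketch of finding, for a fixed number of row clusters, the grouping of row categories whose clustered table has the highest χ² statistic. It is not the assignment or set partitioning formulation proposed in the paper: it swaps in plain exhaustive enumeration, which is only practical for very small tables. The function names (`chi2_statistic`, `merge_rows`, `best_row_clustering`) and the toy table are hypothetical, only rows are clustered, and all row and column totals are assumed to be positive.

```python
from itertools import product


def chi2_statistic(table):
    """Pearson chi-squared statistic of a contingency table given as a list of rows.

    Assumes every row total and column total is positive.
    """
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat


def merge_rows(table, labels, k):
    """Sum the rows of `table` that share the same cluster label in 0..k-1."""
    merged = [[0] * len(table[0]) for _ in range(k)]
    for row, g in zip(table, labels):
        for j, v in enumerate(row):
            merged[g][j] += v
    return merged


def best_row_clustering(table, k):
    """Brute force (illustrative only): try every assignment of rows to k
    non-empty clusters and keep one with the largest chi-squared statistic."""
    best_stat, best_labels = -1.0, None
    for labels in product(range(k), repeat=len(table)):
        if len(set(labels)) < k:  # skip assignments that leave a cluster empty
            continue
        stat = chi2_statistic(merge_rows(table, labels, k))
        if stat > best_stat:
            best_stat, best_labels = stat, labels
    return best_stat, best_labels


# Hypothetical 3x2 table: the first two rows have proportional counts.
toy = [[10, 20], [20, 40], [30, 10]]
stat, labels = best_row_clustering(toy, k=2)
```

In this toy table the first two rows are proportional, so merging them loses no χ² and the search groups them together. Structural requirements such as must-link or cannot-link constraints could be imitated here by skipping label assignments that violate them, although the mathematical optimization formulations in the paper handle such constraints directly.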