On mathematical optimization for clustering categories in contingency tables

Carrizosa, Emilio; Guerrero Lozano, Vanesa; Romero Morales, Dolores

Identifiers

URI: https://hdl.handle.net/10016/35451

ISSN: 1862-5347

DOI: https://doi.org/10.1007/s11634-022-00508-4

UXXI: AR/0000030950

Files

mathematical_ADAC_2022.pdf (681.15 KB)

Publication date

2022-06-28

Authors

Carrizosa, Emilio

Guerrero Lozano, Vanesa

Romero Morales, Dolores

Publisher

Springer

Abstract

Many applications in data analysis study whether two categorical variables are independent using a function of the entries of their contingency table. Often, the categories of the variables, associated with the rows and columns of the table, are grouped, yielding a less granular representation of the categorical variables. The purpose of this grouping is to attain reasonable sample sizes in the cells of the table and, more importantly, to incorporate expert knowledge on the allowable groupings. However, it is known that the conclusions on independence depend, in general, on the chosen granularity, as in Simpson's paradox. In this paper we propose a methodology that, for a given contingency table and a fixed granularity, finds a clustered table with the highest χ² statistic. Repeating this procedure for different values of the granularity, we can either identify an extreme grouping, namely the largest granularity for which the statistical dependence is still detected, or conclude that it does not exist and that the two variables are dependent regardless of the size of the clustered table. We propose two mathematical optimization formulations for this problem: an assignment formulation and a set partitioning one. Our approach is flexible enough to include constraints on the desirable structure of the clusters, such as must-link or cannot-link constraints on the categories that can, or cannot, be merged together, and to ensure reasonable sample sizes in the cells of the clustered table, from which trustworthy statistical conclusions can be derived. We illustrate the usefulness of our methodology using a dataset from a medical study.
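To make the core problem in the abstract concrete, the following minimal sketch searches for the grouping of row categories into a fixed number of clusters that maximizes the χ² statistic of the merged table. It is a brute-force enumeration for tiny tables only, not the assignment or set partitioning formulation proposed in the paper. The function names (`chi2`, `merge_rows`, `best_row_clustering`), the `cannot_link` argument, and the example table are all hypothetical, and only rows are clustered, which is a simplifying assumption.

```python
from itertools import product


def chi2(table):
    """Pearson chi-squared statistic of a table given as a list of rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            if exp > 0:
                stat += (obs - exp) ** 2 / exp
    return stat


def merge_rows(table, labels, k):
    """Sum the rows of `table` that share a cluster label (0..k-1)."""
    merged = [[0] * len(table[0]) for _ in range(k)]
    for row, g in zip(table, labels):
        for j, v in enumerate(row):
            merged[g][j] += v
    return merged


def best_row_clustering(table, k, cannot_link=()):
    """Enumerate all labelings of the rows into k non-empty clusters and
    return (statistic, labels) for the one with the highest chi-squared.
    `cannot_link` is a collection of row-index pairs that must not share
    a cluster (an illustrative stand-in for relational constraints)."""
    best_stat, best_labels = -1.0, None
    for labels in product(range(k), repeat=len(table)):
        if len(set(labels)) < k:
            continue  # skip labelings that leave a cluster empty
        if any(labels[a] == labels[b] for a, b in cannot_link):
            continue  # skip labelings that violate a cannot-link pair
        stat = chi2(merge_rows(table, labels, k))
        if stat > best_stat:
            best_stat, best_labels = stat, labels
    return best_stat, best_labels


# Made-up 4x2 table: rows 0 and 1 are proportional, as are rows 2 and 3.
example = [[10, 20], [20, 40], [30, 10], [60, 20]]
stat, labels = best_row_clustering(example, k=2)
constrained_stat, constrained_labels = best_row_clustering(
    example, k=2, cannot_link=[(0, 1)]
)
```

In this made-up table, merging proportional rows loses no χ² value, so the unconstrained search groups rows 0 and 1 together and rows 2 and 3 together. The enumeration grows exponentially with the number of categories, which is one reason an optimization formulation is attractive for realistic tables.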

Keywords

Contingency Tables, Mathematical optimization, Relational constraints, Clustering

Bibliographic citation

Carrizosa, E., Guerrero, V., & Romero Morales, D. (2022). On mathematical optimization for clustering categories in contingency tables. Advances in Data Analysis and Classification.

Collections

DES - Artículos de Revistas
