Rights:
Atribución-NoComercial-SinDerivadas 3.0 España
Abstract:
Clustering is an old data analysis problem that has been extensively studied during
the last decades. However, there is not a single algorithm that provides a satisfactory result
for every data set. Moreover, there exist some problems related to cluster analClustering is an old data analysis problem that has been extensively studied during
the last decades. However, there is not a single algorithm that provides a satisfactory result
for every data set. Moreover, there exist some problems related to cluster analysis that also
remain unsolved. In this monograph we study some of such problems as they commonly appear
in practice, and test how they work when applied to gene expression data analysis, where
clustering is widely used.
Different clustering algorithms often lead to different results, and in order to make sense
out of them it is important to understand how clusters from one analysis relate to those from a
different one. A comparison method to find and visualize many-to-many relationships between
two clusterings, either two flat clusterings or a flat and a hierarchical clustering, is presented.
The similarities between clusters are represented by a weighted bipartite graph, where the
nodes are the clusters and an edge weight shows the number of elements in common to the
connected nodes. To visualize the relationships between clusterings the number of edge crossings
is minimized. When applied to the case of comparing a hierarchical and a flat clustering we use
a criterion based either on the graph layout aesthetics or in the mutual information, to decide
where to cut the hierarchical tree.
Since iterative methods are sensitive to the initial parameters, we have developed two refinement algorithms designed to improve this initial state, based on the notion of data depth.
One of these algorithms looks for initial points in the same data space, while the second one,
using the bootstrap technique, selects the initial seeds in a new space of bootstrap centroids.
Also, this second approach allows to construct a soft (non-hard) clustering of the data, that
assigns to each point a probability of belonging to each cluster, and thus a single point may
partially belong to more than one cluster.
On the other hand, the number of clusters underlying in a data set is usually unknown.
Using ideas from the clustering comparison method previously proposed and from the data
depth concept, we present three procedures to estimate the number of real groups. The first two
methods consist basically in sampling pairs of clusterings from a population and successively
performing comparisons between them to find a consensus in the number of clusters, and the
third one looks for representative subsets of the clusters whose diameter is used to estimate
the optimal number of real groups.
The extensive study we carried out in simulated and real gene expression data shows that
the techniques presented here are useful and e±cient. The results that we obtained with real
data make sense not only from a statistical point of view, but they have proven to have a biological meaning.
______________________________________________[+][-]
El análisis cluster es un antiguo problema revivido en las últimas décadas. En el trabajo presentado abordamos algunos problemas que aparecen en la práctica. Para entender los distintos resultados producidos por diferentes algoritmos es importante estudiar la El análisis cluster es un antiguo problema revivido en las últimas décadas. En el trabajo presentado abordamos algunos problemas que aparecen en la práctica. Para entender los distintos resultados producidos por diferentes algoritmos es importante estudiar la relación entre clusters procedentes de análisis diferentes, por lo que presentamos un método de comparación para visualizar relaciones entre clusterings jerárquicos o no-jerárquicos, basado en grafos, utilizando un criterio de estética o de información mutua para cortar los dendrogramas en el caso jerárquico. Desarrollamos dos algoritmos de refinamiento del estado inicial de métodos de clustering iterativos, utilizando el concepto de profundidad y bootstrap. Esto además permite desarrollar un algoritmo de clustering no rígido, asignando a los puntos probabilidades de pertenencia a los clusters. Para determinar el número de grupos de un conjunto (habitualmente desconocido) hemos utilizado ideas del método de comparación y el concepto de profundidad, desarrollando tres técnicas de estimación. Hemos realizado un estudio extensivo para todos los métodos propuestos en datos simulados y en datos de expresión génica, y hemos probado que las técnicas desarrolladas en este trabajo son útiles y eficientes, tanto desde un punto de vista estadístico como biológico[+][-]