Métodos de clustering en datos de expresión génica

Thumbnail Image
Publication date
Defense date
Journal Title
Journal ISSN
Volume Title
Google Scholar
Research Projects
Organizational Units
Journal Issue
Clustering is an old data analysis problem that has been extensively studied during the last decades. However, there is not a single algorithm that provides a satisfactory result for every data set. Moreover, there exist some problems related to cluster analysis that also remain unsolved. In this monograph we study some of such problems as they commonly appear in practice, and test how they work when applied to gene expression data analysis, where clustering is widely used. Different clustering algorithms often lead to different results, and in order to make sense out of them it is important to understand how clusters from one analysis relate to those from a different one. A comparison method to find and visualize many-to-many relationships between two clusterings, either two flat clusterings or a flat and a hierarchical clustering, is presented. The similarities between clusters are represented by a weighted bipartite graph, where the nodes are the clusters and an edge weight shows the number of elements in common to the connected nodes. To visualize the relationships between clusterings the number of edge crossings is minimized. When applied to the case of comparing a hierarchical and a flat clustering we use a criterion based either on the graph layout aesthetics or in the mutual information, to decide where to cut the hierarchical tree. Since iterative methods are sensitive to the initial parameters, we have developed two refinement algorithms designed to improve this initial state, based on the notion of data depth. One of these algorithms looks for initial points in the same data space, while the second one, using the bootstrap technique, selects the initial seeds in a new space of bootstrap centroids. Also, this second approach allows to construct a soft (non-hard) clustering of the data, that assigns to each point a probability of belonging to each cluster, and thus a single point may partially belong to more than one cluster. On the other hand, the number of clusters underlying in a data set is usually unknown. Using ideas from the clustering comparison method previously proposed and from the data depth concept, we present three procedures to estimate the number of real groups. The first two methods consist basically in sampling pairs of clusterings from a population and successively performing comparisons between them to find a consensus in the number of clusters, and the third one looks for representative subsets of the clusters whose diameter is used to estimate the optimal number of real groups. The extensive study we carried out in simulated and real gene expression data shows that the techniques presented here are useful and e±cient. The results that we obtained with real data make sense not only from a statistical point of view, but they have proven to have a biological meaning. ______________________________________________
El análisis cluster es un antiguo problema revivido en las últimas décadas. En el trabajo presentado abordamos algunos problemas que aparecen en la práctica. Para entender los distintos resultados producidos por diferentes algoritmos es importante estudiar la relación entre clusters procedentes de análisis diferentes, por lo que presentamos un método de comparación para visualizar relaciones entre clusterings jerárquicos o no-jerárquicos, basado en grafos, utilizando un criterio de estética o de información mutua para cortar los dendrogramas en el caso jerárquico. Desarrollamos dos algoritmos de refinamiento del estado inicial de métodos de clustering iterativos, utilizando el concepto de profundidad y bootstrap. Esto además permite desarrollar un algoritmo de clustering no rígido, asignando a los puntos probabilidades de pertenencia a los clusters. Para determinar el número de grupos de un conjunto (habitualmente desconocido) hemos utilizado ideas del método de comparación y el concepto de profundidad, desarrollando tres técnicas de estimación. Hemos realizado un estudio extensivo para todos los métodos propuestos en datos simulados y en datos de expresión génica, y hemos probado que las técnicas desarrolladas en este trabajo son útiles y eficientes, tanto desde un punto de vista estadístico como biológico
Análisis de conglomerados, Biología molecular, Métodos de clustering
Bibliographic citation