Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis

e-Archivo Repository

dc.contributor.author González Calabozo, José M.
dc.contributor.author Valverde Albacete, Francisco José
dc.contributor.author Peláez Moreno, Carmen
dc.date.accessioned 2021-05-27T09:54:44Z
dc.date.available 2021-05-27T09:54:44Z
dc.date.issued 2016-09-15
dc.identifier.bibliographicCitation González-Calabozo, J. M., Valverde-Albacete, F. J. & Peláez-Moreno, C. (2016). Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis. BMC Bioinformatics, 17 (374).
dc.identifier.issn 1471-2105
dc.identifier.uri http://hdl.handle.net/10016/32773
dc.description.abstract Background: Gene Expression Data (GED) analysis poses a great challenge to the scientific community that can be framed within the Knowledge Discovery in Databases (KDD) and Data Mining (DM) paradigm. Biclustering has emerged as the machine learning method of choice to solve this task, but its unsupervised nature makes result assessment problematic. This is often addressed by means of Gene Set Enrichment Analysis (GSEA).
Results: We put forward a framework in which GED analysis is understood as an Exploratory Data Analysis (EDA) process where we provide support for continuous human interaction with data aiming at improving the step of hypothesis abduction and assessment. We focus on the adaptation to human cognition of data interpretation and visualization of the output of EDA. First, we give a proper theoretical background to biclustering using Lattice Theory and provide a set of analysis tools revolving around K-Formal Concept Analysis (K-FCA), a lattice-theoretic unsupervised learning technique for real-valued matrices. By using different kinds of cost structures to quantify expression we obtain different sequences of hierarchical biclusterings for gene under- and over-expression using thresholds. Consequently, we provide a method with interleaved analysis steps and visualization devices so that the sequences of lattices for a particular experiment summarize the researcher’s vision of the data. This also allows us to define measures of persistence and robustness of biclusters to assess them. Second, the resulting biclusters are used to index external omics databases, for instance Gene Ontology (GO), thus offering a new way of accessing publicly available resources. This provides different flavors of gene set enrichment against which to assess the biclusters, by obtaining their p-values according to the terminology of those resources. We illustrate the exploration procedure on a real data example confirming results previously published.
Conclusions: The GED analysis problem gets transformed into the exploration of a sequence of lattices enabling the visualization of the hierarchical structure of the biclusters with a certain degree of granularity. The ability of FCA-based biclustering methods to index external databases such as GO allows us to obtain a quality measure of the biclusters, to observe the evolution of a gene throughout the different biclusters it appears in, and to look for relevant biclusters, by observing their genes and their persistence, in order to infer, for instance, hypotheses on their function.
dc.format.extent 15
dc.language.iso eng
dc.publisher BMC
dc.rights © 2016 The Author(s).
dc.rights Attribution 3.0 Spain
dc.rights.uri http://creativecommons.org/licenses/by/3.0/es/
dc.subject.other Biclustering
dc.subject.other Gene expression data
dc.subject.other Formal concept analysis
dc.subject.other Exploratory data analysis
dc.subject.other Gene set enrichment
dc.subject.other Knowledge discovery
dc.subject.other Data mining
dc.subject.other Concept lattices
dc.subject.other Network
dc.subject.other Cancer
dc.subject.other Genes
dc.title Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis
dc.type article
dc.subject.eciencia Telecommunications
dc.identifier.doi https://doi.org/10.1186/s12859-016-1234-z
dc.rights.accessRights openAccess
dc.relation.projectID Gobierno de España. TEC2014-53390-P
dc.relation.projectID Gobierno de España. TEC2014-61729-EXP
dc.type.version publishedVersion
dc.identifier.publicationfirstpage 1
dc.identifier.publicationissue 374
dc.identifier.publicationlastpage 15
dc.identifier.publicationtitle BMC Bioinformatics
dc.identifier.publicationvolume 17
dc.identifier.uxxi AR/0000018260
dc.contributor.funder Ministerio de Economía y Competitividad (España)