DTSC - GPM - Artículos de Revistas

Recent Submissions

Now showing 1 - 20 of 83
  • Publication
    Recycling weak labels for multiclass classification
    (Elsevier, 2020-08-04) Perello Nieto, Miquel; Santos Rodríguez, Raúl; García García, Dario; Cid Sueiro, Jesús; European Commission; Ministerio de Ciencia e Innovación (España)
    This paper explores the mechanisms to efficiently combine annotations of different quality for multiclass classification datasets, as we argue that it is easier to obtain large collections of weak labels as opposed to true labels. Since labels come from different sources, their annotations may have different degrees of reliability (e.g., noisy labels, supersets of labels, complementary labels or annotations performed by domain experts), and we must make sure that the addition of potentially inaccurate labels does not degrade the performance achieved when using only true labels. For this reason, we consider each group of annotations as being weakly supervised and pose the problem as finding the optimal combination of such collections. We propose an efficient algorithm based on expectation-maximization and show its performance in both synthetic and real-world classification tasks in a variety of weak label scenarios.
  • Publication
    Onset of schizophrenia diagnoses in a large clinical cohort
    (Springer, 2019-01-01) López Castroman, Jorge; Leiva Murillo, José Miguel; Cegla Schvartzman, Fanny; Blasco Fontecilla, Hilario; García Nieto, Rebeca; Artés Rodríguez, Antonio; Morant Ginestar, Consuelo; Courtet, Philippe; Blanco, Carlos; Aroca, Fuensanta; Baca García, Enrique; Comunidad de Madrid
    We aimed to describe the diagnostic patterns preceding and following the onset of schizophrenia diagnoses in outpatient clinics. A large clinical sample of 26,163 patients with a diagnosis of schizophrenia in at least one outpatient visit was investigated. We applied a Continuous Time Hidden Markov Model to describe the probability of transition from other diagnoses to schizophrenia considering time proximity. Although the most frequent diagnoses before schizophrenia were anxiety and mood disorders, direct transitions to schizophrenia usually came from psychotic-spectrum disorders. The initial diagnosis of schizophrenia was not likely to change for two of every three patients if it was confirmed some months after its onset. When not confirmed, the most frequent alternative diagnoses were personality, affective or non-schizophrenia psychotic disorders. Misdiagnosis or comorbidity with affective, anxiety and personality disorders are frequent before and after the diagnosis of schizophrenia. Our findings give partial support to a dimensional view of schizophrenia and emphasize the need for longitudinal assessment.
  • Publication
    Bindi: Affective internet of things to combat gender-based violence
    (IEEE, 2022-11-01) Miranda Calero, José Ángel; Rituerto González, Esther; Luis Mingueza, Clara; Canabal Benito, Manuel Felipe; Ramírez Bárcenas, Alberto; Lanza Gutiérrez, José Manuel; Peláez Moreno, Carmen; López Ongil, Celia; Comunidad de Madrid; Ministerio de Ciencia e Innovación (España); Universidad Carlos III de Madrid
The main research motivation of this article is the fight against gender-based violence and the achievement of gender equality from a technological perspective. The solution proposed in this work goes beyond currently existing panic buttons, which need to be manually operated by the victim under difficult circumstances. Instead, Bindi, our end-to-end autonomous multimodal system, relies on artificial intelligence methods to automatically identify violent situations, based on detecting fear-related emotions, and to trigger a protection protocol if necessary. To this end, Bindi integrates modern state-of-the-art technologies, such as the Internet of Bodies, affective computing, and cyber-physical systems, leveraging: 1) affective Internet of Things (IoT) with auditory and physiological commercial off-the-shelf smart sensors embedded in wearable devices; 2) hierarchical multisensorial information fusion; and 3) the edge-fog-cloud IoT architecture. This solution is evaluated using our own dataset, WEMAC, a recently collected and freely available collection of the auditory and physiological responses of 47 women to several emotions elicited in a virtual reality environment. On this basis, this work analyzes multimodal late fusion strategies for combining the physiological and speech data processing pipelines in order to identify the best intelligence-engine strategy for Bindi. In particular, the best data fusion strategy reports an overall fear classification accuracy of 63.61% for a subject-independent approach. A power consumption study and an audio data processing pipeline for detecting violent acoustic events complement this analysis. This research is intended as an initial multimodal baseline that facilitates further work with real-life elicited fear in women.
  • Publication
    ACME: Automatic feature extraction for cell migration examination through intravital microscopy imaging
    (Elsevier, 2022-04-01) Molina Moreno, Miguel; González Díaz, Iván; Sicilia, Jon; Crainiciuc, Georgiana; Palomino Segura, Miguel; Hidalgo, Andrés; Díaz de María, Fernando; Comunidad de Madrid; Ministerio de Economía y Competitividad (España); Ministerio de Educación, Cultura y Deporte (España); Universidad Carlos III de Madrid
    Cell detection and tracking applied to in vivo fluorescence microscopy has become an essential tool in biomedicine to characterize 4D (3D space plus time) biological processes at the cellular level. Traditional approaches to cell motion analysis by microscopy imaging, although based on automatic frameworks, still require manual supervision at some points of the system. Hence, when dealing with a large amount of data, the analysis becomes incredibly time-consuming and typically yields poor biological information. In this paper, we propose a fully-automated system for segmentation, tracking and feature extraction of migrating cells within blood vessels in 4D microscopy imaging. Our system consists of a robust 3D convolutional neural network (CNN) for joint blood vessel and cell segmentation, a 3D tracking module with collision handling, and a novel method for feature extraction, which takes into account the particular geometry in the cell-vessel arrangement. Experiments on a large 4D intravital microscopy dataset show that the proposed system achieves a significantly better performance than the state-of-the-art tools for cell segmentation and tracking. Furthermore, we have designed an analytical method of cell behaviors based on the automatically extracted features, which supports the hypotheses related to leukocyte migration posed by expert biologists. This is the first time that such a comprehensive automatic analysis of immune cell migration has been performed, where the total population under study reaches hundreds of neutrophils and thousands of time instances.
  • Publication
    End-to-end recurrent denoising autoencoder embeddings for speaker identification
    (Springer, 2021-05-10) Rituerto González, Esther; Peláez Moreno, Carmen; Comunidad de Madrid
Speech "in the wild" is a handicap for speaker recognition systems due to the variability induced by real-life conditions, such as environmental noise and the emotional state of the speaker. Taking advantage of the principles of representation learning, we design a recurrent denoising autoencoder that extracts robust speaker embeddings from noisy spectrograms to perform speaker identification. The proposed end-to-end architecture uses a feedback loop to encode information about the speaker into low-dimensional representations extracted by a spectrogram denoising autoencoder. We employ data augmentation techniques, additively corrupting clean speech with real-life environmental noise, in a database containing real stressed speech. Our study shows that the joint optimization of the denoiser and speaker identification modules outperforms both the independent optimization of the two components and handcrafted features under stress and noise distortions.
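The augmentation step, additively corrupting clean speech with environmental noise, amounts to scaling the noise so that the mixture reaches a target signal-to-noise ratio. A minimal sketch (the SNR levels and noise corpus used in the paper are not reproduced here):

```python
import numpy as np

def corrupt_with_noise(clean, noise, snr_db):
    """Additively corrupt a clean waveform with environmental noise at a
    target signal-to-noise ratio (a common augmentation recipe; specific
    corpora and SNR ranges are illustrative assumptions)."""
    noise = np.resize(noise, clean.shape)      # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # scale noise so that 10*log10(p_clean / p_scaled_noise) == snr_db
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

By construction, the power ratio between the clean signal and the added noise matches the requested SNR exactly.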
  • Publication
    DermaKNet: Incorporating the Knowledge of Dermatologists to Convolutional Neural Networks for Skin Lesion Diagnosis
    (IEEE, 2019-03-01) González Díaz, Iván; Ministerio de Economía y Competitividad (España)
    Traditional approaches to automatic diagnosis of skin lesions consisted of classifiers working on sets of hand-crafted features, some of which modeled lesion aspects of special importance for dermatologists. Recently, the broad adoption of convolutional neural networks (CNNs) in most computer vision tasks has brought about a great leap forward in terms of performance. Nevertheless, with this performance leap, the CNN-based computer-aided diagnosis (CAD) systems have also brought a notable reduction of the useful insights provided by hand-crafted features. This paper presents DermaKNet, a CAD system based on CNNs that incorporates specific subsystems modeling properties of skin lesions that are of special interest to dermatologists aiming to improve the interpretability of its diagnosis. Our results prove that the incorporation of these subsystems not only improves the performance, but also enhances the diagnosis by providing more interpretable outputs.
  • Publication
    Perceptually-guided deep neural networks for ego-action prediction: Object grasping
    (Elsevier, 2019-04-01) González Díaz, Iván; Benois-Pineau, Jenny; Domenger, Jean Phillippe; Cattaert, Daniel; de Rugby, Aymar; Ministerio de Economía y Competitividad (España)
We tackle the problem of predicting a grasping action in ego-centric video for the assistance of upper limb amputees. Our work is based on paradigms of neuroscience stating that human gaze expresses intention and anticipates actions. In our scenario, human gaze fixations are recorded by a glasses-worn eye-tracker and then used to predict the grasping actions. We have studied two aspects of the problem: which object from a given taxonomy will be grasped, and when is the moment to trigger the grasping action. To recognize objects, we use gaze to guide Convolutional Neural Networks (CNNs) to focus on an object-to-grasp area. However, the acquired sequence of fixations is noisy due to saccades toward distractors and visual fatigue, and gaze is not always reliably directed toward the object of interest. To deal with this challenge, we use video-level annotations indicating the object to be grasped and a weak loss in deep CNNs. To detect the moment when a person will take an object, we take advantage of the predictive power of Long Short-Term Memory networks to analyze gaze and visual dynamics. Results show that our method achieves better performance than other approaches on a real-life dataset.
  • Publication
    An Integrated Millimeter-Wave Satellite Radiometer Working at Room-Temperature with High Photon Conversion Efficiency
    (MDPI AG, 2022-03-21) Dawoud, Kerlos Atia Abdalmalak; Santamaría Botello, Gabriel Arturo; Suresh, Mallika Irene; Falcón Gómez, Enderson; Rivera Lavado, Alejandro; García Muñoz, Luis Enrique; Comunidad de Madrid; Ministerio de Ciencia e Innovación (España)
In this work, the design of an integrated 183 GHz radiometer frontend for Earth observation applications on satellites is presented. By means of efficient electro-optic modulation of a laser pump with the observed millimeter-wave signal, followed by detection of the generated optical sideband, a room-temperature low-noise receiver frontend is proposed as an alternative to conventional Low Noise Amplifiers (LNAs) or Schottky mixers. Efficient millimeter-wave to 1550 nm upconversion is realized via a nonlinear optical process in a triply resonant high-Q Lithium Niobate (LN) Whispering Gallery Mode (WGM) resonator. By engineering a micromachined millimeter-wave cavity that maximizes the overlap with the optical modes while guaranteeing phase matching, the system has a predicted normalized photon-conversion efficiency of 10⁻¹ per mW of pump power, surpassing the state of the art by around three orders of magnitude at millimeter-wave frequencies. A piezo-driven millimeter-wave tuning mechanism is designed to compensate for fabrication and assembly tolerances and to reduce the complexity of the manufacturing process.
  • Publication
    Standing-Wave Feeding for High-Gain Linear Dielectric Resonator Antenna (DRA) Array
    (MDPI AG, 2022-04-15) Dawoud, Kerlos Atia Abdalmalak; Althuwayb, Ayman Abdulhadi; Choon, Sae Lee; Santamaría Botello, Gabriel Arturo; Falcón Gómez, Enderson; García Castillo, Luis Emilio; García Muñoz, Luis Enrique; Comunidad de Madrid; Ministerio de Ciencia e Innovación (España)
A novel feeding method for linear DRA arrays is presented, eliminating the use of power dividers, transitions, and launchers while keeping uniform excitation of the array elements. This results in a high-gain DRA array with low losses and a design that is simple, compact, and inexpensive. The proposed feeding method is based on exciting standing waves using discrete metallic patches in a simple design procedure. Two arrays with two and four DRA elements are presented as a proof of concept, providing high gains of 12 and 15 dBi, respectively, close to the theoretical limit predicted by array theory. The radiation efficiency of both arrays is about 93%, equal to that of the array element, confirming that the feeding method does not add losses as standard methods do. To facilitate the fabrication process, the entire array structure is 3D-printed, which significantly decreases the complexity of fabrication and alignment. Compared to state-of-the-art feeding techniques, the proposed method provides higher gain and higher efficiency with a smaller electrical size.
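The reported gains are consistent with array theory: a uniformly excited, lossless array of N identical elements ideally adds 10·log10(N) dB to the element gain, so doubling the elements adds about 3 dB, matching the step from 12 dBi (two elements) to 15 dBi (four elements). A quick check (the 9 dBi element gain below is only an illustrative assumption):

```python
import math

def array_gain_dbi(element_gain_dbi, n_elements):
    """Ideal gain of a uniformly excited, lossless N-element array:
    the single-element gain plus 10*log10(N) dB."""
    return element_gain_dbi + 10 * math.log10(n_elements)

# Doubling the number of elements adds 10*log10(2) ~= 3.01 dB, consistent
# with the ~12 dBi (2 elements) and ~15 dBi (4 elements) figures reported.
```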
  • Publication
    Teaching differently: The digital signal processing of multimedia content through the use of liberal arts
    (IEEE, 2021-05) Torres Gómez, Jorge; Rodríguez Hidalgo, Antonio; Jerez Naranjo, Yannelys Virginia; Peláez Moreno, Carmen; Ministerio de Economía y Competitividad (España)
Generally, the curriculum design for undergraduate students enrolled in digital signal processing (DSP)-related engineering programs covers hard topics from specific disciplines, namely, mathematics, digital electronics, or programming. Typically, these topics are very demanding from the point of view of both students and teachers due to the inherent complexity of the mathematical formulations. However, improvements to the effectiveness of teaching can be achieved through a multisensorial approach supported by the liberal arts. By including the development of art and literacy skills in the curriculum design, the fundamentals of DSP topics may be taught from a qualitative perspective, compared to the solely analytical standpoint taken by traditional curricula. We postulate that this approach increases both the comprehension and memorization of abstract concepts by stimulating students' creativity and curiosity. In this article, we elaborate upon a methodology that incorporates liberal arts concepts into the teaching of signal processing techniques. We also illustrate the application of this methodology through specific classroom activities related to the digital processing of multimedia content in undergraduate academic programs. With this proposal, we also aim to lessen the perceived difficulty of the topic, stimulate critical thinking, and establish a framework within which nonengineering departments may contribute to the teaching of engineering subjects.
  • Publication
    Interpretable global-local dynamics for the prediction of eye fixations in autonomous driving scenarios
    (IEEE, 2020-12-01) Martinez Cebrian, Javier; Fernández Torres, Miguel Ángel; Díaz de María, Fernando
    Human eye movements while driving reveal that visual attention largely depends on the context in which it occurs. Furthermore, an autonomous vehicle which performs this function would be more reliable if its outputs were understandable. Capsule Networks have been presented as a great opportunity to explore new horizons in the Computer Vision field, due to their capability to structure and relate latent information. In this article, we present a hierarchical approach for the prediction of eye fixations in autonomous driving scenarios. Context-driven visual attention can be modeled by considering different conditions which, in turn, are represented as combinations of several spatio-temporal features. With the aim of learning these conditions, we have built an encoder-decoder network which merges visual features' information using a global-local definition of capsules. Two types of capsules are distinguished: representational capsules for features and discriminative capsules for conditions. The latter and the use of eye fixations recorded with wearable eye tracking glasses allow the model to learn both to predict contextual conditions and to estimate visual attention, by means of a multi-task loss function. Experiments show how our approach is able to express either frame-level (global) or pixel-wise (local) relationships between features and contextual conditions, allowing for interpretability while maintaining or improving the performance of black-box related systems in the literature. Indeed, our proposal offers an improvement of 29% in terms of information gain with respect to the best performance reported in the literature.
  • Publication
    An auditory saliency pooling-based LSTM model for speech intelligibility classification
    (MDPI, 2021-09) Gallardo Antolín, Ascensión; Montero, Juan Manuel; Ministerio de Economía y Competitividad (España); Universidad Carlos III de Madrid
Speech intelligibility is a crucial element in oral communication that can be influenced by multiple factors, such as noise, channel characteristics, or speech disorders. In this paper, we address the task of speech intelligibility classification (SIC) in this last circumstance. Taking our previous work, a SIC system based on an attentional long short-term memory (LSTM) network, as a starting point, we deal with the problem of inadequate learning of the attention weights due to training data scarcity. To overcome this issue, the main contribution of this paper is a novel type of weighted pooling (WP) mechanism, called saliency pooling, in which the WP weights are not learned automatically during the training of the network but are obtained from an external source of information, Kalinli's auditory saliency model. In this way, we intend to take advantage of the apparent symmetry between the human auditory attention mechanism and the attentional models integrated into deep learning networks. The developed systems are assessed on the UA-Speech dataset, which comprises speech uttered by subjects with several levels of dysarthria. Results show that all the systems with saliency pooling significantly outperform a reference support vector machine (SVM)-based system and LSTM-based systems with mean pooling and attention pooling, suggesting that Kalinli's saliency can be successfully incorporated into the LSTM architecture as an external cue for the estimation of the speech intelligibility level.
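The saliency-pooling idea, replacing learned attention weights with weights obtained from an external saliency signal, reduces at pooling time to a weighted average of the frame-level feature sequence. A minimal sketch (the saliency curve below is an arbitrary stand-in for the output of Kalinli's model):

```python
import numpy as np

def saliency_pooling(frame_features, saliency):
    """Weighted pooling of a frame-level feature sequence, with weights
    taken from an external non-negative saliency signal instead of being
    learned by the network. frame_features: (n_frames, dim)."""
    w = np.asarray(saliency, dtype=float)
    w = w / w.sum()                          # normalize to a distribution
    return (frame_features * w[:, None]).sum(axis=0)
```

With uniform saliency this collapses to mean pooling; a peaked saliency curve concentrates the utterance-level representation on the salient frames.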
  • Publication
    On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification
    (Elsevier, 2021-10-07) Gallardo Antolín, Ascensión; Montero, Juan Manuel; Ministerio de Economía y Competitividad (España)
Speech intelligibility can be affected by multiple factors, such as noisy environments, channel distortions or physiological issues. In this work, we deal with the problem of automatic prediction of the speech intelligibility level in this latter case. Starting from our previous work, a non-intrusive system based on LSTM networks with an attention mechanism designed for this task, we present two main contributions. First, we propose the use of per-frame modulation spectrograms as input features, instead of compact representations derived from them that discard important temporal information. Second, we explore two different strategies for combining per-frame acoustic log-mel and modulation spectrograms within the LSTM framework: at decision level (late fusion) and at utterance level (Weighted-Pooling, or WP, fusion). The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. On the one hand, results show that attentional LSTM networks are able to adequately model the modulation spectrogram sequences, producing classification rates similar to those obtained with log-mel spectrograms. On the other hand, both combination strategies, late and WP fusion, outperform the single-feature systems, suggesting that per-frame log-mel and modulation spectrograms carry complementary information for the task of speech intelligibility prediction that can be effectively exploited by LSTM-based architectures, with the system combining WP fusion and Attention-Pooling achieving the best results.
  • Publication
    Detecting deception from gaze and speech using a multimodal attention LSTM-based framework
    (MDPI, 2021-07-02) Gallardo Antolín, Ascensión; Montero, Juan Manuel; Comunidad de Madrid; Ministerio de Economía y Competitividad (España)
The automatic detection of deceptive behaviors has recently attracted the attention of the research community due to the variety of areas where it can play a crucial role, such as security or criminology. This work is focused on the development of an automatic deception detection system based on gaze and speech features. The first contribution of our research on this topic is the use of attention Long Short-Term Memory (LSTM) networks for single-modal systems with frame-level features as input. In the second contribution, we propose a multimodal system that combines the gaze and speech modalities into the LSTM architecture using two different combination strategies: Late Fusion and Attention-Pooling Fusion. The proposed models are evaluated over the Bag-of-Lies dataset, a multimodal database recorded in real conditions. On the one hand, results show that attentional LSTM networks are able to adequately model the gaze and speech feature sequences, outperforming a reference Support Vector Machine (SVM)-based system with compact features. On the other hand, both combination strategies produce better results than the single-modal systems and the multimodal reference system, suggesting that gaze and speech modalities carry complementary information for the task of deception detection that can be effectively exploited by using LSTMs.
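Of the two combination strategies, decision-level (late) fusion is the simpler: the class posteriors produced independently by each modality are combined with a convex weight. A minimal sketch (the equal weighting and two-class setup are illustrative; in the paper the per-modality scores come from attention LSTM networks):

```python
import numpy as np

def late_fusion(p_gaze, p_speech, w_gaze=0.5):
    """Decision-level fusion: convex combination of per-modality class
    posteriors, renormalized to sum to one."""
    p = w_gaze * np.asarray(p_gaze) + (1 - w_gaze) * np.asarray(p_speech)
    return p / p.sum(axis=-1, keepdims=True)
```

For example, fusing a confident gaze posterior with a mildly contradictory speech posterior yields a decision dominated by the more confident modality.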
  • Publication
    An attention Long Short-Term Memory based system for automatic classification of speech intelligibility
    (Elsevier, 2020-11) Fernandez Diaz, Miguel; Gallardo Antolín, Ascensión; Ministerio de Ciencia e Innovación (España)
Speech intelligibility can be degraded due to multiple factors, such as noisy environments, technical difficulties or biological conditions. This work is focused on the development of an automatic non-intrusive system for predicting the speech intelligibility level in this latter case. The main contribution of our research on this topic is the use of Long Short-Term Memory (LSTM) networks with log-mel spectrograms as input features for this purpose. In addition, this LSTM-based system is further enhanced by the incorporation of a simple attention mechanism that is able to determine the frames most relevant to this task. The proposed models are evaluated with the UA-Speech database that contains dysarthric speech with different degrees of severity. Results show that the attention LSTM architecture outperforms both a reference Support Vector Machine (SVM)-based system with hand-crafted features and an LSTM-based system with Mean-Pooling.
  • Publication
    Automatic Detection of Depression in Speech Using Ensemble Convolutional Neural Networks
    (MDPI, 2020-06-20) Vázquez Romero, Adrián; Gallardo Antolín, Ascensión; Ministerio de Economía y Competitividad (España)
This paper proposes a speech-based method for automatic depression classification. The system is based on ensemble learning for Convolutional Neural Networks (CNNs) and is evaluated using the data and the experimental protocol provided in the Depression Classification Sub-Challenge (DCC) at the 2016 Audio-Visual Emotion Challenge (AVEC-2016). In the pre-processing phase, speech files are represented as sequences of log-spectrograms and randomly sampled to balance positive and negative samples. For the classification task itself, a suitable architecture based on One-Dimensional Convolutional Neural Networks is first built. Then, several of these CNN-based models are trained with different initializations, and the corresponding individual predictions are fused using an Ensemble Averaging algorithm and combined per speaker to reach a final decision. The proposed ensemble system achieves satisfactory results on the DCC at AVEC-2016 in comparison with a reference system based on Support Vector Machines and hand-crafted features, with a CNN+LSTM-based system called DepAudionet, and with a single CNN-based classifier.
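The ensemble-averaging and per-speaker combination steps can be sketched as follows (the array shapes and the 0.5 decision threshold are illustrative assumptions, not the challenge protocol):

```python
import numpy as np

def ensemble_speaker_decision(model_probs, speaker_ids):
    """Average depression probabilities over an ensemble of models, then
    over all segments belonging to each speaker.

    model_probs: (n_models, n_segments) per-segment probabilities from
    each independently initialized model; speaker_ids: (n_segments,).
    Returns a dict mapping speaker id to a boolean decision."""
    seg_probs = np.mean(model_probs, axis=0)       # ensemble averaging
    decisions = {}
    for spk in np.unique(speaker_ids):
        decisions[spk] = float(seg_probs[speaker_ids == spk].mean()) >= 0.5
    return decisions
```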
  • Publication
    Design of an embedded speech-centric interface for applications in handheld terminals
    (IEEE, 2013-02) Gallardo Antolín, Ascensión; García Moral, Ana Isabel; Pereiro Estevan, Yago; Díaz de María, Fernando; Universidad Carlos III de Madrid; Comunidad de Madrid
The embedded speech-centric interface for handheld wireless devices has been implemented on a commercially available PDA as part of an application that allows real-time access to stock prices through GPRS. In this article, we have focused mainly on the optimization of the ASR subsystem to minimize the use of the handheld's computational resources. This optimization has been accomplished through the fixed-point implementation of all the algorithms involved in the ASR subsystem and the use of PCA to reduce the feature vector dimensionality. The influence of several parameters, such as the Qn resolution of the fixed-point implementation and the number of PCA components retained, has been studied and evaluated in the ASR subsystem, obtaining word recognition rates of around 96% for the best configuration. Finally, a field evaluation of the system has been performed, showing that our design of the speech-centric interface achieves good results in a real-life scenario.
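The Qn resolution mentioned above refers to fixed-point formats with n fractional bits. A minimal sketch of such a conversion (the 16-bit word size and saturation behavior are illustrative assumptions):

```python
def to_qn(x, n, word_bits=16):
    """Quantize a float to a Qn fixed-point value stored in a signed word:
    n fractional bits, saturating at the representable range."""
    scale = 1 << n
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    q = int(round(x * scale))
    return max(lo, min(hi, q))

def from_qn(q, n):
    """Recover the float value represented by a Qn integer."""
    return q / (1 << n)
```

The quantization error is bounded by half a least-significant bit, i.e. 2^-(n+1), which is the trade-off studied when choosing the Qn resolution.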
  • Publication
    Histogram Equalization-Based Features for Speech, Music and Song Discrimination
    (IEEE, 2010-07) Gallardo Antolín, Ascensión; Montero, Juan Manuel; Ministerio de Economía y Competitividad (España); Comunidad de Madrid; Universidad Carlos III de Madrid; Ministerio de Educación, Cultura y Deporte (España)
    In this letter, we present a new class of segment-based features for speech, music and song discrimination. These features, called PHEQ (Polynomial-Fit Histogram Equalization), are derived from the nonlinear relationship between the short-term feature distributions computed at segment level and a reference distribution. Results show that PHEQ characteristics outperform short-term features such as Mel Frequency Cepstrum Coefficients (MFCC) and conventional segment-based ones such as MFCC mean and variance. Furthermore, the combination of short-term and PHEQ features significantly improves the performance of the whole system.
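The PHEQ construction can be sketched as follows: sort the segment's short-term feature values, pair them with the matching quantiles of a reference distribution, and fit a polynomial to that mapping; the polynomial coefficients become the segment-level features. A minimal sketch (the Gaussian reference and polynomial order are illustrative choices):

```python
import numpy as np
from statistics import NormalDist

def pheq_features(segment, order=3):
    """Polynomial-fit histogram equalization sketch: fit a polynomial to
    the mapping between the sorted short-term feature values of a segment
    and the corresponding quantiles of a reference Gaussian distribution.
    Returns the (order + 1) polynomial coefficients as features."""
    x = np.sort(np.asarray(segment, dtype=float))
    n = len(x)
    probs = (np.arange(1, n + 1) - 0.5) / n        # plotting positions
    ref = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.polyfit(x, ref, order)
```

As a sanity check, for data already drawn from the reference distribution the fitted mapping is close to the identity (slope near one, intercept near zero).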
  • Publication
    A simulated annealing approach to speaker segmentation in audio databases
    (Elsevier, 2008-06) Leiva Murillo, José Miguel; Salcedo Sanz, Sancho; Gallardo Antolín, Ascensión; Artés Rodríguez, Antonio
In this paper we present a novel approach to the problem of speaker segmentation, an unavoidable step prior to audio indexing. Mutual information is used for evaluating the accuracy of the segmentation, as the function to be maximized by a simulated annealing (SA) algorithm. We introduce a novel mutation operator for the SA, the Consecutive Bits Mutation operator, which improves the performance of the SA on this problem. We also use the so-called Compaction Factor, which allows the SA to operate in a reduced search space. Our algorithm has been tested on the segmentation of real audio databases and compared to several existing algorithms for speaker segmentation, obtaining very good results in the test problems considered.
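The search procedure can be sketched as a generic simulated annealing maximizer over bit strings (in the paper, bits mark candidate speaker-change points and the score is the mutual information between the audio data and the segmentation; here the score is an arbitrary callable, and the Consecutive Bits Mutation operator is simplified to flipping a short consecutive run of bits):

```python
import math
import random

def simulated_annealing(score, n_bits, t0=1.0, cooling=0.95,
                        steps=2000, seed=0):
    """Maximize score(bits) over bit strings with simulated annealing.
    Mutation flips a short consecutive run of bits; worse candidates are
    accepted with the usual Metropolis probability exp(delta / t)."""
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(n_bits)]
    best, best_val = state[:], score(state)
    cur_val, t = best_val, t0
    for _ in range(steps):
        cand = state[:]
        i = rng.randrange(n_bits)            # flip up to 3 consecutive bits
        for j in range(i, min(i + 3, n_bits)):
            cand[j] ^= 1
        val = score(cand)
        if val >= cur_val or rng.random() < math.exp((val - cur_val) / t):
            state, cur_val = cand, val
            if val > best_val:
                best, best_val = cand[:], val
        t *= cooling
    return best, best_val
```

On a toy objective the annealer quickly converges to (or very near) the optimum.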
  • Publication
    Offline speaker segmentation using genetic algorithms and mutual information
    (IEEE, 2006-04) Salcedo Sanz, Sancho; Gallardo Antolín, Ascensión; Leiva Murillo, José Miguel; Bousoño Calzón, Carlos
    We present an evolutionary approach to speaker segmentation, an activity that is especially important prior to speaker recognition and audio content analysis tasks. Our approach consists of a genetic algorithm (GA), which encodes possible segmentations of an audio record, and a measure of mutual information between the audio data and possible segmentations, which is used as fitness function for the GA. We introduce a compact encoding of the problem into the GA which reduces the length of the GA individuals and improves the GA convergence properties. Our algorithm has been tested on the segmentation of real audio data, and its performance has been compared with several existing algorithms for speaker segmentation, obtaining very good results in all test problems.