Publication:
Multimodal Affective Computing in Wearable Devices with Applications in the Detection of Gender-based Violence

Publication date
2022-12-15
Defense date
2023-01-30
Abstract
According to the World Health Organization (WHO), 1 in 3 women suffers physical or sexual violence at some point in her life, reflecting the global impact of Gender-based Violence (GBV). In Spain alone, more than 1,100 women were murdered as victims of gender-based violence between 2003 and 2022. There is an urgent need for solutions to this prevailing problem in our society, and they may involve appropriate investment in technological research alongside legislative, educational, and economic efforts. An Artificial Intelligence (AI) driven solution that comprehensively analyses aspects such as the person's emotional state, together with the external context (e.g., circumstances, location), and thereby automatically detects when a woman's safety is in danger, could provide an automatic and fast response to protect her. This PhD thesis thus stems from the need to detect gender-based violence risk situations for women, addressing the problem from a multidisciplinary point of view by bringing together AI technologies and a gender perspective. More specifically, we focus on the auditory modality, analysing the user's speech, since voice can be recorded unobtrusively and serves both as a personal identifier and as an indicator of the affective states reflected in it.

The immediate human response to a situation of risk or danger is the fight-flight-freeze response. Several physiological changes affect the body: breathing, heart rate, and muscle activation, including the complex speech production apparatus, all of which alter vocalisation characteristics and hence speech production. Because these changes are involuntary consequences of being at risk, we rely on physiological signals such as pulse, perspiration, and respiration, as well as speech, to detect a person's emotional state, with the aim of recognising fear as a possible consequence of a threatening situation. To this end, we developed "Bindi": an end-to-end, AI-driven, inconspicuous, connected, edge-computing, wearable solution targeting the automatic detection of GBV situations. It consists of two smart devices that monitor the physiological variables and the acoustic environment of an individual, including her voice, connected to a smartphone and a cloud system able to call for help.

Ideally, a Machine Learning or Deep Learning system for the automatic detection of risk situations from auditory data would be trained on speech from the target user recorded under realistic conditions. From the outset, however, we faced a lack of suitable data: no speech datasets of real (not acted) fear were available in the literature. Real, original, spontaneous, in-the-wild emotional speech is the ideal category for our application. We therefore chose stress as the emotion closest to the target scenario for which data could be collected, allowing us to flesh out the algorithms and acquire the necessary knowledge. We describe and justify the datasets containing such emotion that serve as the starting point of our investigation, and the need to create our own set of datasets to fill this gap in the literature.
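To make the feature side of such a pipeline concrete, the minimal Python sketch below shows one way windowed physiological and speech descriptors could feed a binary fear/no-fear classifier. The feature choices, window summaries, and the RandomForestClassifier are our own illustrative assumptions, not the actual Bindi implementation.

    # Illustrative sketch only: hand-crafted physiological and speech features
    # for a fear/no-fear classifier. Feature set and classifier are assumptions
    # for illustration, not the thesis's actual pipeline.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def physio_features(ibi_s, eda_uS):
        """Summarise one window of inter-beat intervals (s) and skin conductance (uS)."""
        hr = 60.0 / np.mean(ibi_s)                # mean heart rate (bpm)
        hrv = np.std(np.diff(ibi_s))              # crude heart-rate variability proxy
        eda_level = np.mean(eda_uS)               # tonic electrodermal level
        eda_slope = np.polyfit(np.arange(len(eda_uS)), eda_uS, 1)[0]  # phasic trend
        return np.array([hr, hrv, eda_level, eda_slope])

    def speech_features(frame_energy, f0_hz):
        """Summarise frame-level energy and pitch (voiced frames only)."""
        return np.array([np.mean(frame_energy), np.std(frame_energy),
                         np.mean(f0_hz), np.std(f0_hz)])

    def window_vector(ibi_s, eda_uS, frame_energy, f0_hz):
        """Concatenate both modalities into one feature vector per analysis window."""
        return np.concatenate([physio_features(ibi_s, eda_uS),
                               speech_features(frame_energy, f0_hz)])

    # X: one row per window; y: 1 = fear-like state, 0 = neutral (e.g., from self-reports)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # clf.fit(X_train, y_train); p_fear = clf.predict_proba(X_test)[:, 1]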
Members of our UC3M4Safety team then captured the UC3M4Safety Audiovisual Stimuli Database, a set of 42 audiovisual stimuli designed to elicit emotions. Using it, we contributed WEMAC to the community: a multimodal dataset collected in a laboratory experiment in which women volunteers were exposed to the UC3M4Safety Audiovisual Stimuli Database through a virtual reality headset, aiming to induce real emotions while their physiological signals, speech, and self-reports were recorded. However, recording realistic, spontaneous emotional speech under fearful conditions is very difficult, if not impossible. To get as close as possible to these conditions, and hopefully record fearful speech, the UC3M4Safety team created the WE-LIVE database, collecting physiological, auditory, and contextual signals from women in real-life conditions, together with labels of their emotional reactions to everyday events, using the current Bindi system (wristband, pendant, mobile application, and server).

To detect GBV risk situations through speech, we first need to detect the voice of the specific user among all the information contained in the audio signal: a speaker recognition task. We aim to track the user's voice, separating it from the other speakers in the acoustic scene, while limiting the influence of emotions and ambient noise, which could be detrimental to identification. We study speaker recognition systems under two variability conditions: 1) speaker identification under stress, to measure how much stress affects speaker recognition systems, and 2) speaker recognition under real-life noisy conditions, isolating the speaker's identity from all the additional information in the audio signal.

We also dive into the development of the Bindi system for the recognition of fear-related emotions. We describe the architectures of Bindi versions 1.0 and 2.0, the evolution from one to the other, and their implementation. We explain the design of a cascade multimodal system for Bindi 1.0, and of a complete Internet of Things system with edge, fog, and cloud computing components for Bindi 2.0, detailing how the intelligence architectures in the Bindi devices were designed for fear detection in the user. We first perform monomodal inference, targeting the detection of realistic stress through speech. Then, as the core experimentation, we work with WEMAC on fear detection using data fusion strategies. The experimental results show an average fear recognition accuracy of 63.61% with the Leave-hAlf-Subject-Out (LASO) method, a speaker-adapted, subject-dependent training classification strategy. To the best of the UC3M4Safety team's knowledge, this is the first time a multimodal fusion of physiological and speech data for fear recognition has been presented in this GBV context, and the first time a LASO model combining fear recognition, multisensory signal fusion, and virtual reality stimuli has been reported. We also explored whether the gender-based violence victim condition could be detected from speech paralinguistic cues alone. Overall, this thesis explores the use of audio technology and artificial intelligence to prevent and combat gender-based violence.
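As an illustration of the evaluation protocol, the sketch below shows one plausible reading of the Leave-hAlf-Subject-Out (LASO) split described above (half of each subject's samples used for training, the rest held out for testing, making the model speaker-adapted and subject-dependent), together with a simple decision-level fusion of per-modality fear probabilities. The helper names, the random halving, and the fusion weight alpha are assumptions for illustration, not the thesis's exact method.

    # Minimal sketch of a LASO-style split plus decision-level fusion.
    # The split policy and fusion weight are illustrative assumptions.
    import numpy as np

    def laso_split(subject_ids, seed=0):
        """Boolean train mask: half of each subject's samples, chosen at random."""
        rng = np.random.default_rng(seed)
        train = np.zeros(len(subject_ids), dtype=bool)
        for s in np.unique(subject_ids):
            idx = np.flatnonzero(subject_ids == s)
            rng.shuffle(idx)
            train[idx[: len(idx) // 2]] = True
        return train

    def late_fusion(p_physio, p_speech, alpha=0.5):
        """Weighted average of per-modality fear probabilities (decision-level fusion)."""
        return alpha * p_physio + (1 - alpha) * p_speech

    # Example usage with two already-trained per-modality models:
    # train = laso_split(subject_ids)
    # fused = late_fusion(physio_model.predict_proba(X_phys[~train])[:, 1],
    #                     speech_model.predict_proba(X_spch[~train])[:, 1])
    # accuracy = np.mean((fused > 0.5) == y[~train])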
We hope to have lit the way for the speech community and beyond, and that our experimentation, findings, and conclusions can help future research. The ultimate goal of this work is to ignite the community's interest in developing solutions to the very challenging problem of GBV.
Description
International Mention in the doctoral degree (Mención Internacional en el título de doctor)
Keywords
Affective computing, Gender-based violence, Speech emotion recognition