Publication: Multimodal Affective Computing in Wearable Devices with Applications in the Detection of Gender-based Violence
Publication date
2022-12-15
Defense date
2023-01-30
Abstract
According to the World Health Organization (WHO), 1 in 3 women suffers physical or sexual violence at some point in her life, a figure that reflects the global impact of Gender-based Violence (GBV). In Spain alone, more than 1,100 women were murdered as victims of
gender-based violence between 2003 and 2022.
There is an urgent need for solutions to this prevailing problem in our society.
Alongside legislative, educational and economic efforts, they should include
appropriate investment in technological research. An Artificial Intelligence (AI)
driven solution that comprehensively analysed aspects such as the person's
emotional state, together with the context or external situation (e.g. circumstances,
location), and thereby automatically detected when a woman's safety is in danger,
could provide a fast, automatic response to protect her.
Thus, this PhD thesis stems from the need to detect gender-based violence risk
situations for women, addressing the problem from a multidisciplinary point of
view by bringing together AI technologies and a gender perspective. More
specifically, we focus on the auditory modality, analysing speech data produced
by the user: voice can be recorded unobtrusively, serves as a personal identifier,
and reflects the speaker's affective state.
The immediate reaction of a human being in a situation of risk or danger is the
fight-flight-freeze response. Several physiological changes affect the body:
breathing, heart rate, and muscle activation, including the complex speech
production apparatus, which alters vocalisation characteristics and therefore
speech itself. Because these physical and physiological changes are involuntary
consequences of being at risk, we chose to rely on physiological signals such as
pulse, perspiration and respiration, and also on speech, to detect the emotional
state of a person, with the aim of recognising fear as a possible consequence of
being in a threatening situation. To this end, we developed "Bindi": an
end-to-end, AI-driven, inconspicuous, connected, edge-computing, wearable
solution targeting the automatic detection of GBV situations. It consists of two
smart devices that monitor the wearer's physiological variables and acoustic
environment, including her voice, connected to a smartphone and a cloud system
able to call for help.
Ideally, to build a Machine Learning or Deep Learning system for the automatic
detection of risk situations from auditory data, we would like to have speech
from the target user recorded under realistic conditions.
In our first steps, we encountered the lack of suitable data: no speech datasets
of real (not acted) fear were available in the literature. Real, original,
spontaneous, in-the-wild emotional speech is what our application ideally
requires. We therefore chose stress, the emotion closest to the target scenario
for which data collection is feasible, in order to flesh out the algorithms and
acquire the knowledge needed. Thus, we describe and justify the use of datasets
containing this emotion as the starting point of our investigation, and we
describe the need to create our own datasets to fill this gap in the literature.
Members of our UC3M4Safety team then captured the UC3M4Safety
Audiovisual Stimuli Database, a dataset of 42 audiovisual stimuli to elicit emotions.
Using these stimuli, we contributed to the community with the collection of
WEMAC, a multimodal dataset built from a laboratory experiment in which women
volunteers were exposed to the UC3M4Safety Audiovisual Stimuli Database.
It aims to induce real emotions through a virtual reality headset while the
user's physiological signals, speech and self-reports are collected.
However, recording realistic, spontaneous emotional speech under fearful
conditions is very difficult, if not impossible. To get as close as possible to
these conditions, and hopefully record fearful speech, the UC3M4Safety team
created the WE-LIVE database. With it we collected physiological, auditory and
contextual signals from women in real-life conditions, together with labels of
their emotional reactions to everyday events in their lives, using the current
Bindi system (wristband, pendant, mobile application and server).
To detect GBV risk situations through speech, we first need to detect the voice
of the specific user of interest among all the information contained in the audio
signal, a speaker recognition task. We thus aim to track the user's voice,
separating it from the other speakers in the acoustic scene, while minimising the
influence of emotions and ambient noise, as these factors can degrade speaker
identification.
We study speaker recognition systems under two variability conditions:
1) speaker identification under stress, to quantify how much stress affects
speaker recognition systems, and 2) speaker recognition under real-life noisy
conditions, isolating the speaker's identity from the other information
contained in the audio signal.
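The speaker identification task just described can be illustrated with a minimal sketch: each enrolled speaker is represented by an embedding vector, and a test utterance is assigned to the closest enrolment by cosine similarity. The 3-dimensional vectors and the drift values below are invented toy numbers (real systems use embeddings with hundreds of dimensions); this is not the thesis's actual pipeline, only an illustration of why emotional or noisy speech, which shifts the embedding, can hurt identification.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def identify(test_emb, enrolled):
    """Return the enrolled speaker whose embedding is closest to test_emb."""
    return max(enrolled, key=lambda spk: cosine(test_emb, enrolled[spk]))

# Hypothetical enrolment embeddings for two speakers.
enrolled = {"user": [0.9, 0.1, 0.2], "other": [0.1, 0.8, 0.4]}

# A neutral utterance close to the user's enrolment is identified correctly...
print(identify([0.85, 0.15, 0.25], enrolled))  # -> user
# ...while a stressed/noisy utterance drifted away from it is misidentified.
print(identify([0.3, 0.6, 0.5], enrolled))     # -> other
```

The sketch makes the two studied variability conditions concrete: anything that moves the test embedding away from the enrolment (stress, ambient noise) pushes the decision toward the wrong speaker.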
We also dive into the development of the Bindi system for the recognition of
fear-related emotions. We describe the architectures of Bindi versions 1.0 and
2.0, the evolution from the former to the latter, and their implementation. We
explain the design of a cascade multimodal system for Bindi 1.0, and of a
complete Internet of Things system with edge, fog and cloud computing components
for Bindi 2.0, detailing in particular how we designed the intelligence
architectures in the Bindi devices for fear detection in the user.
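The cascade idea can be sketched as follows. This is not Bindi's actual implementation: the signal names, thresholds and the rule-based first stage are all hypothetical placeholders, chosen only to show how a lightweight physiological stage on the wearable can gate a heavier speech-based stage.

```python
def physio_stage(heart_rate, gsr, hr_thresh=100.0, gsr_thresh=2.0):
    """Stage 1: cheap rule on physiological signals (e.g. on a wristband).
    Thresholds are illustrative placeholders, not real calibrated values."""
    return heart_rate > hr_thresh and gsr > gsr_thresh

def speech_stage(fear_prob, prob_thresh=0.5):
    """Stage 2: decision from a speech fear-probability score, assumed to
    come from some classifier running on a more capable device."""
    return fear_prob > prob_thresh

def cascade(heart_rate, gsr, fear_prob):
    """Only run (or trust) the heavier speech stage when stage 1 fires."""
    return physio_stage(heart_rate, gsr) and speech_stage(fear_prob)

print(cascade(heart_rate=118, gsr=3.1, fear_prob=0.8))  # both stages fire
print(cascade(heart_rate=72, gsr=0.5, fear_prob=0.8))   # calm physiology gates it off
```

The design motivation is the usual one for edge/fog/cloud splits: the always-on first stage must be cheap enough for a battery-powered wearable, while the costlier analysis runs only when physiology suggests a possible risk event.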
We then perform monomodal inference, first targeting the detection of realistic
stress through speech. Later, as the core experimentation, we work with WEMAC on
the task of fear detection using data fusion strategies. The experimental results
show an average fear recognition accuracy of 63.61% with the Leave-hAlf-Subject-Out
(LASO) method, a speaker-adapted, subject-dependent training and classification
strategy. To the best of the UC3M4Safety team's knowledge, this is the first time
that a multimodal fusion of physiological and speech data for fear recognition has
been reported in this GBV context, and the first time a LASO model combining fear
recognition, multisensorial signal fusion and virtual reality stimuli has been
presented. We also explored whether the gender-based violence victim condition
could be detected from speech paralinguistic cues alone.
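The LASO splitting idea can be sketched in a few lines, assuming half of each subject's samples goes to training and the remaining half to testing, so that every subject is seen at training time (subject-dependent, speaker-adapted) while evaluation still uses unseen samples. The exact splitting procedure used in the thesis may differ; subject identifiers and sample names below are invented.

```python
def laso_split(samples_by_subject):
    """Leave-hAlf-Subject-Out sketch: for every subject, the first half of
    their samples goes to training and the second half to testing."""
    train, test = [], []
    for subject, samples in samples_by_subject.items():
        half = len(samples) // 2
        train += [(subject, s) for s in samples[:half]]
        test += [(subject, s) for s in samples[half:]]
    return train, test

# Toy data: two hypothetical volunteers with a few recordings each.
data = {"w01": ["a1", "a2", "a3", "a4"], "w02": ["b1", "b2"]}
train, test = laso_split(data)
print(train)  # [('w01', 'a1'), ('w01', 'a2'), ('w02', 'b1')]
print(test)   # [('w01', 'a3'), ('w01', 'a4'), ('w02', 'b2')]
```

Contrast this with a leave-one-subject-out split, where a whole subject is withheld: LASO trades subject independence for per-subject adaptation, which is reasonable for a personal wearable trained on its own wearer.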
Overall, this thesis explores the use of audio technology and artificial
intelligence to prevent and combat gender-based violence. We hope to have paved
the way for such work in the speech community and beyond, and that our
experimentation, findings and conclusions can help future research. The ultimate
goal of this work is to spark the community's interest in developing solutions
to the very challenging problem of GBV.
Description
International Mention in the doctoral degree (Mención Internacional en el título de doctor)
Keywords
Affective computing, Gender-based violence, Speech emotion recognition