Publication:
On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification

dc.affiliation.dpto: UC3M. Departamento de Teoría de la Señal y Comunicaciones
dc.affiliation.grupoinv: UC3M. Grupo de Investigación: Procesado Multimedia
dc.contributor.author: Gallardo Antolín, Ascensión
dc.contributor.author: Montero, Juan Manuel
dc.contributor.funder: Ministerio de Economía y Competitividad (España)
dc.date.accessioned: 2021-11-29T10:41:41Z
dc.date.available: 2022-01-17T15:55:26Z
dc.date.issued: 2021-10-07
dc.description.abstract: Speech intelligibility can be affected by multiple factors, such as noisy environments, channel distortions or physiological issues. In this work, we deal with the problem of automatically predicting the speech intelligibility level in this latter case. Starting from our previous work, a non-intrusive system based on LSTM networks with an attention mechanism designed for this task, we present two main contributions. First, we propose the use of per-frame modulation spectrograms as input features, instead of compact representations derived from them that discard important temporal information. Second, we explore two different strategies for combining per-frame acoustic log-mel and modulation spectrograms within the LSTM framework: at the decision level (late fusion) and at the utterance level (Weighted-Pooling, or WP, fusion). The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. On the one hand, results show that attentional LSTM networks are able to adequately model modulation spectrogram sequences, producing classification rates similar to those obtained with log-mel spectrograms. On the other hand, both combination strategies, late and WP fusion, outperform the single-feature systems, suggesting that per-frame log-mel and modulation spectrograms carry complementary information for the task of speech intelligibility prediction that can be effectively exploited by the LSTM-based architectures; the system combining the WP fusion strategy with Attention-Pooling achieves the best results.
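The two combination strategies named in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's architecture: the feature dimensions, random "features", dot-product attention scorer, and three-class posteriors are all illustrative assumptions (the actual system applies attention over LSTM hidden states).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frames, w):
    """Collapse a (T, d) sequence of per-frame features into one
    utterance-level (d,) vector via attention weights from a simple
    dot-product scorer (an illustrative stand-in for attention over
    LSTM hidden states)."""
    alpha = softmax(frames @ w)   # (T,) attention weights, sum to 1
    return alpha @ frames         # (d,) weighted average of frames

rng = np.random.default_rng(0)
# Hypothetical per-frame features for one utterance (T = 120 frames)
logmel = rng.normal(size=(120, 64))    # 64 log-mel bands (assumed)
modspec = rng.normal(size=(120, 48))   # 48 modulation bins (assumed)

w_mel = rng.normal(size=64)            # toy attention parameters
w_mod = rng.normal(size=48)

# Utterance-level (WP) fusion: pool each stream with attention,
# then concatenate the embeddings before a shared classifier.
utt = np.concatenate([attention_pool(logmel, w_mel),
                      attention_pool(modspec, w_mod)])   # shape (112,)

# Decision-level (late) fusion: combine per-stream class posteriors,
# here by simple averaging over the intelligibility classes.
p_mel = np.array([0.2, 0.3, 0.5])      # toy posteriors, 3 severity levels
p_mod = np.array([0.4, 0.3, 0.3])
p_late = (p_mel + p_mod) / 2           # still a valid distribution
```

The sketch makes the structural difference concrete: WP fusion merges the streams before classification, while late fusion merges only their output probabilities.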
dc.description.sponsorship: The work leading to these results has been partly supported by the Spanish Government-MinECo under Projects TEC2017-84395-P and TEC2017-84593-C2-1-R.
dc.description.status: Publicado (Published)
dc.format.extent: 11
dc.identifier.bibliographicCitation: Neurocomputing, (2021), v. 456, pp. 49-60
dc.identifier.doi: https://doi.org/10.1016/j.neucom.2021.05.065
dc.identifier.issn: 0925-2312
dc.identifier.publicationfirstpage: 49
dc.identifier.publicationlastpage: 60
dc.identifier.publicationtitle: Neurocomputing
dc.identifier.publicationvolume: 456
dc.identifier.uri: https://hdl.handle.net/10016/33704
dc.identifier.uxxi: AR/0000028622
dc.language.iso: eng
dc.publisher: Elsevier
dc.relation.projectID: Gobierno de España. TEC2017-84395-P
dc.relation.projectID: Gobierno de España. TEC2017-84593-C2-1-R
dc.relation.projectID: AT-2021
dc.rights: © 2021 The Authors.
dc.rights: Atribución-NoComercial-SinDerivadas 3.0 España
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/es/
dc.subject.eciencia: Telecomunicaciones
dc.subject.other: Speech intelligibility
dc.subject.other: LSTM
dc.subject.other: Attention model
dc.subject.other: Acoustic spectrogram
dc.subject.other: Modulation spectrogram
dc.subject.other: Fusion
dc.title: On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification
dc.type: research article
dc.type.hasVersion: VoR
dspace.entity.type: Publication
Files
Original bundle
Name: combining_gallardo_N_2021.pdf
Size: 2.19 MB
Format: Adobe Portable Document Format