Publication:
On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification

dc.affiliation.dpto: UC3M. Departamento de Teoría de la Señal y Comunicaciones
dc.affiliation.grupoinv: UC3M. Grupo de Investigación: Procesado Multimedia
dc.contributor.author: Gallardo Antolín, Ascensión
dc.contributor.author: Montero, Juan Manuel
dc.contributor.funder: Ministerio de Economía y Competitividad (España)
dc.date.accessioned: 2021-11-29T10:41:41Z
dc.date.available: 2022-01-17T15:55:26Z
dc.date.issued: 2021-10-07
dc.description.abstract: Speech intelligibility can be affected by multiple factors, such as noisy environments, channel distortions or physiological issues. In this work, we deal with the problem of automatically predicting the speech intelligibility level in this latter case. Starting from our previous work, a non-intrusive system based on LSTM networks with an attention mechanism designed for this task, we present two main contributions. First, we propose the use of per-frame modulation spectrograms as input features, instead of compact representations derived from them that discard important temporal information. Second, we explore two different strategies for combining per-frame acoustic log-mel and modulation spectrograms within the LSTM framework: at the decision level (late fusion) and at the utterance level (Weighted-Pooling, or WP, fusion). The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. On the one hand, results show that attentional LSTM networks are able to adequately model modulation spectrogram sequences, producing classification rates similar to those obtained with log-mel spectrograms. On the other hand, both combination strategies, late and WP fusion, outperform the single-feature systems, suggesting that per-frame log-mel and modulation spectrograms carry complementary information for the task of speech intelligibility prediction that can be effectively exploited by the LSTM-based architectures; the system combining the WP fusion strategy with Attention-Pooling achieves the best results.
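The two combination strategies named in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's architecture: the feature dimensions, random "features", dot-product attention scorer, and three-class posteriors are all illustrative assumptions (the actual system applies attention over LSTM hidden states).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frames, w):
    """Collapse a (T, d) sequence of per-frame features into one
    utterance-level (d,) vector via attention weights from a simple
    dot-product scorer (an illustrative stand-in for attention over
    LSTM hidden states)."""
    alpha = softmax(frames @ w)   # (T,) attention weights, sum to 1
    return alpha @ frames         # (d,) weighted average of frames

rng = np.random.default_rng(0)
# Hypothetical per-frame features for one utterance (T = 120 frames)
logmel = rng.normal(size=(120, 64))    # 64 log-mel bands (assumed)
modspec = rng.normal(size=(120, 48))   # 48 modulation bins (assumed)

w_mel = rng.normal(size=64)            # toy attention parameters
w_mod = rng.normal(size=48)

# Utterance-level (WP) fusion: pool each stream with attention,
# then concatenate the embeddings before a shared classifier.
utt = np.concatenate([attention_pool(logmel, w_mel),
                      attention_pool(modspec, w_mod)])   # shape (112,)

# Decision-level (late) fusion: combine per-stream class posteriors,
# here by simple averaging over the intelligibility classes.
p_mel = np.array([0.2, 0.3, 0.5])      # toy posteriors, 3 severity levels
p_mod = np.array([0.4, 0.3, 0.3])
p_late = (p_mel + p_mod) / 2           # still a valid distribution
```

The sketch makes the structural difference concrete: WP fusion merges the streams before classification, while late fusion merges only their output probabilities.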
dc.description.sponsorship: The work leading to these results has been partly supported by the Spanish Government-MinECo under Projects TEC2017-84395-P and TEC2017-84593-C2-1-R.
dc.description.status: Publicado (Published)
dc.format.extent: 11
dc.identifier.bibliographicCitation: Neurocomputing, (2021), v. 456, pp. 49-60
dc.identifier.doi: https://doi.org/10.1016/j.neucom.2021.05.065
dc.identifier.issn: 0925-2312
dc.identifier.publicationfirstpage: 49
dc.identifier.publicationlastpage: 60
dc.identifier.publicationtitle: Neurocomputing
dc.identifier.publicationvolume: 456
dc.identifier.uri: https://hdl.handle.net/10016/33704
dc.identifier.uxxi: AR/0000028622
dc.language.iso: eng
dc.publisher: Elsevier
dc.relation.projectID: Gobierno de España. TEC2017-84395-P
dc.relation.projectID: Gobierno de España. TEC2017-84593-C2-1-R
dc.relation.projectID: AT-2021
dc.rights: © 2021 The Authors.
dc.rights: Atribución-NoComercial-SinDerivadas 3.0 España
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/es/
dc.subject.eciencia: Telecomunicaciones
dc.subject.other: Speech intelligibility
dc.subject.other: LSTM
dc.subject.other: Attention model
dc.subject.other: Acoustic spectrogram
dc.subject.other: Modulation spectrogram
dc.subject.other: Fusion
dc.title: On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification
dc.type: research article
dc.type.hasVersion: VoR
dspace.entity.type: Publication
Files
Original bundle
Name: combining_gallardo_N_2021.pdf
Size: 2.19 MB
Format: Adobe Portable Document Format