ASR Feature Extraction with Morphologically-Filtered Power-Normalized Cochleograms

International Speech Communication Association
In this paper we present advances in modeling the masking behavior of the human auditory system to make the feature extraction stage of automatic speech recognition more robust. The approach applies a non-linear filter to a spectro-temporal representation, acting simultaneously along the frequency and time axes, by treating the representation as an image and processing it with mathematical morphology operations. A particularly important component of this architecture is the so-called structuring element: in this contribution, biologically based considerations guide the design of an element that closely resembles the masking phenomena taking place in the cochlea. The second contribution is the choice of the underlying spectro-temporal representation. The best results were achieved with the representation introduced as part of the Power Normalized Cepstral Coefficients (PNCC), combined with a spectral subtraction step. On the Aurora 2 noisy continuous digits task, we report relative error reductions of 18.7% compared to PNCC and 39.5% compared to MFCC.
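To illustrate what filtering a spectro-temporal representation "as an image" can look like, here is a minimal sketch of a grayscale morphological opening (erosion followed by dilation) with a flat structuring element. It is not the authors' implementation: the function names, the toy 5x5 "spectrogram" (rows read as frequency channels, columns as time frames), and the symmetric cross-shaped element are all hypothetical placeholders. The abstract describes a biologically motivated element that resembles cochlear masking, and its actual shape is not reproduced here.

```python
def active_offsets(selem):
    """Offsets (d_row, d_col) of the non-zero cells, relative to the element's centre."""
    rows, cols = len(selem), len(selem[0])
    centre_r, centre_c = rows // 2, cols // 2
    return [(r - centre_r, c - centre_c)
            for r in range(rows) for c in range(cols) if selem[r][c]]


def _filter(image, offsets, pick):
    """Apply `pick` (min or max) over the neighbourhood given by `offsets`.

    Neighbours that fall outside the image are simply ignored.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [image[r + dr][c + dc] for dr, dc in offsets
                      if 0 <= r + dr < rows and 0 <= c + dc < cols]
            out_row.append(pick(window) if window else image[r][c])
        out.append(out_row)
    return out


def erode(image, selem):
    """Flat grayscale erosion: minimum over the structuring element."""
    return _filter(image, active_offsets(selem), min)


def dilate(image, selem):
    """Flat grayscale dilation: maximum over the reflected structuring element."""
    reflected = [(-dr, -dc) for dr, dc in active_offsets(selem)]
    return _filter(image, reflected, max)


def opening(image, selem):
    """Opening = erosion followed by dilation with the same element."""
    return dilate(erode(image, selem), selem)


# Hypothetical placeholder element: a symmetric 3x3 cross.
CROSS = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]

# Toy "spectrogram": one isolated spike (9) and one cross-shaped energy blob (5).
spectrogram = [[9, 0, 0, 0, 0],
               [0, 0, 5, 0, 0],
               [0, 5, 5, 5, 0],
               [0, 0, 5, 0, 0],
               [0, 0, 0, 0, 0]]

opened = opening(spectrogram, CROSS)
```

In this toy case the opening removes the isolated spike, which the element cannot fit inside, while the blob that matches the element's shape survives unchanged. An asymmetric element would treat the time and frequency directions differently, which is why `dilate` reflects the offsets rather than reusing them as-is.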
Proceedings of: 15th Annual Conference of the International Speech Communication Association. Singapore, September 14-18, 2014.
Spectro-temporal processing, Morphological filtering, Automatic speech recognition, Auditory-based features, PNCC
Bibliographic citation
Li, Haizhou, et al. (eds). (2014). INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014. (pp. 2430-2434). International Speech Communication Association.