Publication:
A Comparison of Open-Source Segmentation Architectures for Dealing with Imperfect Data from the Media in Speech Synthesis

Publication date
2014
Publisher
International Speech Communication Association
Abstract
Traditional text-to-speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. To develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices with a whole new range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music and noise) by filtering out low-quality audio segments and creating single-speaker clusters. In this paper we compare several architectures that combine speaker diarization with music and noise detection and that improve the precision and overall quality of the segmentation.
Description
Proceedings of: 15th Annual Conference of the International Speech Communication Association. Singapore, September 14-18, 2014.
Keywords
Diarization, Audio segmentation, Expressive text-to-speech, Media recordings
Bibliographic citation
Li, Haizhou, et al. (eds). (2014). INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014. (pp. 2370-2374). International Speech Communication Association.