A Comparison of Open-Source Segmentation Architectures for Dealing with Imperfect Data from the Media in Speech Synthesis

Authors: Gallardo Antolín, Ascensión; Montero, Juan Manuel; King, Simon
Publication year: 2014
Record date: 2015-07-30
Citation: Li, Haizhou, et al. (eds). (2014). INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014. (pp. 2370-2374). International Speech Communication Association.
ISBN: 9781634394352
URI: https://hdl.handle.net/10016/21478
Description: Proceedings of: 15th Annual Conference of the International Speech Communication Association. Singapore, September 14-18, 2014.

Abstract: Traditional Text-To-Speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. To develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices with a whole new range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music and noise), filtering out low-quality audio segments and creating mono-speaker clusters. In this paper we compare several architectures for combining speaker diarization with music and noise detection, which improve the precision and overall quality of the segmentation.

Extent: 5 pages (pp. 2370-2374)
Format: application/pdf
Language: eng
Rights: © 2014 ISCA; open access
Keywords: Diarization; Audio segmentation; Expressive text-to-speech; Media recordings
Type: conference poster
Subject area: Telecommunications
Identifier: CC/0000022424