Personalized (almost) end-to-end speech synthesis
Text-to-speech (TTS) systems have traditionally encoded linguistic and acoustic domain knowledge in the form of vast codebases, hand-crafted rules, and statistical models. Recent advances in machine learning have led to the gradual replacement of individual components of such systems with neural networks. This talk highlights the most important aspects of this shift towards end-to-end synthesis, where (almost) the entire process of generating waveforms from text is performed by a neural network that infers domain knowledge exclusively from data. The mechanics of prominent model architectures such as WaveNet and Tacotron are presented, and specific challenges of personalized speech synthesis, such as speaker adaptation and multi-speaker models, are also addressed.