This page is the demo of

  1. “A cyclical post-filtering approach to mismatch refinement of neural vocoder for text-to-speech systems” [paper] [highlight] [YouTube]

Abstract

Building an advanced TTS system from scratch is time- and resource-consuming. Therefore, we propose an economical post-filtering approach for existing low-cost TTS systems. However, this approach usually suffers from two issues: 1) temporal mismatches between TTS and natural waveforms and 2) acoustic mismatches between training and testing data. To address these issues, we adopt a cyclic voice conversion (Cycle-VC) model to generate temporally matched pseudo-VC data for training and acoustically matched enhanced data for testing the proposed neural-post-filter (NPF). Because of its generality, this framework can be applied to arbitrary TTS systems.

Acoustic mismatch

Temporal mismatch

Cycle-VC

NPF training

NPF testing

Demo Sound

Vocoder Female Male
Natural
DNN
WN (UB *1)
WN (AM *2)
WN (TM *3)
WN (NPF *4)
PWG (UB *1)
PWG (AM *2)
PWG (TM *3)
PWG (NPF *4)

*1. UB: upper bound (natural features)
*2. AM: acoustic mismatch
*3. TM: temporal mismatch
*4. NPF: neural-post-filter
WN: WaveNet vocoder
PWG: Parallel WaveGAN vocoder


Vocoder Female Male
Natural
HMM
WN (UB *1)
WN (AM *2)
WN (TM *3)
WN (NPF *4)
PWG (UB *1)
PWG (AM *2)
PWG (TM *3)
PWG (NPF *4)


Subjective Results


Relative distances on MCD-plane
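
This page does not state how the MCD values were computed. As a minimal sketch, the commonly used definition of mel-cepstral distortion (MCD, in dB) between two time-aligned mel-cepstrum sequences can be written as follows. The function name `mel_cepstral_distortion` is hypothetical, and the sketch assumes that both sequences already have the same number of frames and that the 0th (energy) coefficient has been excluded.

```python
import numpy as np


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB between two time-aligned mel-cepstrum sequences.

    ref, syn: array-likes of shape (frames, dims).
    Assumption: frames are already aligned and the 0th (energy)
    coefficient has been removed by the caller.
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    if ref.shape != syn.shape:
        # Temporally mismatched sequences must be aligned before comparison.
        raise ValueError("sequences must be time-aligned and have equal shape")
    diff = ref - syn
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    # Average over frames to get one value per utterance.
    return float(np.mean(per_frame))
```

The shape check reflects the temporal-mismatch issue named in the abstract: a frame-wise distance such as MCD is only meaningful when the two sequences are temporally matched.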





