This page is the demo of
- “A cyclical post-filtering approach to mismatch refinement of neural vocoder for text-to-speech systems” [paper] [highlight] [YouTube]
Abstract
Building an advanced TTS system from scratch is time- and resource-consuming. Therefore, we propose an economical post-filtering approach for existing low-cost TTS systems. However, this approach usually suffers from two issues: 1) temporal mismatches between TTS and natural waveforms and 2) acoustic mismatches between training and testing data. To address these issues, we adopt a cyclic voice conversion (Cycle-VC) model to generate temporally matched pseudo-VC data for training and acoustically matched enhanced data for testing the proposed neural-post-filter (NPF). Because of its generality, this framework can be applied to arbitrary TTS systems.
Acoustic mismatch
- Training the NPF with natural features and waveforms
- Testing the NPF with the synthetic features extracted from TTS-speech
Temporal mismatch
- Training the NPF with the synthetic features and natural waveforms
Cycle-VC
- Source: synthetic features
- Target: natural features
- Enhanced path: synthetic -> natural conversion
- Pseudo VC path: natural -> synthetic -> natural conversion
NPF training
- The temporal structures of the pseudo VC features and natural waveforms are matched
NPF testing
- The acoustic characteristics of the pseudo VC and enhanced features are similar
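The two Cycle-VC paths and their roles in NPF training and testing can be sketched roughly as follows. This is only an illustration of the data flow listed above, not the authors' implementation: `nat_to_syn` and `syn_to_nat` are hypothetical stand-ins for the two conversion directions of a trained Cycle-VC model, the toy converters are simple frame-wise scalings, and features are plain lists of per-frame vectors. It also assumes that both conversions keep the number of frames unchanged, which is what lets the pseudo VC features stay aligned with the natural waveform.

```python
from typing import Callable, List, Tuple

Frame = List[float]
Features = List[Frame]
Converter = Callable[[Features], Features]


def pseudo_vc_path(natural: Features, nat_to_syn: Converter,
                   syn_to_nat: Converter) -> Features:
    """Pseudo VC path: natural -> synthetic -> natural conversion."""
    return syn_to_nat(nat_to_syn(natural))


def enhanced_path(synthetic: Features, syn_to_nat: Converter) -> Features:
    """Enhanced path: synthetic -> natural conversion."""
    return syn_to_nat(synthetic)


def make_npf_training_pair(natural_feats: Features, natural_wave: List[float],
                           nat_to_syn: Converter,
                           syn_to_nat: Converter) -> Tuple[Features, List[float]]:
    """Pair pseudo VC features with the natural waveform they came from."""
    pseudo = pseudo_vc_path(natural_feats, nat_to_syn, syn_to_nat)
    # Assumption: frame-wise conversion, so the frame count is preserved
    # and the features remain temporally matched to the natural waveform.
    if len(pseudo) != len(natural_feats):
        raise ValueError("pseudo VC features lost temporal alignment")
    return pseudo, natural_wave


# Hypothetical toy converters, used only to make the sketch runnable.
def toy_nat_to_syn(feats: Features) -> Features:
    return [[0.9 * x for x in frame] for frame in feats]


def toy_syn_to_nat(feats: Features) -> Features:
    return [[x / 0.9 for x in frame] for frame in feats]
```

At training time the NPF would see the output of `make_npf_training_pair`; at test time it would see `enhanced_path` applied to features extracted from TTS speech, so both inputs come out of the same synthetic -> natural converter.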
Demo Sound
- Testing with a DNN-based low-cost TTS
Vocoder | Female | Male |
---|---|---|
Natural | ||
DNN | ||
WN (UB *1) | ||
WN (AM *2) | ||
WN (TM *3) | ||
WN (NPF *4) | ||
PWG (UB *1) | ||
PWG (AM *2) | ||
PWG (TM *3) | ||
PWG (NPF *4) |
*1. UB: upper bound (natural features)
*2. AM: acoustic mismatch
*3. TM: temporal mismatch
*4. NPF: neural-post-filter
WN: WaveNet vocoder
PWG: Parallel WaveGAN vocoder
- Testing with an HMM-based low-cost TTS
Vocoder | Female | Male |
---|---|---|
Natural | ||
HMM | ||
WN (UB *1) | ||
WN (AM *2) | ||
WN (TM *3) | ||
WN (NPF *4) | ||
PWG (UB *1) | ||
PWG (AM *2) | ||
PWG (TM *3) | ||
PWG (NPF *4) |
Subjective Results
- NPF performance: MOS evaluation of speech quality
- WN w/ NPF outperforms original low-cost TTS systems
- PWG w/ NPF is comparable to WN w/ NPF
- Mismatch refinement: preference evaluation of speech quality
- WN w/ NPF outperforms WN w/ AM or TM
- PWG w/ NPF is comparable to/better than PWG w/ AM or TM
Relative distances on MCD-plane
- DNN-based low-cost TTS
- HMM-based low-cost TTS
- MCD of synthetic to natural is high -> natural and synthetic features are very different
- MCD of enhanced to pseudo VC is much lower -> pseudo VC and enhanced features are similar
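For reference, mel-cepstral distortion (MCD) between two time-aligned mel-cepstrum sequences is commonly computed per frame as (10 / ln 10) * sqrt(2 * sum of squared coefficient differences) and then averaged over frames. The sketch below follows that common definition; it is not the evaluation script behind the figures on this page. `mcd_db` is a hypothetical name, the sequences are assumed to already have the same number of frames, and skipping the 0th (energy) coefficient is a widespread convention that is assumed here rather than taken from the page.

```python
import math
from typing import Sequence


def mcd_db(ref: Sequence[Sequence[float]], est: Sequence[Sequence[float]],
           skip_c0: bool = True) -> float:
    """Average mel-cepstral distortion in dB over time-aligned frames."""
    if len(ref) != len(est) or len(ref) == 0:
        raise ValueError("sequences must be non-empty and equally long")
    start = 1 if skip_c0 else 0          # assumption: drop the energy term
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for r, e in zip(ref, est):
        sq_diff = sum((a - b) ** 2 for a, b in zip(r[start:], e[start:]))
        total += scale * math.sqrt(2.0 * sq_diff)
    return total / len(ref)
```

Under this measure, a lower value between the enhanced and pseudo VC features than between the synthetic and natural features is what indicates that the NPF sees acoustically similar inputs in training and testing.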