This page is the demo of
- “A cyclical post-filtering approach to mismatch refinement of neural vocoder for text-to-speech systems” [paper] [highlight] [YouTube]
 
Abstract
Building an advanced TTS system from scratch is time- and resource-consuming. Therefore, we propose an economical post-filtering approach for existing low-cost TTS systems. However, this approach usually suffers from two issues: 1) temporal mismatches between TTS and natural waveforms and 2) acoustic mismatches between training and testing data. To address these issues, we adopt a cyclic voice conversion (Cycle-VC) model to generate temporally matched pseudo-VC data for training and acoustically matched enhanced data for testing the proposed neural-post-filter (NPF). Because of its generality, this framework can be applied to arbitrary TTS systems.
Acoustic mismatch
- Training the NPF with natural features and waveforms
- Testing the NPF with the synthetic features extracted from TTS-speech
 
Temporal mismatch
- Training the NPF with the synthetic features and natural waveforms
 
Cycle-VC
- Source: synthetic features
- Target: natural features
- Enhanced path: synthetic -> natural conversion
- Pseudo VC path: natural -> synthetic -> natural conversion
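
The two Cycle-VC paths listed above can be read as compositions of two conversion functions. The following is a minimal sketch, not the authors' implementation: `syn_to_nat` and `nat_to_syn` are hypothetical stand-ins for the two trained conversion directions, and the toy converters assume frame-wise conversion, so the number of frames is preserved.

```python
from typing import Callable, List

Frames = List[List[float]]            # one inner list per acoustic feature frame
Converter = Callable[[Frames], Frames]


def enhanced_path(synthetic: Frames, syn_to_nat: Converter) -> Frames:
    """Enhanced path: synthetic -> natural conversion."""
    return syn_to_nat(synthetic)


def pseudo_vc_path(natural: Frames, nat_to_syn: Converter,
                   syn_to_nat: Converter) -> Frames:
    """Pseudo VC path: natural -> synthetic -> natural conversion."""
    return syn_to_nat(nat_to_syn(natural))


# Toy frame-wise converters (hypothetical; a real Cycle-VC model is learned).
def toy_nat_to_syn(frames: Frames) -> Frames:
    return [[0.5 * x for x in frame] for frame in frames]


def toy_syn_to_nat(frames: Frames) -> Frames:
    return [[2.0 * x + 0.1 for x in frame] for frame in frames]


natural = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pseudo_vc = pseudo_vc_path(natural, toy_nat_to_syn, toy_syn_to_nat)

# The frame count is unchanged, so the pseudo VC features stay aligned with
# the natural waveform that the natural features were extracted from.
assert len(pseudo_vc) == len(natural)
```

Because the pseudo VC features start from natural features, they inherit the timing of the natural utterance, while passing through the same final synthetic -> natural conversion as the enhanced features.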
 
NPF training
- The temporal structures of the pseudo VC features and natural waveforms are matched
 
NPF testing
- The acoustic characteristics of the pseudo VC and enhanced features are similar
 
Demo Sound
- Testing with a DNN-based low-cost TTS
 
| Vocoder | Female | Male |
|---|---|---|
| Natural | | |
| DNN | | |
| WN (UB *1) | | |
| WN (AM *2) | | |
| WN (TM *3) | | |
| WN (NPF *4) | | |
| PWG (UB *1) | | |
| PWG (AM *2) | | |
| PWG (TM *3) | | |
| PWG (NPF *4) | | |

*1. UB: upper bound (natural features)
*2. AM: acoustic mismatch
*3. TM: temporal mismatch
*4. NPF: neural-post-filter
WN: WaveNet vocoder
PWG: Parallel WaveGAN vocoder
- Testing with an HMM-based low-cost TTS
 
| Vocoder | Female | Male |
|---|---|---|
| Natural | | |
| HMM | | |
| WN (UB *1) | | |
| WN (AM *2) | | |
| WN (TM *3) | | |
| WN (NPF *4) | | |
| PWG (UB *1) | | |
| PWG (AM *2) | | |
| PWG (TM *3) | | |
| PWG (NPF *4) | | |

Subjective Results
- NPF performance: MOS evaluation of speech quality
  - WN w/ NPF outperforms original low-cost TTS systems
  - PWG w/ NPF is comparable to WN w/ NPF
- Mismatch refinement: preference evaluation of speech quality
  - WN w/ NPF outperforms WN w/ AM or TM
  - PWG w/ NPF is comparable to or better than PWG w/ AM or TM
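
A MOS is the mean of listener ratings, usually on a 1 to 5 scale. As a rough illustration of how such scores are commonly summarized, the sketch below computes the mean and a normal-approximation 95% confidence interval. The ratings are made-up placeholders, not results from this evaluation, and the interval method is an assumption; this page does not state how the scores were aggregated.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of an approximate 95% interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = float(statistics.mean(ratings))
    # Sample standard deviation; z = 1.96 assumes a normal approximation.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical 1-5 ratings for a single system (illustrative only).
example_ratings = [4, 3, 5, 4, 4, 3, 5, 4]
mos, ci = mos_with_ci(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Two systems are usually described as comparable when their intervals overlap substantially, which is one way to read the "comparable to" statements above.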

 
Relative distances on MCD-plane
- DNN-based low-cost TTS
- HMM-based low-cost TTS
- MCD (mel-cepstral distortion) between synthetic and natural features is high -> natural and synthetic features are very different
- MCD between enhanced and pseudo VC features is much lower -> pseudo VC and enhanced features are similar
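
For readers unfamiliar with the metric, the sketch below computes mel-cepstral distortion between two time-aligned mel-cepstral sequences using its common definition. Excluding the 0th (energy) coefficient and requiring equal-length, pre-aligned sequences are assumptions made here; this page does not give the exact configuration used for the plots.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(mcep_a: Sequence[Sequence[float]],
                            mcep_b: Sequence[Sequence[float]],
                            skip_c0: bool = True) -> float:
    """Average MCD in dB over frames of two aligned mel-cepstral sequences.

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (a_d - b_d)^2).
    """
    if len(mcep_a) != len(mcep_b) or len(mcep_a) == 0:
        raise ValueError("sequences must be non-empty and have equal length")
    start = 1 if skip_c0 else 0          # assumption: drop the energy term
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for frame_a, frame_b in zip(mcep_a, mcep_b):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same dimensionality")
        squared = sum((x - y) ** 2
                      for x, y in zip(frame_a[start:], frame_b[start:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(mcep_a)


# Toy two-frame example with made-up coefficients.
reference = [[0.0, 1.0, 0.5], [0.0, 0.8, 0.4]]
candidate = [[0.0, 0.9, 0.5], [0.0, 0.8, 0.3]]
print(mel_cepstral_distortion(reference, candidate))
```

A lower value means the two feature sequences are closer, which is the sense in which the enhanced and pseudo VC features are described as similar.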
 
page layout is modified from cayman-theme and cayman-blog. LICENSE