Cyclical Neural-Post-Filter

This page is the demo of

“A cyclical post-filtering approach to mismatch refinement of neural vocoder for text-to-speech systems” [paper] [highlight] [YouTube]

Abstract

Building an advanced TTS system from scratch is time- and resource-consuming. Therefore, we propose an economical post-filtering approach for existing low-cost TTS systems. However, this approach usually suffers from two issues: 1) temporal mismatches between TTS and natural waveforms and 2) acoustic mismatches between training and testing data. To address these issues, we adopt a cyclic voice conversion (Cycle-VC) model to generate temporally matched pseudo-VC data for training and acoustically matched enhanced data for testing the proposed neural-post-filter (NPF). Because of its generality, this framework can be applied to arbitrary TTS systems.

Acoustic mismatch

Training the NPF with natural features and waveforms
Testing the NPF with the synthetic features extracted from TTS-speech

Temporal mismatch

Training the NPF with the synthetic features and natural waveforms

Cycle-VC

Source: synthetic features
Target: natural features
Enhanced path: synthetic -> natural conversion
Pseudo VC path: natural -> synthetic -> natural conversion
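As a rough illustration of how the two paths are composed, the sketch below replaces the Cycle-VC model with two toy frame-wise linear maps. The functions `natural_to_synthetic` and `synthetic_to_natural`, the feature dimension, and the frame counts are all hypothetical stand-ins. Only the composition of the paths follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # hypothetical feature dimension

# Hypothetical stand-ins for the two Cycle-VC mappings (toy frame-wise linear maps).
W_N2S = rng.normal(size=(DIM, DIM))
W_S2N = rng.normal(size=(DIM, DIM))


def natural_to_synthetic(feats):
    """Toy natural -> synthetic conversion, applied frame by frame."""
    return feats @ W_N2S


def synthetic_to_natural(feats):
    """Toy synthetic -> natural conversion, applied frame by frame."""
    return feats @ W_S2N


def enhanced_path(synthetic_feats):
    """Enhanced path: synthetic -> natural."""
    return synthetic_to_natural(synthetic_feats)


def pseudo_vc_path(natural_feats):
    """Pseudo VC path: natural -> synthetic -> natural."""
    return synthetic_to_natural(natural_to_synthetic(natural_feats))


# Made-up inputs: the natural and TTS utterances have different frame counts.
natural = rng.normal(size=(120, DIM))
synthetic = rng.normal(size=(95, DIM))

pseudo_vc = pseudo_vc_path(natural)   # keeps the timing of the natural utterance
enhanced = enhanced_path(synthetic)   # keeps the timing of the TTS utterance
print(pseudo_vc.shape, enhanced.shape)
```

In this toy setting both outputs pass through the same final synthetic -> natural mapping, while each keeps the frame count of its own input.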

NPF training

The temporal structures of the pseudo VC features and natural waveforms are matched
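One way to picture this matching is a frame-to-sample check on a training pair. The sketch below is a simplified illustration, not the training code of the paper. The hop size, array shapes, and helper names are assumptions, and equal length is only the coarsest sign of temporal alignment.

```python
import numpy as np

HOP = 80  # hypothetical hop size in samples (e.g. 5 ms at 16 kHz)


def is_frame_aligned(features, waveform, hop=HOP):
    """Return True when each feature frame covers exactly one hop of samples."""
    return len(waveform) == features.shape[0] * hop


def frame_to_samples(index, hop=HOP):
    """Sample range [start, end) of the waveform covered by one feature frame."""
    return index * hop, (index + 1) * hop


# Made-up data: a natural waveform of 120 frames' worth of samples.
natural_wave = np.zeros(120 * HOP)
pseudo_vc_feats = np.zeros((120, 4))   # derived from the natural utterance
synthetic_feats = np.zeros((95, 4))    # TTS output with its own durations

print(is_frame_aligned(pseudo_vc_feats, natural_wave))  # True
print(is_frame_aligned(synthetic_feats, natural_wave))  # False
```

The second check shows the temporal-mismatch case: synthetic features paired directly with a natural waveform do not line up frame by frame.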

NPF testing

The acoustic characteristics of the pseudo VC and enhanced features are similar

Demo Sound

Testing with a DNN-based low-cost TTS

Vocoder	Female	Male
Natural
DNN
WN (UB^*1)
WN (AM^*2)
WN (TM^*3)
WN (NPF^*4)
PWG (UB^*1)
PWG (AM^*2)
PWG (TM^*3)
PWG (NPF^*4)

*1 UB: upper bound (natural features)
*2 AM: acoustic mismatch
*3 TM: temporal mismatch
*4 NPF: neural-post-filter
WN: WaveNet vocoder
PWG: Parallel WaveGAN vocoder

Testing with an HMM-based low-cost TTS

Vocoder	Female	Male
Natural
HMM
WN (UB^*1)
WN (AM^*2)
WN (TM^*3)
WN (NPF^*4)
PWG (UB^*1)
PWG (AM^*2)
PWG (TM^*3)
PWG (NPF^*4)

Subjective Results

NPF performance: MOS evaluation of speech quality
WN w/ NPF outperforms original low-cost TTS systems
PWG w/ NPF is comparable to WN w/ NPF
Mismatch refinement: preference evaluation of speech quality
WN w/ NPF outperforms WN w/ AM or TM
PWG w/ NPF is comparable to or better than PWG w/ AM or TM

Relative distances on the MCD plane

DNN-based low-cost TTS

HMM-based low-cost TTS

The MCD between synthetic and natural features is high -> natural and synthetic features are very different
The MCD between enhanced and pseudo VC features is much lower -> pseudo VC and enhanced features are similar
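For reference, mel-cepstral distortion (MCD) is commonly computed per frame as (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2) and then averaged over frames. The sketch below follows that common definition and is not necessarily the exact evaluation script behind these plots. It assumes the two sequences are already time-aligned and that the 0th (energy) coefficient is excluded.

```python
import numpy as np


def mel_cepstral_distortion(mcep_a, mcep_b):
    """Mean MCD in dB between two time-aligned mel-cepstrum sequences.

    Both inputs have shape (frames, dims). The 0th coefficient is dropped,
    which is a common convention and an assumption here.
    """
    a = np.asarray(mcep_a, dtype=float)[:, 1:]
    b = np.asarray(mcep_b, dtype=float)[:, 1:]
    if a.shape != b.shape:
        raise ValueError("sequences must be time-aligned and of equal shape")
    squared_dist = np.sum((a - b) ** 2, axis=1)          # per-frame squared distance
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * squared_dist)
    return float(np.mean(per_frame))


# Made-up example: two sequences that differ by 1.0 in a single coefficient.
ref = np.zeros((10, 5))
hyp = ref.copy()
hyp[:, 1] = 1.0
print(round(mel_cepstral_distortion(ref, hyp), 3))  # 6.142
```

When the two sequences have different lengths, as with synthetic and natural features, they would first need to be aligned (for example with dynamic time warping) before this function applies.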


page layout is modified from cayman-theme and cayman-blog. LICENSE