This page presents audio demos for the following papers:
- “Non-parallel voice conversion system with WaveNet vocoder and collapsed speech suppression” [paper]
- “Collapsed speech segment detection and suppression for WaveNet vocoder” [paper] [code]
- “The NU non-parallel voice conversion system for the voice conversion challenge 2018” [paper]
Abstract
We integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features, acoustic and temporal mismatches, and exposure bias usually lead to significant speech quality degradation, making WN generate some very noisy speech segments called collapsed speech. To tackle the problem, we take conventional-vocoder-generated speech as the reference speech to derive a linear predictive coding distribution constraint (LPCDC) to avoid the collapsed speech problem. Furthermore, to mitigate the negative effects introduced by the LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the LPCDC is only applied to the problematic segments to limit the loss of quality to short periods.
Testing corpus: VCC2018
Collapsed speech problem
- Type I: white-noise-like segment
- Type II: short impulse noise
Figure (a): WN-generated waveforms w/ collapsed speech.
Figure (b): WN-generated waveforms w/ LPCDC and CSSD.
Collapsed speech segment detection (CSSD)
- Waveform envelope extraction
- Comparison with the reference waveform envelope
Figure (a): WN-generated waveforms w/ collapsed speech.
Figure (b): WORLD-generated waveforms (reference).
Figure (c): Extracted waveform envelopes.
Figure (d): Difference in waveform envelope.
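The two CSSD steps above (envelope extraction, then comparison with the reference) can be illustrated with a minimal sketch. It is not the authors' implementation. The envelope definition (peak absolute amplitude per frame), the frame length, the threshold, and the function names `frame_envelope` and `detect_collapsed_frames` are all assumptions made for illustration.

```python
import numpy as np

def frame_envelope(wav, frame_len=400):
    """Coarse envelope: peak absolute amplitude of each non-overlapping frame.
    (Assumed definition; the page does not specify the extraction method.)"""
    n_frames = len(wav) // frame_len
    frames = np.abs(wav[: n_frames * frame_len]).reshape(n_frames, frame_len)
    return frames.max(axis=1)

def detect_collapsed_frames(wn_wav, ref_wav, frame_len=400, threshold=0.3):
    """Flag frames whose WN envelope exceeds the reference envelope
    by more than `threshold` (a hypothetical value)."""
    n = min(len(wn_wav), len(ref_wav))
    diff = frame_envelope(wn_wav[:n], frame_len) - frame_envelope(ref_wav[:n], frame_len)
    return diff > threshold

# Toy example: a clean 200 Hz tone stands in for the WORLD reference, and a
# white-noise-like burst (Type I) is injected into the "WN" waveform.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
ref = 0.2 * np.sin(2 * np.pi * 200 * t)
wn = ref.copy()
wn[4000:4800] = rng.uniform(-0.9, 0.9, size=800)

flags = detect_collapsed_frames(wn, ref)
print(np.flatnonzero(flags))  # indices of frames flagged as collapsed
```

In this toy case only the frames covering the injected burst are flagged. A one-sided difference is used because collapsed segments appear as excess energy relative to the reference.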
LPC distribution constraint (LPCDC)
- Constrain the WN-predicted probability mass function (PMF) with the LPC-derived PMF
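The page does not state the constraint formula, so the following is only a sketch of one plausible way to constrain a predicted PMF: multiply it elementwise by a discretized Gaussian centered on the LPC-predicted sample bin and renormalize. The Gaussian shape, `sigma`, the 256-bin quantization, and the names `lpc_pmf` and `constrain_pmf` are assumptions, not details taken from the papers.

```python
import numpy as np

def lpc_pmf(center_bin, n_bins=256, sigma=8.0):
    """Discretized Gaussian over quantization bins, centered on the bin of the
    LPC-predicted sample (assumed shape and width)."""
    bins = np.arange(n_bins)
    weights = np.exp(-0.5 * ((bins - center_bin) / sigma) ** 2)
    return weights / weights.sum()

def constrain_pmf(wn_pmf, lpc_center_bin, sigma=8.0):
    """Elementwise product of the WN PMF and the LPC PMF, renormalized to sum to 1."""
    mask = lpc_pmf(lpc_center_bin, len(wn_pmf), sigma)
    combined = wn_pmf * mask
    total = combined.sum()
    if total <= 0.0:
        # Degenerate case: fall back to the LPC PMF alone.
        return mask
    return combined / total

# A flat (uninformative) WN prediction is pulled toward the LPC-predicted bin.
flat_wn_pmf = np.full(256, 1.0 / 256)
constrained = constrain_pmf(flat_wn_pmf, lpc_center_bin=100)
print(constrained.argmax(), constrained.sum())
```

A smaller `sigma` ties the output more tightly to the reference prediction, which suppresses noise but also discards more of what the WN predicted. That trade-off is why the abstract applies the LPCDC only to problematic segments.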
WN vocoder with CSSD and LPCDC
- Flowchart of the proposed WN vocoder with collapsed speech suppression
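As a rough sketch of the control flow described in the abstract (generate with the WN vocoder, detect collapsed segments against the conventional-vocoder reference, and apply the LPCDC only where needed), the following uses hypothetical callables. `wn_generate`, `world_generate`, and `detect` are placeholders, not real APIs. Splicing constrained frames into the unconstrained waveform is a simplification and may differ from how the actual autoregressive system regenerates segments.

```python
import numpy as np

def suppress_collapsed_speech(features, wn_generate, world_generate, detect,
                              frame_len=400):
    """Return a waveform in which frames flagged by the detector are replaced
    by LPCDC-constrained output.

    wn_generate(features, reference): hypothetical WN vocoder call; passing a
        reference waveform enables the LPCDC, passing None disables it.
    world_generate(features): hypothetical conventional-vocoder call.
    detect(wav, ref): returns one boolean per frame of `frame_len` samples.
    """
    reference = world_generate(features)
    wav = wn_generate(features, None)           # unconstrained generation
    flags = detect(wav, reference)
    if not flags.any():
        return wav                              # collapsed-free: keep as is
    constrained = wn_generate(features, reference)  # generation with LPCDC
    out = wav.copy()
    for i in np.flatnonzero(flags):
        start, stop = i * frame_len, (i + 1) * frame_len
        out[start:stop] = constrained[start:stop]
    return out
```

Keeping the unconstrained output wherever no collapse is detected matches the stated goal of limiting the quality loss from the LPCDC to short periods.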
Speaker voice conversion (Non-parallel)
- Intra-gender conversion
Vocoder | Female (SF4->TF1) | Male (SM3->TM1) |
---|---|---|
Source | ||
Target | ||
Target + WN | ||
Collapsed-free | Collapsed-free | |
DNN + WN | ||
DMDN + WN | ||
DMDN + WORLD | ||
DMDN + LPCDC | ||
Collapsed | Collapsed | |
DNN + WN | ||
DMDN + WN | ||
DMDN + WORLD | ||
DMDN + LPCDC | ||
DMDN + LPCDC + CSSD | ||
- Inter-gender conversion
Vocoder | Female (SF3->TM2) | Male (SM4->TF2) |
---|---|---|
Source | ||
Target | ||
Target + WN | ||
Collapsed-free | Collapsed-free | |
DNN + WN | ||
DMDN + WN | ||
DMDN + WORLD | ||
DMDN + LPCDC | ||
Collapsed | Collapsed | |
DNN + WN | ||
DMDN + WN | ||
DMDN + WORLD | ||
DMDN + LPCDC | ||
DMDN + LPCDC + CSSD | ||
- Subjective evaluation I: MOS of naturalness
- Subjective evaluation II: Speaker similarity