WaveNet Vocoder with Collapsed Speech Suppression

This is the demo page for the following papers:

“Non-parallel voice conversion system with WaveNet vocoder and collapsed speech suppression” [paper]
“Collapsed speech segment detection and suppression for WaveNet vocoder” [paper] [code]
“The NU non-parallel voice conversion system for the voice conversion challenge 2018” [paper]

Abstract

We integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features, acoustic and temporal mismatches, and exposure bias usually lead to significant speech quality degradation, making WN generate some very noisy speech segments called collapsed speech. To tackle the problem, we take conventional-vocoder-generated speech as the reference speech to derive a linear predictive coding distribution constraint (LPCDC) to avoid the collapsed speech problem. Furthermore, to mitigate the negative effects introduced by the LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the LPCDC is only applied to the problematic segments to limit the loss of quality to short periods.

Testing corpus: VCC2018

Collapsed speech problem

Type I: white-noise-like segment
Type II: short impulse noise

Figure (a): WN-generated waveforms with collapsed speech.
Figure (b): WN-generated waveforms with LPCDC and CSSD.

Collapsed speech segment detection (CSSD)

Waveform envelope extraction

Comparison with the reference

Figure (a): WN-generated waveforms with collapsed speech.
Figure (b): WORLD-generated waveforms (reference).
Figure (c): Extracted waveform envelopes.
Figure (d): Difference in waveform envelope.
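
The detection idea in panels (a) to (d) can be sketched in a few lines: extract an envelope from the WN-generated waveform and from the reference waveform, then flag frames where the generated envelope exceeds the reference by a margin. This is only an illustrative sketch. The frame-wise peak envelope, the frame and hop sizes, the threshold, and the function names `frame_envelope` and `detect_collapsed_frames` are all assumptions, not the settings or code used in the papers.

```python
import numpy as np

def frame_envelope(wave, frame_len=400, hop=80):
    """Crude envelope: peak absolute amplitude of each frame (assumed method)."""
    wave = np.asarray(wave, dtype=np.float64)
    if len(wave) < frame_len:
        wave = np.pad(wave, (0, frame_len - len(wave)))
    n_frames = 1 + (len(wave) - frame_len) // hop
    return np.array([
        np.max(np.abs(wave[i * hop:i * hop + frame_len]))
        for i in range(n_frames)
    ])

def detect_collapsed_frames(generated, reference, threshold=0.2,
                            frame_len=400, hop=80):
    """Flag frames whose envelope exceeds the reference envelope by `threshold`."""
    n = min(len(generated), len(reference))
    env_gen = frame_envelope(generated[:n], frame_len, hop)
    env_ref = frame_envelope(reference[:n], frame_len, hop)
    return (env_gen - env_ref) > threshold  # hypothetical decision rule

# Toy data: a quiet sine as the "reference" and the same sine with a
# white-noise-like burst (similar to Type I above) as the "generated" waveform.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
ref = 0.1 * np.sin(2 * np.pi * 200 * t)
gen = ref.copy()
gen[6000:8000] += rng.uniform(-0.8, 0.8, 2000)

flags = detect_collapsed_frames(gen, ref)
flagged_frames = np.flatnonzero(flags)  # indices of frames marked as collapsed
```

In this toy case only the frames overlapping the noise burst are flagged, because the two envelopes are identical everywhere else.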

LPC distribution constraint (LPCDC)

Constrain the WN-predicted PMF with the LPC PMF
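
As a rough illustration of what constraining one probability mass function (PMF) with another can look like, the sketch below multiplies a WN-predicted PMF by an LPC-derived PMF and renormalizes. Everything specific here is an assumption made for the example: the 256-level linear amplitude grid, the Gaussian shape and width of the LPC PMF, the product-and-renormalize rule, and the names `lpc_pmf` and `constrain_pmf`. The papers listed above give the actual formulation.

```python
import numpy as np

def lpc_pmf(prediction, sigma, levels):
    """Discretized Gaussian over the levels, centred on the LPC prediction (assumed form)."""
    logits = -0.5 * ((levels - prediction) / sigma) ** 2
    pmf = np.exp(logits - logits.max())
    return pmf / pmf.sum()

def constrain_pmf(wn_pmf, lpc, eps=1e-12):
    """Element-wise product of the two PMFs, renormalized to sum to one."""
    combined = wn_pmf * lpc + eps  # eps avoids an all-zero product
    return combined / combined.sum()

levels = np.linspace(-1.0, 1.0, 256)  # assumed 8-bit amplitude grid
wn_pmf = np.full(256, 1.0 / 256)      # toy near-uniform WN prediction
lpc = lpc_pmf(prediction=0.25, sigma=0.05, levels=levels)
constrained = constrain_pmf(wn_pmf, lpc)

# The constrained PMF now concentrates its mass around the LPC prediction.
most_likely = levels[constrained.argmax()]
```

A narrower `sigma` makes the constraint stronger, which suppresses noisy samples more aggressively but also ties the output more tightly to the reference. This is consistent with the abstract's point that the constraint has a quality cost and should only be applied where needed.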

WN vocoder with CSSD and LPCDC

The flowchart of the proposed WN vocoder with the collapsed speech suppression
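
One plausible reading of the flowchart is a generate, detect, regenerate loop: synthesize a segment without the constraint, run the detector on it, and regenerate with the LPCDC enabled only if the segment is flagged. The sketch below shows just that control flow. The callables `generate` and `detect`, the fixed segment length, and the toy stand-ins are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def suppress_collapsed(generate, detect, n_samples, segment_len=4000):
    """Generate segment by segment; redo flagged segments with the constraint on."""
    out = np.zeros(n_samples)
    redone = []
    for start in range(0, n_samples, segment_len):
        end = min(start + segment_len, n_samples)
        segment = generate(start, end, constrained=False)
        if detect(segment, start, end):
            # Only problematic segments pay the quality cost of the constraint.
            segment = generate(start, end, constrained=True)
            redone.append((start, end))
        out[start:end] = segment
    return out, redone

# Toy stand-ins for the vocoder and the detector (purely illustrative).
def toy_generate(start, end, constrained):
    t = np.arange(start, end)
    if start == 4000 and not constrained:
        return np.full(end - start, 0.9)  # stand-in for a collapsed segment
    return 0.1 * np.sin(2 * np.pi * 200 * t / 16000)

def toy_detect(segment, start, end):
    return np.max(np.abs(segment)) > 0.5

wave, redone = suppress_collapsed(toy_generate, toy_detect, n_samples=12000)
```

In this toy run only the middle segment is regenerated, so the other segments keep the unconstrained output.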

Speaker voice conversion (Non-parallel)

Intra-gender conversion

Vocoder	Female (SF4->TF1)	Male (SM3->TM1)
Source
Target
Target + WN
	Collapsed-free	Collapsed-free
DNN + WN
DMDN + WN
DMDN + WORLD
DMDN + LPCDC
	Collapsed	Collapsed
DNN + WN
DMDN + WN
DMDN + WORLD
DMDN + LPCDC
DMDN + LPCDC + CSSD

Inter-gender conversion

Vocoder	Female (SF3->TM2)	Male (SM4->TF2)
Source
Target
Target + WN
	Collapsed-free	Collapsed-free
DNN + WN
DMDN + WN
DMDN + WORLD
DMDN + LPCDC
	Collapsed	Collapsed
DNN + WN
DMDN + WN
DMDN + WORLD
DMDN + LPCDC
DMDN + LPCDC + CSSD

Subjective evaluation I: MOS of naturalness

Subjective evaluation II: Speaker similarity
