Quasi-Periodic WaveNet

This page is the demo for the following papers:

“Quasi-Periodic WaveNet: an autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network” [paper] [code] [YouTube] [Medium]
“Quasi-Periodic WaveNet vocoder: a pitch dependent dilated convolution model for parametric speech generation” [paper]
“Statistical voice conversion with Quasi-Periodic WaveNet vocoder” [paper]

Abstract

We propose a WaveNet-like quasi-periodic audio waveform generation model (QPNet) with a novel network architecture named pitch-dependent dilated convolution neural network (PDCNN) to improve the pitch controllability of WaveNet (WN). The effectiveness of WN as a vocoder for generating high-fidelity speech samples from given acoustic features has been demonstrated. However, because of its fixed dilated convolution neural network (DCNN) and generic network architecture, the WN vocoder has difficulty generating speech when the given fundamental frequency (F₀) values lie outside the F₀ range observed in the training data. To address this limitation, we propose the QPNet vocoder with the PDCNN component and a cascade network structure to respectively model the long- and short-term correlations of speech samples. Specifically, PDCNN is a variant of DCNN whose time-variant adaptive dilation size is related to the given F₀ values. QPNet cascades the adaptive (with PDCNNs) and fixed (with DCNNs) macroblocks to respectively model the periodicity and local correlations of speech signals.

This demo page includes pitch transformation, voice conversion, and sinusoid generation results.

Testing corpus: VCC2018

Architecture of QPNet vocoder

Pitch-dependent dilated convolution
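
The idea of a pitch-dependent dilation can be sketched in a few lines of Python. This is a minimal sketch, not the released implementation: it assumes that each layer's base dilation is multiplied by a factor E_t = F_s / (F₀,t × a), where F_s is the sampling rate and a is the dense factor. The formula, the rounding, the fallback for unvoiced frames, the default values, and the function name are all assumptions made for illustration.

```python
def pitch_dependent_dilation(f0_hz, base_dilation, sample_rate=22050, dense_factor=8):
    """Return an effective dilation (in samples) for one time step.

    Assumed rule (not taken from this page):
        E_t = sample_rate / (f0_hz * dense_factor)
        effective dilation = round(E_t * base_dilation), at least 1
    Unvoiced frames (f0_hz <= 0) fall back to the fixed base dilation,
    which is also an assumption.
    """
    if f0_hz <= 0.0:
        return base_dilation
    e_t = sample_rate / (f0_hz * dense_factor)
    return max(1, round(e_t * base_dilation))


# A lower F0 means a longer pitch period, so the dilation grows.
for f0 in (50.0, 100.0, 200.0):
    print(f0, pitch_dependent_dilation(f0, base_dilation=1, sample_rate=16000))
```

Under this assumed rule, a fixed DCNN corresponds to E_t = 1 for every sample, while the adaptive version stretches or shrinks the receptive field with the pitch period.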

Pitch transformation

To show the pitch controllability of the proposed QPNet vocoder, pitch-transformed samples generated by different vocoders are provided. The input F₀ feature of each vocoder was scaled by 1, 1/2, and 3/2 while keeping the other acoustic features the same as those of natural speech.
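
The F₀ scaling itself is simple to sketch. The snippet below is an illustration only: it assumes the F₀ contour is a list of per-frame values in Hz in which unvoiced frames are marked with 0 (a common convention, but an assumption here), and the function name `scale_f0` is hypothetical.

```python
def scale_f0(f0_contour, ratio):
    """Scale voiced F0 values by `ratio`; unvoiced frames (assumed 0.0) stay 0.0."""
    return [f0 * ratio if f0 > 0.0 else 0.0 for f0 in f0_contour]


# Hypothetical contour: unvoiced, two voiced frames, unvoiced.
contour = [0.0, 200.0, 210.0, 0.0]
for ratio in (1.0, 0.5, 1.5):
    print(ratio, scale_f0(contour, ratio))
```

All other acoustic features would be passed to the vocoder unchanged, so any difference between the samples comes from the F₀ input alone.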

Conditioned on unchanged F₀

Vocoder	Female (SF3)	Male (SM3)
Natural
WORLD
WNf (*1)
WNc (*2)
QPNet (*3)
rQPNet (*4)
full QPNet
full rQPNet

*1 WNf: WaveNet vocoder with full size (30 layers)
*2 WNc: WaveNet vocoder with compact size (16 layers)
*3 QPNet: QPNet vocoder with fixed-adaptive order (16 layers)
*4 rQPNet: QPNet vocoder with reversed adaptive-fixed order (16 layers)
** full: full-size (r)QPNet vocoder (34 layers)

Conditioned on 1/2 F₀

Vocoder	Female (SF3)	Male (SM3)
WORLD
WNf
WNc
QPNet
rQPNet
full QPNet
full rQPNet

Conditioned on 3/2 F₀

Vocoder	Female (SF3)	Male (SM3)
WORLD
WNf
WNc
QPNet
rQPNet
full QPNet
full rQPNet

Subjective results

Speaker voice conversion (VC)

The effectiveness of the proposed QPNet vocoder was also evaluated with our NU non-parallel VC system submitted to VCC2018.

Intra gender conversion

Vocoder	Female (SF3->TF1)	Male (SM3->TM1)
Source
Target
WORLD
SI-WNf
SI-WNc
SI-QPNet
SDo-WNf
SDo-WNc
SDo-QPNet
SDa-WNf
SDa-WNc
SDa-QPNet

** SI: speaker-independent vocoder
** SDo: speaker-dependent vocoder (only the output layers of WN are updated)
** SDa: speaker-dependent vocoder (the whole WN network is updated)

Inter gender conversion

Vocoder	Female to male (SF3->TM1)	Male to female (SM3->TF1)
Source
Target
WORLD
SI-WNf
SI-WNc
SI-QPNet
SDo-WNf
SDo-WNc
SDo-QPNet
SDa-WNf
SDa-WNc
SDa-QPNet

Subjective results

Single tone sinusoid generation

To evaluate the effectiveness of the proposed PDCNN, a simple periodic sinusoid generation test was conducted. The training data of QPNet were 80-400 Hz sinusoids and the corresponding F₀ values. In the test phase, QPNet was conditioned on an F₀ value and a short segment of the corresponding sine wave, which filled the initial receptive field, to generate sinusoids. The test data were divided into 10-40 Hz (under 1/2L), 50-80 Hz (above 1/2L), 100-400 Hz (inside), 450-600 Hz (under 3/2U), and 650-800 Hz (above 3/2U) subsets, where L and U are the lower and upper bounds of the training F₀ range. Moreover, because this scenario only requires generating a simple periodic signal, a QPNet model with a pure PDCNN structure (pQPNet) was adopted.
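
As a rough illustration of this setup, the sketch below encodes the five test subsets listed above and generates a single-tone sinusoid for a given F₀. The subset ranges come from the text; the sampling rate, the inclusive bounds, and the names `SUBSETS`, `subset_of`, and `sine_wave` are assumptions for illustration only.

```python
import math

# Test subsets as listed in the text: (label, low Hz, high Hz), bounds assumed inclusive.
SUBSETS = [
    ("under 1/2L", 10, 40),
    ("above 1/2L", 50, 80),
    ("inside", 100, 400),
    ("under 3/2U", 450, 600),
    ("above 3/2U", 650, 800),
]


def subset_of(f0_hz):
    """Return the label of the subset containing f0_hz, or None if it is in none."""
    for label, low, high in SUBSETS:
        if low <= f0_hz <= high:
            return label
    return None


def sine_wave(f0_hz, n_samples, sample_rate=16000):
    """Generate n_samples of a unit-amplitude sinusoid (sample_rate is an assumed value)."""
    return [math.sin(2.0 * math.pi * f0_hz * n / sample_rate) for n in range(n_samples)]


for f0 in (20, 60, 300, 500, 700):
    print(f0, subset_of(f0), len(sine_wave(f0, 160)))
```

A short segment produced by `sine_wave` plays the role of the seed that fills the initial receptive field, with the matching F₀ value as the conditioning input.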

Dense factor comparison
First, pQPNet models with different dense factors were evaluated.

Test F₀	pQPNet (Dense: 8)	pQPNet (Dense: 1)	pQPNet (Dense: 64)
20 Hz (Under 1/2L)
60 Hz (Above 1/2L)
300 Hz (Inside)
500 Hz (Under 3/2U)
700 Hz (Above 3/2U)