This page is the demo for the following papers:

  1. “Quasi-Periodic WaveNet: an autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network” [paper] [code] [YouTube] [Medium]
  2. “Quasi-Periodic WaveNet vocoder: a pitch dependent dilated convolution model for parametric speech generation” [paper]
  3. “Statistical voice conversion with Quasi-Periodic WaveNet vocoder” [paper]

Abstract

We propose a WaveNet-like quasi-periodic audio waveform generation model (QPNet) with a novel network architecture, the pitch-dependent dilated convolution neural network (PDCNN), to improve the pitch controllability of WaveNet (WN). WN has been shown to be an effective vocoder for generating high-fidelity speech samples from given acoustic features. However, because of its fixed dilated convolution neural network (DCNN) and generic network architecture, the WN vocoder has difficulty generating speech when the given fundamental frequency (F0) values fall outside the F0 range observed in the training data. To address this limitation, we propose the QPNet vocoder, which uses the PDCNN component and a cascade network structure to model the long- and short-term correlations of speech samples, respectively. Specifically, PDCNN is a variant of DCNN whose dilation size is time-variant and adapts to the given F0 values. QPNet cascades adaptive macroblocks (with PDCNNs) and fixed macroblocks (with DCNNs) to model the periodicity and the local correlations of speech signals, respectively.
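The F0-dependent dilation described in the abstract can be illustrated with a small sketch. It assumes that the dilation of each time step is the fixed dilation multiplied by a factor E_t = fs / (F0_t × a), where fs is the sampling rate and a is the dense factor (the "Dense" value that appears in the sinusoid tables further down this page). The function name, the fallback for unvoiced steps, and the rounding are illustrative choices, not taken from the released code.

```python
def pitch_dependent_dilations(f0, fs, dense_factor, base_dilation=1):
    """Return one dilation size (in samples) per time step.

    Assumed rule: dilation_t = round(E_t * base_dilation),
    with E_t = fs / (f0_t * dense_factor).
    """
    dilations = []
    for f0_t in f0:
        if f0_t <= 0:
            # Assumption: steps without a valid F0 fall back to the
            # fixed dilation, i.e. an ordinary dilated convolution.
            dilations.append(base_dilation)
            continue
        e_t = fs / (f0_t * dense_factor)
        # Never let the dilation collapse below one sample.
        dilations.append(max(1, round(e_t * base_dilation)))
    return dilations


# Illustrative values only: a 16 kHz signal and a dense factor of 8.
print(pitch_dependent_dilations([200.0, 400.0, 0.0], fs=16000, dense_factor=8))
```

Under this rule a higher F0 gives a smaller dilation, so the layer spans roughly the same number of pitch cycles at every pitch instead of a fixed number of samples.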

This demo page includes pitch transformation, voice conversion, and sinusoid generation results.

Testing corpus: VCC2018

Architecture of QPNet vocoder

Pitch-dependent dilated convolution

Pitch transformation

To show the pitch controllability of the proposed QPNet vocoder, pitch-transformed samples generated by different vocoders are provided. The input F0 feature of each vocoder was scaled by 1, 1/2, and 2 while the other acoustic features were kept the same as those of natural speech.
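The F0 scaling used for these samples can be sketched as follows. The sketch assumes a frame-level F0 contour in Hz in which unvoiced frames are marked with 0, as in WORLD-style analysis; the function name and the example contour are hypothetical.

```python
def scale_f0(f0, ratio):
    """Scale the voiced frames of an F0 contour and leave unvoiced frames at 0."""
    return [value * ratio if value > 0 else 0.0 for value in f0]


# Hypothetical contour: one unvoiced frame followed by two voiced frames.
contour = [0.0, 100.0, 220.0]
for ratio in (1.0, 0.5, 2.0):
    print(ratio, scale_f0(contour, ratio))
```

Only the F0 stream changes; the spectral and aperiodicity features passed to the vocoder stay untouched.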

Vocoder Female (SF3) Male (SM3)
Natural
WORLD
WNf *1
WNc *2
QPNet *3
rQPNet *4
full QPNet
full rQPNet

*1. WNf: full-size WaveNet vocoder (30 layers)
*2. WNc: compact-size WaveNet vocoder (16 layers)
*3. QPNet: QPNet vocoder with fixed-adaptive order (16 layers)
*4. rQPNet: QPNet vocoder with reversed adaptive-fixed order (16 layers)
**. full: full-size (r)QPNet vocoder (34 layers)

Vocoder Female (SF3) Male (SM3)
WORLD
WNf
WNc
QPNet
rQPNet
full QPNet
full rQPNet


Vocoder Female (SF3) Male (SM3)
WORLD
WNf
WNc
QPNet
rQPNet
full QPNet
full rQPNet


Speaker voice conversion (VC)

The effectiveness of the proposed QPNet vocoder was also evaluated with our NU non-parallel VC system submitted to VCC2018.

Vocoder Female (SF3->TF1) Male (SM3->TM1)
Source
Target
WORLD
SI-WNf
SI-WNc
SI-QPNet
SDo-WNf
SDo-WNc
SDo-QPNet
SDa-WNf
SDa-WNc
SDa-QPNet

**. SI: speaker-independent vocoder
**. SDo: speaker-dependent vocoder (only the output layers of WN are updated)
**. SDa: speaker-dependent vocoder (the whole WN is updated)

Vocoder Female to male (SF3->TM1) Male to female (SM3->TF1)
Source
Target
WORLD
SI-WNf
SI-WNc
SI-QPNet
SDo-WNf
SDo-WNc
SDo-QPNet
SDa-WNf
SDa-WNc
SDa-QPNet


Single tone sinusoid generation

To evaluate the effectiveness of the proposed PDCNN, an evaluation of simple periodic sinusoid generation was conducted. The training data of QPNet were 80-400 Hz sinusoids and the corresponding F0 values. In the test phase, QPNet was conditioned on an F0 value and a short segment of the corresponding sine wave, which filled the initial receptive field, to generate sinusoids. The test data were divided into 10-40 Hz (under 1/2L), 50-80 Hz (above 1/2L), 100-400 Hz (inside), 450-600 Hz (under 3/2U), and 650-800 Hz (above 3/2U) subsets, where L is the lower bound and U is the upper bound of the training F0 range. Moreover, because this scenario involves only simple periodic signals, the QPNet model with the pure PDCNN structure (pQPNet) was adopted.
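A minimal sketch of how such single-tone data could be produced is given below. The function name, the sample-level F0 sequence, the duration, and the sampling rate are assumptions for illustration; the actual data preparation may differ.

```python
import math


def make_sinusoid(f0_hz, duration_s, fs):
    """Return (waveform, f0_sequence) for a single-tone sinusoid.

    The F0 sequence repeats the tone frequency once per sample so that it
    can serve as the conditioning feature (an assumed layout).
    """
    n_samples = int(round(duration_s * fs))
    wave = [math.sin(2.0 * math.pi * f0_hz * n / fs) for n in range(n_samples)]
    f0 = [float(f0_hz)] * n_samples
    return wave, f0


# The five test tones listed in the tables on this page, one per subset.
TEST_F0S_HZ = [20, 60, 300, 500, 700]

# Illustrative sampling rate and duration, not values stated on this page.
for tone in TEST_F0S_HZ:
    wave, f0 = make_sinusoid(tone, duration_s=0.5, fs=16000)
    print(tone, len(wave))
```

Tones at 20 Hz and 700 Hz lie well outside the 80-400 Hz training range, which is what makes them a test of whether the network follows the conditioning F0 rather than reproducing training-range periods.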

pQPNet (Dense: 8) pQPNet (Dense: 1) pQPNet (Dense: 64)
20 Hz (Under 1/2L)
60 Hz (Above 1/2L)
300 Hz (Inside)
500 Hz (Under 3/2U)
700 Hz (Above 3/2U)


pQPNet (Dense: 8) WNc WNf
20 Hz (Under 1/2L)
60 Hz (Above 1/2L)
300 Hz (Inside)
500 Hz (Under 3/2U)
700 Hz (Above 3/2U)

