This page is the demo for the following papers:
- “Quasi-Periodic WaveNet: an autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network” [paper] [code] [YouTube] [Medium]
- “Quasi-Periodic WaveNet vocoder: a pitch dependent dilated convolution model for parametric speech generation” [paper]
- “Statistical voice conversion with Quasi-Periodic WaveNet vocoder” [paper]
Abstract
We propose a WaveNet-like quasi-periodic audio waveform generation model (QPNet) with a novel network architecture named pitch-dependent dilated convolution neural network (PDCNN) to improve the pitch controllability of WaveNet (WN). The effectiveness of WN as a vocoder for generating high-fidelity speech samples from given acoustic features has been demonstrated. However, because of its fixed dilated convolution neural network (DCNN) and generic network architecture, the WN vocoder has difficulty generating speech when the given fundamental frequency (F0) values lie outside the F0 range observed in the training data. To address this limitation, we propose the QPNet vocoder, which uses the PDCNN component and a cascade network structure to model the long-term and short-term correlations of speech samples, respectively. Specifically, PDCNN is a variant of DCNN whose dilation size is time-variant and adapts to the given F0 values. QPNet cascades adaptive macroblocks (with PDCNNs) and fixed macroblocks (with DCNNs) to model the periodicity and the local correlations of speech signals, respectively.
This demo page includes pitch transformation, voice conversion, and sinusoid generation results.
Testing corpus: VCC2018
Architecture of QPNet vocoder
Pitch-dependent dilated convolution
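As a rough illustration of a dilation size that follows F0, the sketch below derives one integer dilation per time step and then picks the past sample that a causal two-tap convolution would pair with the current one. The scaling rule `round(fs / (f0 * dense_factor))`, the fallback of 1 for unvoiced steps, and all function and parameter names are assumptions made for this example, not details stated on this page.

```python
def pitch_dependent_dilation(f0, fs, dense_factor, base_dilation=1):
    """Return one integer dilation per time step (assumed rule, see lead-in)."""
    dilations = []
    for f in f0:
        if f > 0:
            # Assumed rule: spread `dense_factor` taps over one pitch period.
            # Clamp to 1 so the dilation never collapses to 0 at high F0.
            scale = max(1, round(fs / (f * dense_factor)))
        else:
            scale = 1  # unvoiced step: assumed fallback to the fixed dilation
        dilations.append(scale * base_dilation)
    return dilations


def causal_taps(x, dilations):
    """Pair x[t] with x[t - dilations[t]], using 0.0 before the signal starts."""
    taps = []
    for t, d in enumerate(dilations):
        past = x[t - d] if t - d >= 0 else 0.0
        taps.append((past, x[t]))
    return taps


# Hypothetical values: 16 kHz sampling rate, dense factor of 8.
f0_track = [200.0, 200.0, 0.0, 100.0]
print(pitch_dependent_dilation(f0_track, fs=16000, dense_factor=8))
# values: 10, 10, 1, 20
```

Under this assumed rule, a lower F0 gives a larger dilation, so the distance between taps grows with the pitch period instead of staying fixed.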
Pitch transformation
To show the pitch controllability of the proposed QPNet vocoder, pitch-transformed samples generated by different vocoders are provided. The input F0 feature of each vocoder was scaled by 1, 1/2, and 3/2, while the other acoustic features were kept the same as those of natural speech.
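The F0 scaling described above can be sketched as follows. This is a minimal illustration, assuming unvoiced frames are marked with an F0 of 0 and that the acoustic features live in a plain dictionary; the keys `"f0"` and `"spectrum"` and both function names are hypothetical.

```python
def scale_f0(f0, ratio):
    """Scale voiced F0 values by `ratio`; leave unvoiced frames (0) untouched."""
    return [f * ratio if f > 0 else 0.0 for f in f0]


def transform_features(features, ratio):
    """Return a shallow copy of the feature dict with only the F0 track changed."""
    out = dict(features)
    out["f0"] = scale_f0(features["f0"], ratio)
    return out


# Hypothetical three-frame utterance: one unvoiced frame, two voiced frames.
natural = {"f0": [0.0, 100.0, 220.0], "spectrum": [[1.0], [2.0], [3.0]]}
halved = transform_features(natural, 0.5)
print(halved["f0"])  # [0.0, 50.0, 110.0]
```

Only the F0 track is replaced, so every other feature passed to the vocoder is identical to the one extracted from natural speech.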
- Conditioned on unchanged F0
Vocoder | Female (SF3) | Male (SM3) |
---|---|---|
Natural | ||
WORLD | ||
WNf *1 | ||
WNc *2 | ||
QPNet *3 | ||
rQPNet *4 | ||
full QPNet | ||
full rQPNet |
*1. WNf: full-size WaveNet vocoder (30 layers)
*2. WNc: compact-size WaveNet vocoder (16 layers)
*3. QPNet: QPNet vocoder with fixed-adaptive order (16 layers)
*4. rQPNet: QPNet vocoder with reversed adaptive-fixed order (16 layers)
**. full: full-size (r)QPNet vocoder (34 layers)
- Conditioned on 1/2 F0
Vocoder | Female (SF3) | Male (SM3) |
---|---|---|
WORLD | ||
WNf | ||
WNc | ||
QPNet | ||
rQPNet | ||
full QPNet | ||
full rQPNet |
- Conditioned on 3/2 F0
Vocoder | Female (SF3) | Male (SM3) |
---|---|---|
WORLD | ||
WNf | ||
WNc | ||
QPNet | ||
rQPNet | ||
full QPNet | ||
full rQPNet |
- Subjective results
Speaker voice conversion (VC)
The effectiveness of the proposed QPNet vocoder was also evaluated with our NU non-parallel VC system submitted to VCC2018.
- Intra gender conversion
Vocoder | Female (SF3->TF1) | Male (SM3->TM1) |
---|---|---|
Source | ||
Target | ||
WORLD | ||
SI-WNf | ||
SI-WNc | ||
SI-QPNet | ||
SDo-WNf | ||
SDo-WNc | ||
SDo-QPNet | ||
SDa-WNf | ||
SDa-WNc | ||
SDa-QPNet |
**. SI: speaker independent vocoder
**. SDo: speaker dependent vocoder (only update the output layers of WN)
**. SDa: speaker dependent vocoder (update the whole WN network)
- Inter gender conversion
Vocoder | Female to male (SF3->TM1) | Male to female (SM3->TF1) |
---|---|---|
Source | ||
Target | ||
WORLD | ||
SI-WNf | ||
SI-WNc | ||
SI-QPNet | ||
SDo-WNf | ||
SDo-WNc | ||
SDo-QPNet | ||
SDa-WNf | ||
SDa-WNc | ||
SDa-QPNet |
- Subjective results
Single tone sinusoid generation
To evaluate the effectiveness of the proposed PDCNN, an evaluation on simple periodic sinusoid generation was conducted. The training data of QPNet were 80-400 Hz sinusoids and the corresponding F0 values. In the test phase, QPNet was conditioned on an F0 value and a short segment of the corresponding sine wave, which filled the initial receptive field, to generate sinusoids. The test data were divided into 10-40 Hz (under 1/2L), 50-80 Hz (above 1/2L), 100-400 Hz (inside), 450-600 Hz (under 3/2U), and 650-800 Hz (above 3/2U) subsets, where L is the lower bound and U is the upper bound of the training F0 range. Moreover, because this is a simple periodic signal generation scenario, the QPNet model with the pure PDCNN structure (pQPNet) was adopted.
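The test setup above can be sketched as follows: generate a sine wave for a given F0 and look up which of the listed subsets that F0 falls in. The subset ranges are copied from the text; the sampling rate, amplitude, seed length, and function names are hypothetical choices for illustration.

```python
import math

# Test subsets as listed in the text (Hz, inclusive bounds).
SUBSETS = [
    (10, 40, "under 1/2L"),
    (50, 80, "above 1/2L"),
    (100, 400, "inside"),
    (450, 600, "under 3/2U"),
    (650, 800, "above 3/2U"),
]


def sine_wave(f0, fs, n_samples, amplitude=1.0):
    """Return n_samples of a sine wave at f0 Hz sampled at fs Hz."""
    return [amplitude * math.sin(2.0 * math.pi * f0 * n / fs)
            for n in range(n_samples)]


def subset(f0):
    """Return the name of the test subset containing f0, or None if none does."""
    for low, high, name in SUBSETS:
        if low <= f0 <= high:
            return name
    return None


fs = 22050                                   # assumed sampling rate
seed = sine_wave(300.0, fs, n_samples=512)   # assumed seed length for the receptive field
for f0 in (20, 60, 300, 500, 700):
    print(f0, "Hz:", subset(f0))
```

In this sketch, `seed` stands in for the short sine segment used to fill the initial receptive field before generation starts.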
- Dense factor comparison
First, the pQPNet models with different dense factors were evaluated.
Test F0 | pQPNet (Dense: 8) | pQPNet (Dense: 1) | pQPNet (Dense: 64) |
---|---|---|---|
20 Hz (Under 1/2L) | ||
60 Hz (Above 1/2L) | ||
300 Hz (Inside) | ||
500 Hz (Under 3/2U) | ||
700 Hz (Above 3/2U) | ||
- Model comparison
Second, the pQPNet model with a dense factor of 8 was compared with the WNc and WNf models.
Test F0 | pQPNet (Dense: 8) | WNc | WNf |
---|---|---|---|
20 Hz (Under 1/2L) | ||
60 Hz (Above 1/2L) | ||
300 Hz (Inside) | ||
500 Hz (Under 3/2U) | ||
700 Hz (Above 3/2U) | ||