This page is the demo for:
- “Quasi-periodic parallel WaveGAN: a non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural networks” [paper] [code] [YouTube] [Medium]
- “Quasi-periodic parallel WaveGAN vocoder: a non-autoregressive pitch-dependent dilated convolution model for parametric speech generation” [paper] [highlight] [fullVideo]
Abstract
We propose a Quasi-Periodic Parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a compact GAN-based raw waveform generative model that generates speech much faster than real time because of its non-autoregressive (non-AR) and non-causal mechanisms. Although PWG achieves high-fidelity speech generation, its generic and simple network architecture lacks pitch controllability for unseen auxiliary pitches such as a scaled pitch. To improve the pitch and speech modeling capability, we apply a QP structure with PDCNNs to the generator of PWG, which introduces pitch information to the network by dynamically changing the network architecture according to the auxiliary pitches.
Corpus and references:
VCC2018
PWG
PWG_repo
QPNet
Architecture of PWG/QPPWG
Generator of QPPWG
Non-AR PDCNN
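As a rough illustration of the pitch-dependent dilation idea named in the abstract, the sketch below computes a per-sample dilation size from an auxiliary F0 value and applies a non-causal, kernel-size-3 convolution with it. It assumes the dilation factor takes the form E_t = F_s / (a × F0_t); the function names, the `dense_factor` parameter `a`, the scalar weights, and the fallback for unvoiced samples are all hypothetical choices for this sketch, not the released implementation.

```python
import numpy as np

def pitch_dependent_dilations(f0, fs, dense_factor, base_dilation=1):
    """Per-sample dilation sizes (in samples) for one adaptive layer.

    Assumed form: E_t = fs / (dense_factor * f0_t), d_t = round(E_t * base_dilation).
    Samples with f0 == 0 fall back to the fixed base dilation (an assumption).
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    e = np.ones_like(f0)
    e[voiced] = fs / (dense_factor * f0[voiced])
    return np.maximum(1, np.rint(e * base_dilation)).astype(int)

def pdcnn_layer(x, dilations, w_prev, w_cur, w_next):
    """Non-causal kernel-size-3 convolution whose dilation varies per sample."""
    x = np.asarray(x, dtype=float)
    dilations = np.asarray(dilations, dtype=int)
    t = np.arange(len(x))
    past = t - dilations
    future = t + dilations
    last = len(x) - 1
    # Taps that fall outside the signal are treated as zero padding.
    x_past = np.where(past >= 0, x[np.clip(past, 0, last)], 0.0)
    x_future = np.where(future <= last, x[np.clip(future, 0, last)], 0.0)
    return w_prev * x_past + w_cur * x + w_next * x_future
```

With a hypothetical `fs=24000` and `dense_factor=4`, an F0 of 200 Hz gives a dilation of 30 samples and 400 Hz gives 15, so doubling the pitch halves the gap between taps.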
Demo Sounds
- Conditioned on 1×F0
Vocoder | Female (SF3) | Male (SM3) |
---|---|---|
Natural | ||
WORLD *1 | ||
QPNet *2 | ||
PWG_30 *3 | ||
PWG_20 *4 | ||
QPPWG_20 *5 | ||
PWG_16 *6 | ||
QPPWG_16 *7 |
*1. WORLD: Baseline I
*2. QPNet: Baseline II
*3. PWG_30: PWG vocoder with 30 fixed blocks
*4. PWG_20: PWG vocoder with 20 fixed blocks
*5. QPPWG_20: QPPWG vocoder with 10 adaptive blocks + 10 fixed blocks
*6. PWG_16: PWG vocoder with 16 fixed blocks
*7. QPPWG_16: QPPWG vocoder with 8 adaptive blocks + 8 fixed blocks
- Conditioned on ½×F0
Vocoder | Female (SF3) | Male (SM3) |
---|---|---|
WORLD | ||
QPNet | ||
PWG_30 | ||
PWG_20 | ||
QPPWG_20 | ||
PWG_16 | ||
QPPWG_16 |
- Conditioned on 2×F0
Vocoder | Female (SF3) | Male (SM3) |
---|---|---|
WORLD | ||
QPNet | ||
PWG_30 | ||
PWG_20 | ||
QPPWG_20 | ||
PWG_16 | ||
QPPWG_16 |
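The ½×F0 and 2×F0 conditions can be pictured as scaling the auxiliary F0 contour by a constant ratio before it is passed to the vocoder. The sketch below is a minimal illustration under that assumption; the `scale_f0` helper and the convention that a value of 0 marks an unvoiced frame are hypothetical and not taken from the released code.

```python
import numpy as np

def scale_f0(f0, ratio):
    """Scale a frame-wise F0 contour by `ratio`.

    Frames with F0 == 0 are assumed to be unvoiced and are left at 0.
    """
    f0 = np.asarray(f0, dtype=float)
    return np.where(f0 > 0, f0 * ratio, 0.0)

# Hypothetical contour in Hz; 0 marks an unvoiced frame.
contour = [100.0, 0.0, 220.0]
half_f0 = scale_f0(contour, 0.5)
double_f0 = scale_f0(contour, 2.0)
```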
Subjective Results
- MOS results of speech quality
- XAB results of pitch accuracy
Visualized Intermediate Outputs
Because the waveform outputs of the PWG/QPPWG models are the cumulative results of the skip connections from the residual blocks, the speech modeling behavior of the residual blocks can be explored via the visualized intermediate outputs of partial residual blocks. The following table shows the spectrograms of the intermediate outputs of the cumulative residual blocks.
| PWG (PWG_20) | QPPWG (adaptive->fixed) | QPPWG (fixed->adaptive) |
---|---|---|---|
Blocks | 1-20: fixed blocks | 1-10: adaptive blocks; 11-20: fixed blocks | 1-10: fixed blocks; 11-20: adaptive blocks |
Conditioned on 1×F0 | |||
Conditioned on ½×F0 | |||
Conditioned on 2×F0 | |||
The results show that:
- PWG (PWG_20): the spectrograms contain more harmonic and non-harmonic details as the number of cumulative residual blocks increases.
- QPPWG (adaptive->fixed): the first ten adaptive blocks focus on modeling the harmonic components.
- QPPWG (fixed->adaptive): the first ten fixed blocks focus on modeling the non-harmonic components.
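The cumulative intermediate outputs described above can be sketched as a running sum of per-block skip outputs. This is a simplified illustration: `cumulative_skip_outputs` is a hypothetical helper, the skip outputs are assumed to be available as an array of shape (blocks, samples), and any output layers applied after the sum are left out.

```python
import numpy as np

def cumulative_skip_outputs(skip_outputs):
    """Running sum over blocks.

    Row k of the result is the sum of the skip outputs of blocks 1..k+1,
    so the last row is the sum over all blocks.
    """
    return np.cumsum(np.asarray(skip_outputs, dtype=float), axis=0)

# Placeholder data: 20 blocks, 5 samples each (not real model outputs).
skips = np.ones((20, 5))
cumulative = cumulative_skip_outputs(skips)
first_ten = cumulative[9]    # intermediate output of blocks 1-10
all_blocks = cumulative[-1]  # sum over all 20 blocks
```

Comparing `first_ten` for the adaptive->fixed and fixed->adaptive orderings corresponds to the kind of comparison the spectrograms make.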
Furthermore, the audio files of the QPPWG intermediate outputs are also provided.
| QPPWG (adaptive->fixed) | QPPWG (fixed->adaptive) |
---|---|---|
1-10 blocks | adaptive blocks | fixed blocks |
Conditioned on 1×F0 | ||
Outputs of 1-10 blocks | ||
Final outputs | ||
Conditioned on ½×F0 | ||
Outputs of 1-10 blocks | ||
Final outputs | ||
Conditioned on 2×F0 | ||
Outputs of 1-10 blocks | ||
Final outputs | ||
The cumulative outputs of the adaptive blocks are excitation-signal-like and highly pitch-dependent, whereas those of the fixed blocks are spectral-related and less pitch-dependent. The results confirm our assumption that the adaptive blocks with the PDCNNs primarily model the pitch-related speech components with long-term correlations, while the fixed blocks with the DCNNs mainly focus on the spectral-related speech components with short-term correlations.