Quasi-Periodic Parallel WaveGAN

This page presents demos for the following papers:

“Quasi-periodic parallel WaveGAN: a non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural networks” [paper] [code] [YouTube] [Medium]
“Quasi-periodic parallel WaveGAN vocoder: a non-autoregressive pitch-dependent dilated convolution model for parametric speech generation” [paper] [highlight] [fullVideo]

Abstract

We propose a Quasi-Periodic Parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a compact GAN-based raw waveform generative model that generates much faster than real time because of its non-autoregressive (non-AR) and non-causal mechanisms. Although PWG achieves high-fidelity speech generation, its generic and simple network architecture lacks pitch controllability for unseen auxiliary pitches, such as a scaled pitch. To improve the pitch and speech modeling capability, we apply a QP structure with PDCNNs to the generator of PWG, which introduces pitch information to the network by dynamically changing the network architecture according to the auxiliary pitches.
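To make "dynamically changing the network architecture" more concrete, here is a minimal sketch of how a per-frame, pitch-dependent dilation size could be derived. It assumes the formulation used in the QPNet line of work, where the fixed dilation is multiplied by a pitch-dependent factor E_t = F_s / (F0_t × a), with a "dense factor" a. The function name, the example values, and the handling of unvoiced frames are illustrative assumptions and are not taken from this page.

```python
def pitch_dependent_dilation(f0_hz, base_dilation, sample_rate, dense_factor):
    """Return an effective dilation (in samples) for one frame.

    Assumption: scale = sample_rate / (f0_hz * dense_factor), and the
    effective dilation is scale * base_dilation, rounded to an integer.
    Frames with f0_hz <= 0 fall back to the fixed dilation; that fallback
    is a choice made for this sketch only.
    """
    if f0_hz <= 0:
        return base_dilation
    scale = sample_rate / (f0_hz * dense_factor)
    return max(1, round(scale * base_dilation))


# Hypothetical values: 22050 Hz sampling rate, dense factor 4.
print(pitch_dependent_dilation(220.5, 1, 22050, 4))
print(pitch_dependent_dilation(110.25, 1, 22050, 4))
```

Under this assumption, halving F0 doubles the effective dilation, so the spacing between the samples a layer looks at follows the pitch period.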

Corpus and references:
VCC2018
PWG
PWG_repo
QPNet

Architecture of PWG/QPPWG

Generator of QPPWG

Non-AR PDCNN

Demo Sounds

Conditioned on 1×F₀

Vocoder	Female (SF3)	Male (SM3)
Natural
WORLD (*1)
QPNet (*2)
PWG_30 (*3)
PWG_20 (*4)
QPPWG_20 (*5)
PWG_16 (*6)
QPPWG_16 (*7)

*1 WORLD: Baseline I
*2 QPNet: Baseline II
*3 PWG_30: PWG vocoder with 30 fixed blocks
*4 PWG_20: PWG vocoder with 20 fixed blocks
*5 QPPWG_20: QPPWG vocoder with 10 adaptive blocks + 10 fixed blocks
*6 PWG_16: PWG vocoder with 16 fixed blocks
*7 QPPWG_16: QPPWG vocoder with 8 adaptive blocks + 8 fixed blocks

Conditioned on ½×F₀

Vocoder	Female (SF3)	Male (SM3)
WORLD
QPNet
PWG_30
PWG_20
QPPWG_20
PWG_16
QPPWG_16

Conditioned on 2×F₀

Vocoder	Female (SF3)	Male (SM3)
WORLD
QPNet
PWG_30
PWG_20
QPPWG_20
PWG_16
QPPWG_16

Subjective Results

MOS results of speech quality

XAB results of pitch accuracy

Visualized Intermediate Outputs

Because the waveform outputs of the PWG/QPPWG models are the cumulative results of the skip connections from the residual blocks, the speech modeling behavior of the residual blocks can be explored via the visualized intermediate outputs of partial residual blocks. The following table shows the spectrograms of the intermediate outputs of the cumulative residual blocks.
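The idea of "intermediate outputs of the cumulative residual blocks" can be sketched as a running sum over per-block skip outputs. This is a toy illustration only: the list-of-lists layout and the function name are hypothetical, and any output layers a real model applies after the sum are omitted.

```python
def cumulative_skip_outputs(skip_outputs):
    """Return running sums of per-block skip outputs.

    skip_outputs: list of equal-length lists of floats, one per residual
    block (hypothetical layout). Element k of the result is the sum of the
    skip outputs of blocks 1..k+1, i.e. the intermediate output that would
    be visualized after k+1 cumulative blocks.
    """
    if not skip_outputs:
        return []
    running = [0.0] * len(skip_outputs[0])
    partial_sums = []
    for block_out in skip_outputs:
        # Build a new list each step so earlier partial sums stay unchanged.
        running = [acc + x for acc, x in zip(running, block_out)]
        partial_sums.append(running)
    return partial_sums


# Three toy "blocks", two samples each.
toy = [[1.0, 2.0], [0.5, 0.5], [-1.0, 0.0]]
for k, partial in enumerate(cumulative_skip_outputs(toy), start=1):
    print(f"blocks 1-{k}:", partial)
```

The last element corresponds to the sum over all blocks, and earlier elements correspond to the partial outputs of the first k blocks.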

PWG (PWG_20)	QPPWG (adaptive->fixed)	QPPWG (fixed->adaptive)
1-20: fixed blocks	1-10: adaptive blocks; 11-20: fixed blocks	1-10: fixed blocks; 11-20: adaptive blocks
Conditioned on 1×F₀

Conditioned on ½×F₀

Conditioned on 2×F₀

The results show the following:

PWG (PWG_20): spectrograms contain more harmonic and non-harmonic details as the number of cumulative residual blocks increases.
QPPWG (adaptive->fixed): the first ten adaptive blocks focus on modeling the harmonic components.
QPPWG (fixed->adaptive): the first ten fixed blocks focus on modeling the non-harmonic components.

Audio files of the QPPWG intermediate outputs are also provided.

	QPPWG (adaptive->fixed)	QPPWG (fixed->adaptive)
1-10 blocks	adaptive blocks	fixed blocks
Conditioned on 1×F₀
outputs of 1-10 blocks
Final outputs
Conditioned on ½×F₀
outputs of 1-10 blocks
Final outputs
Conditioned on 2×F₀
outputs of 1-10 blocks
Final outputs

The cumulative outputs of the adaptive blocks are excitation-signal-like and highly pitch-dependent, while those of the fixed blocks are spectral-related and less pitch-dependent. The results confirm our assumption that the adaptive blocks with the PDCNNs primarily model the pitch-related speech components with long-term correlations, while the fixed blocks with the DCNNs mainly focus on the spectral-related speech components with short-term correlations.
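One way to see why pitch-dependent dilations relate to long-term correlations is to compare receptive field lengths. The sketch below uses the standard receptive-field formula for stacked dilated convolutions; the dilation list, kernel size, and `pitch_scale` value are hypothetical and are not the configuration of the models on this page.

```python
def receptive_field(dilations, kernel_size=3, pitch_scale=1.0):
    """Receptive field length (in samples) of stacked dilated convolutions.

    Each layer with dilation d and kernel size k widens the field by
    (k - 1) * d samples. pitch_scale multiplies every dilation and stands
    in for a pitch-dependent factor; 1.0 corresponds to fixed blocks.
    """
    span = 1
    for d in dilations:
        effective = max(1, round(d * pitch_scale))
        span += (kernel_size - 1) * effective
    return span


dilations = [1, 2, 4, 8]  # hypothetical dilation cycle
fixed = receptive_field(dilations)
adaptive = receptive_field(dilations, pitch_scale=25.0)  # hypothetical factor
print(fixed, adaptive)
```

With the same number of layers, scaling the dilations by a pitch-dependent factor greater than one stretches the receptive field over more samples, which is consistent with the adaptive blocks capturing longer-term, pitch-related structure.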

