ComplexDec: A Domain-robust High-fidelity Neural Audio Codec with Complex Spectrum Modeling

Yi-Chiao Wu, Dejan Marković, Steven Krenn, Israel D. Gebru, and Alexander Richard
Meta Reality Labs Research, USA


This page is the demo of ComplexDec [paper]

Abstract

Neural audio codecs have been widely adopted in audio-generative tasks because their compact and discrete representations are suitable for both large-language-model-style and regression-based generative models. However, most neural codecs struggle to model out-of-domain audio, resulting in error propagation to downstream generative tasks. In this paper, we first argue that information loss from codec compression degrades out-of-domain robustness. Then, we propose full-band 48 kHz ComplexDec with complex spectral input and output to reduce the information loss while adopting the same 24 kbps bitrate as the baseline AudioDec and ScoreDec. Objective and subjective evaluations demonstrate the out-of-domain robustness of ComplexDec trained using only the 30-hour VCTK corpus.
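To make "complex spectral input and output" concrete, here is a toy sketch that is not the ComplexDec network: it splits a frame's complex spectrum into real and imaginary channels, rebuilds the waveform from them, and compares that with a magnitude-only reconstruction that discards phase. The function names, the 64-sample frame, and the 3 kHz test tone are assumptions chosen for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spec):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spec)
    return [
        (sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def to_channels(spec):
    """Split complex bins into separate real and imaginary channels."""
    return [c.real for c in spec], [c.imag for c in spec]


def from_channels(real, imag):
    """Recombine real and imaginary channels into complex bins."""
    return [complex(r, i) for r, i in zip(real, imag)]


SAMPLE_RATE = 48000  # full-band rate used on this page
# Hypothetical test frame: a 3 kHz tone with an arbitrary phase offset.
frame = [math.sin(2 * math.pi * 3000 * t / SAMPLE_RATE + 0.7) for t in range(64)]

spec = dft(frame)
real, imag = to_channels(spec)

# Real + imaginary channels keep the phase, so the frame is recovered.
restored = idft(from_channels(real, imag))
max_err = max(abs(a - b) for a, b in zip(frame, restored))

# Magnitude only: the phase is gone and the waveform no longer matches.
mag_only = idft([abs(c) for c in spec])
mag_err = max(abs(a - b) for a, b in zip(frame, mag_only))

print(f"complex round-trip error: {max_err:.2e}")
print(f"magnitude-only error:     {mag_err:.2f}")
```

The round trip is exact only up to floating-point error and before any quantization; in a real codec the loss comes from the compressed, discrete bottleneck between the input and the output.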

Architecture

Demo Sounds

Codec Amusing Anger
Natural 48 kHz
Natural 24 kHz
AudioDec (in-domain)
ScoreDec (in-domain)
ComplexDec (in-domain)
AudioDec (out-of-domain)
ScoreDec (out-of-domain)
ComplexDec (out-of-domain)
Encodec 48 kHz
Encodec 24 kHz
DAC 24 kHz
Codec Reading Loud
Natural 48 kHz
Natural 24 kHz
AudioDec (in-domain)
ScoreDec (in-domain)
ComplexDec (in-domain)
AudioDec (out-of-domain)
ScoreDec (out-of-domain)
ComplexDec (out-of-domain)
Encodec 48 kHz
Encodec 24 kHz
DAC 24 kHz
Codec Reading Whisper
Natural 48 kHz
Natural 24 kHz
AudioDec (in-domain)
ScoreDec (in-domain)
ComplexDec (in-domain)
AudioDec (out-of-domain)
ScoreDec (out-of-domain)
ComplexDec (out-of-domain)
Encodec 48 kHz
Encodec 24 kHz
DAC 24 kHz

Speech Quality Measurements

ComplexDec achieves similar in-domain and out-of-domain coding quality, while AudioDec and ScoreDec degrade significantly when coding out-of-domain speech. ComplexDec also significantly outperforms the open-source Encodec models. These results indicate that severe information loss cannot be fully compensated by the score-based post-filter (SPF) of ScoreDec or by simply increasing the amount of training data. DAC also achieves impressive out-of-domain robustness because of its low compression ratio. However, the marked quality gap between ComplexDec and DAC reflects the significant perceptual quality difference between 48 kHz and 24 kHz speech.
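One common way to put a number on "compression ratio" is to divide the raw PCM bitrate by the codec bitrate. The sketch below does that arithmetic under assumptions not stated on this page: 16-bit mono PCM as the reference, and a hypothetical 24 kHz codec running at the same 24 kbps for comparison. The paper may define the ratio differently.

```python
def pcm_bitrate_kbps(sample_rate_hz, bit_depth=16, channels=1):
    """Bitrate of uncompressed PCM audio in kbps (16-bit mono assumed by default)."""
    return sample_rate_hz * bit_depth * channels / 1000


def compression_ratio(sample_rate_hz, codec_kbps, bit_depth=16, channels=1):
    """Raw PCM bitrate divided by the codec bitrate."""
    return pcm_bitrate_kbps(sample_rate_hz, bit_depth, channels) / codec_kbps


# 48 kHz audio coded at the 24 kbps bitrate quoted in the abstract.
ratio_48k = compression_ratio(48000, 24.0)
# Hypothetical 24 kHz codec at the same bitrate, for comparison only.
ratio_24k = compression_ratio(24000, 24.0)

print(f"48 kHz @ 24 kbps: {ratio_48k:.0f}x")  # 768 kbps / 24 kbps
print(f"24 kHz @ 24 kbps: {ratio_24k:.0f}x")  # 384 kbps / 24 kbps
```

Under these assumptions, a fixed bitrate means a full-band 48 kHz codec must compress twice as hard as a 24 kHz one.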


Out-of-domain Magnitude Spectral Comparison

AudioDec fails to reconstruct the harmonic structures, and the resulting blurred spectrum produces hoarse speech. Although the SPF of ScoreDec can slightly sharpen the blurry spectrum thanks to its diffusion-based nature, the missing harmonics cannot be fully recovered. In contrast, ComplexDec preserves the harmonic structures below 6 kHz well because it suffers less information loss.
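For readers who want to inspect harmonic structure in their own recordings, the following minimal sketch computes a Hann-windowed magnitude spectrum and lists the spectral peaks below 6 kHz. It is not the analysis pipeline used for the figures on this page: the signal is synthetic (a 750 Hz fundamental with three overtones), and the function names, frame length, and peak threshold are assumptions.

```python
import cmath
import math

SAMPLE_RATE = 48000
FRAME_LEN = 256  # bin spacing = 48000 / 256 = 187.5 Hz


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 after applying a periodic Hann window."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * t / n)) for t, x in enumerate(frame)]
    return [
        abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def harmonic_peaks(mags, max_freq_hz=6000.0, rel_threshold=0.01):
    """Frequencies of local maxima below max_freq_hz and above a relative threshold."""
    floor = rel_threshold * max(mags)
    peaks = []
    for k in range(1, len(mags) - 1):
        freq = k * SAMPLE_RATE / FRAME_LEN
        if freq >= max_freq_hz:
            break
        if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] > mags[k + 1]:
            peaks.append(freq)
    return peaks


# Synthetic "voiced" frame: 750 Hz fundamental plus three decaying overtones.
frame = [
    sum((0.5 ** h) * math.sin(2 * math.pi * 750 * (h + 1) * t / SAMPLE_RATE) for h in range(4))
    for t in range(FRAME_LEN)
]

mags = magnitude_spectrum(frame)
peaks = harmonic_peaks(mags)
print(peaks)
```

A codec output with smeared or missing harmonics would show fewer or weaker peaks than the reference signal when passed through the same two functions.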


Liability Disclaimer

The demo page uses a public speech dataset (EARS) for demonstration purposes only. The content of the demo files is provided "as is" and for general informational purposes only. We make no warranties regarding its accuracy or suitability. If you believe that any speech samples infringe upon your rights or violate any laws, please contact us to remove the demo files. We are not liable for any damages arising from the use of or reliance on our demo page or open-source code. By accessing the demo page, you agree to release us from any claims or liabilities related to its use.

