Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder
Authors
Hyun-Wook Yoon (Korea University), hw_yoon@korea.ac.kr
Sang-Hoon Lee (Korea University), sh_lee@korea.ac.kr
Hyeong-Rae Noh (Korea University), hr_noh@korea.ac.kr
Seong-Whan Lee (Korea University), sw.lee@korea.ac.kr
Abstract
In recent works, flow-based networks have shown significant improvements in the real-time speech generation task. A sequence of flow operations allows the model to convert Gaussian latent variables into a waveform signal. However, training a continuous density model on discrete data may degrade the model's performance due to the topological difference between the latent and actual distributions. To resolve this problem, we propose various audio dequantization methods for the flow-based neural vocoder to achieve high-fidelity audio generation. Data dequantization is a well-known method in the image generation field, but its effect in the audio domain has yet to be investigated. For this reason, we implement various dequantization methods and investigate their effect on the generated audio. In addition, we present various objective performance assessments and a subjective evaluation to show that audio dequantization improves generated audio quality. Our experiments demonstrate that audio dequantization produces a better harmonic structure with fewer noise artifacts.
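To illustrate the core idea, the sketch below shows the simplest dequantization variant, uniform dequantization, as commonly used in image generation: each discrete 16-bit sample is spread over its quantization bin by adding uniform noise before the flow model is trained. This is an illustrative assumption, not the exact procedure from the paper, which compares several dequantization methods.

```python
import numpy as np

def uniform_dequantize(audio_int16: np.ndarray) -> np.ndarray:
    """Turn discrete 16-bit samples into continuous values in [-1, 1)
    by adding uniform noise within each quantization bin."""
    # Map each integer sample to the lower edge of its bin in [-1, 1).
    x = audio_int16.astype(np.float64) / 32768.0
    # One bin spans 1/32768 of the normalized range; uniform noise
    # inside the bin removes the degenerate discrete support.
    noise = np.random.uniform(0.0, 1.0 / 32768.0, size=x.shape)
    return x + noise

# Example: extreme and near-zero samples stay inside [-1, 1).
signal = np.array([0, 1, -32768, 32767], dtype=np.int16)
dequantized = uniform_dequantize(signal)
assert np.all(dequantized >= -1.0) and np.all(dequantized < 1.0)
```

Because the noise never crosses a bin boundary, the original integer sample is always recoverable by flooring, so no information is lost while the data becomes continuous.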
Demo