SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis

Teysir Baoueb; Haocheng Liu; Mathieu Fontaine; Jonathan Le Roux; Gael Richard

Conference Papers, Year: 2024

Teysir Baoueb

Function : Author
PersonId : 1343186
ORCID : 0009-0001-2263-4309

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Haocheng Liu

Function : Author
PersonId : 1344278

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Mathieu Fontaine

Function : Author
PersonId : 13405
IdHAL : mathieu-fontaine
ORCID : 0000-0002-7657-6271
IdRef : 236886681

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Jonathan Le Roux

Function : Author

Mitsubishi Electric Research Laboratories

Gael Richard

Function : Author
PersonId : 14146
IdHAL : gael-richard
IdRef : 094977208

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Abstract

Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues, including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from mel spectrograms. In our model, training stability is enhanced by means of a forward diffusion process, which consists of injecting noise from a Gaussian distribution into both real and fake samples before feeding them to the discriminator. We further improve the model by exploiting a spectrally-shaped noise distribution with the aim of making the discriminator's task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably with several baselines in audio quality and efficiency.
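
The noise-injection idea in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption: the mixing rule is the standard Gaussian forward diffusion step (scale the signal by sqrt(alpha_bar_t), the noise by sqrt(1 - alpha_bar_t)), the spectral shaping is approximated by a simple FIR filter applied to white noise, and the names `diffuse`, `shaped_noise`, `taps`, and `alpha_bar_t` are hypothetical. It is not the authors' implementation.

```python
import math
import random


def diffuse(x0, alpha_bar_t, noise):
    """Standard Gaussian forward diffusion step (assumed form, not from the paper)."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, noise)]


def shaped_noise(rng, n, taps):
    """White Gaussian noise passed through an FIR filter, then
    normalised to zero mean and unit variance (hypothetical shaping)."""
    white = [rng.gauss(0.0, 1.0) for _ in range(n + len(taps) - 1)]
    shaped = [sum(h * white[i + k] for k, h in enumerate(taps)) for i in range(n)]
    mean = sum(shaped) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in shaped) / n)
    return [(s - mean) / std for s in shaped]


rng = random.Random(0)
n = 512
# Stand-ins for a real waveform and a generator output.
real = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
fake = [x + 0.1 * rng.gauss(0.0, 1.0) for x in real]

taps = [0.25, 0.5, 0.25]  # illustrative low-pass filter, i.e. more energy at low frequencies
alpha_bar_t = 0.9         # illustrative noise level

# Both real and fake samples are noised at the same level
# before they would be passed to the discriminator.
noisy_real = diffuse(real, alpha_bar_t, shaped_noise(rng, n, taps))
noisy_fake = diffuse(fake, alpha_bar_t, shaped_noise(rng, n, taps))
```

With `alpha_bar_t = 1.0` the inputs pass through unchanged, and smaller values mix in more noise. How the noise level is scheduled and how the spectral shape is derived are not stated in the abstract and are left open here.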

Keywords

Generative adversarial network (GAN); diffusion process; deep audio synthesis; spectral envelope

Domains

Machine Learning [cs.LG]; Sound [cs.SD]; Signal and Image Processing

Main file

ICASSP_2024_SpecDiff_GAN___Preprint.pdf (466.56 KB)

Origin : Files produced by the author(s)

Contributor : Teysir Baoueb

https://hal.science/hal-04423979

Submitted on : Monday, January 29, 2024, 1:54:24 PM

Last modification on : Wednesday, February 14, 2024, 3:19:27 PM

Dates and versions

hal-04423979 , version 1 (29-01-2024)

Identifiers

HAL Id : hal-04423979 , version 1

Cite

Teysir Baoueb, Haocheng Liu, Mathieu Fontaine, Jonathan Le Roux, Gael Richard. SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul, South Korea. ⟨hal-04423979⟩

Collections

INSTITUT-TELECOM LTCI IDS S2A IP_PARIS
