Multi-SpectroGAN: High-Diversity and High-Fidelity Spectrogram Generation with Adversarial Style Recombination for Speech Synthesis

BLIND

ABSTRACT

Although neural text-to-speech (TTS) systems based on generative adversarial networks (GANs) have significantly improved neural speech synthesis, no existing TTS system learns to synthesize speech from text sequences with adversarial feedback alone. Because adversarial feedback by itself is not sufficient to train the generator, current models still require a reconstruction loss that directly compares the generated mel-spectrogram with the ground truth. In this paper, we present Multi-SpectroGAN (MSG), which can train a multi-speaker model with only adversarial feedback by conditioning a conditional discriminator on a self-supervised hidden representation of the generator. This provides better guidance for generator training. We also propose adversarial style recombination (ASR) for better generalization to unseen speaking styles and transcripts; ASR learns latent representations of style embeddings combined from multiple mel-spectrograms. Trained with ASR and feature matching, MSG synthesizes high-diversity mel-spectrograms by controlling and mixing individual speaking styles (e.g., duration, pitch, and energy). The results show that MSG synthesizes high-fidelity mel-spectrograms whose naturalness MOS is almost the same as that of ground-truth mel-spectrograms.
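
The abstract mentions feature matching. As the term is commonly used in GAN training, it penalizes the distance between the discriminator's intermediate features for real and generated inputs. The following is a minimal sketch under that common definition, not the authors' implementation: the function name `feature_matching_loss`, the use of a mean L1 distance averaged over layers, and the toy feature shapes are all assumptions made for illustration.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between per-layer discriminator features.

    real_feats / fake_feats: lists of arrays, one per discriminator layer
    (hypothetical layout; a real model would return these from its forward pass).
    """
    if len(real_feats) != len(fake_feats):
        raise ValueError("feature lists must have the same number of layers")
    per_layer = [np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats)]
    return float(np.mean(per_layer))

# Toy stand-ins for intermediate features of two discriminator layers.
rng = np.random.default_rng(0)
real = [rng.normal(size=(2, 8, 16)), rng.normal(size=(2, 4, 32))]
fake = [r + 0.1 for r in real]  # generated features offset by a constant
loss = feature_matching_loss(real, fake)
```

In this toy setup the loss is zero when the generated features equal the real ones and grows with their average absolute difference.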

STYLE RECOMBINATION

Style: Duration (1, 0), Pitch (1, 0), Energy (1, 0)

p237 (Male)

Mixed

p305 (Female)
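
The controls above mix the duration, pitch, and energy styles of two reference speakers. The sketch below shows one way such per-attribute mixing could work, assuming each style is a fixed-size embedding vector. It is illustrative only: the names `recombine_styles` and `sample_ratios`, the embedding size, the reading of "Bern" as a binary 0/1 choice and "Mixup" as a continuous ratio (drawn uniformly here), and the reading of "same ratios" as one ratio shared across all attributes are assumptions, not details confirmed by this page.

```python
import numpy as np

STYLES = ("duration", "pitch", "energy")

def recombine_styles(style_a, style_b, ratios):
    """Mix two speakers' style embeddings attribute by attribute.

    ratios[name] = 1.0 keeps speaker A's style, 0.0 keeps speaker B's.
    """
    return {
        name: ratios[name] * style_a[name] + (1.0 - ratios[name]) * style_b[name]
        for name in STYLES
    }

def sample_ratios(rng, mode="mixup", same=True):
    """Draw mixing ratios (hypothetical helper).

    mode="bern": each ratio is 0 or 1; mode="mixup": uniform in [0, 1].
    same=True shares one ratio across all attributes.
    """
    n = 1 if same else len(STYLES)
    if mode == "bern":
        draws = rng.integers(0, 2, size=n).astype(float)
    elif mode == "mixup":
        draws = rng.uniform(0.0, 1.0, size=n)
    else:
        raise ValueError(f"unknown mode: {mode}")
    if same:
        draws = np.repeat(draws, len(STYLES))
    return dict(zip(STYLES, draws))

# Toy embeddings standing in for two reference speakers.
rng = np.random.default_rng(0)
dim = 4
style_a = {name: rng.normal(size=dim) for name in STYLES}
style_b = {name: rng.normal(size=dim) for name in STYLES}

# Duration from speaker A, pitch from speaker B, energy half and half.
mixed = recombine_styles(
    style_a, style_b, {"duration": 1.0, "pitch": 0.0, "energy": 0.5}
)
```

With ratios of exactly 1 or 0, each attribute is taken whole from one speaker; intermediate ratios interpolate between the two embeddings.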

MULTI-SPEAKER TTS

p229 (Female)

Script : The actual primary rainbow observed is said to be the effect of super-imposition of a number of bows.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

p243 (Male)

Script : Ask her to bring these things with her from the store.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

p228 (Female)

Script : People look, but no one ever finds it.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

p237 (Male)

Script : If the red of the second bow falls upon the green of the first, the result is to give a bow with an abnormally wide yellow band, since red and green light when mixed form yellow.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

p301 (Female)

Script : There is, according to legend, a boiling pot of gold at one end.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

p252 (Male)

Script : There is, according to legend, a boiling pot of gold at one end.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

ZERO-SHOT TTS

p236 (Female)

Script : Some have accepted it as a miracle without physical explanation.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

p241 (Male)

Script : Aristotle thought that the rainbow was caused by reflection of the sun's rays by the rain.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

p240 (Female)

Script : When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

p246 (Male)

Script : Since then physicists have found that it is not reflection, but refraction by the raindrops which causes the rainbows.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

p265 (Female)

Script : Others have tried to explain the phenomenon physically.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)

p363 (Male)

Script : Some have accepted it as a miracle without physical explanation.

GT

GT + PWG

Tacotron2

GST

FastSpeech2

MSG

MSG+ASR (Bern, same ratios)

MSG+ASR (Mixup, same ratios)

MSG+ASR (Bern, different ratios)

MSG+ASR (Mixup, different ratios)