Noise-robust zero-shot text-to-speech synthesis conditioned by
self-supervised speech-representation model with adapters

Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa,
Marc Delcroix, Takafumi Moriya, and Yusuke Ijima

NTT Corporation, Japan

Samples

  • HiFi-GAN was used to generate all speech samples.
  • All speakers on this page were chosen from the test data.

Performance comparison on zero-shot TTS

  • All speech samples were generated using the method with speech enhancement.
  • Enhanced reference speech: the reference speech after speech enhancement. The SSL model takes this enhanced speech as its input.
  • Without fine-tuning: the pretrained TTS model based on the SSL model, used without fine-tuning.
  • Without adapters: fine-tuned without adapters.
  • With bottleneck and CNN adapters: fine-tuned with bottleneck and CNN adapters.
  • "Original" in the tables is the target speaker's speech with the same content as the generated speech. Note that this is not the reference speech.

Female

  • music (music-jamendo-0197)
  • Reference speech | Enhanced reference speech | Without fine-tuning | Without adapters | With bottleneck and CNN adapters | Original
    clean
    10 dB
    -5 dB
  • speech (speech-librivox-0172)
  • Reference speech | Enhanced reference speech | Without fine-tuning | Without adapters | With bottleneck and CNN adapters | Original
    clean
    10 dB
    -5 dB
  • general noise (noise-free-sound-0826)
  • Reference speech | Enhanced reference speech | Without fine-tuning | Without adapters | With bottleneck and CNN adapters | Original
    clean
    10 dB
    -5 dB

Male

  • music (music-jamendo-0205)
  • Reference speech | Enhanced reference speech | Without fine-tuning | Without adapters | With bottleneck and CNN adapters | Original
    clean
    10 dB
    -5 dB
  • speech (speech-us-gov-0238)
  • Reference speech | Enhanced reference speech | Without fine-tuning | Without adapters | With bottleneck and CNN adapters | Original
    clean
    10 dB
    -5 dB
  • general noise (noise-free-sound-0768)
  • Reference speech | Enhanced reference speech | Without fine-tuning | Without adapters | With bottleneck and CNN adapters | Original
    clean
    10 dB
    -5 dB
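
The row labels "10 dB" and "-5 dB" in the tables above presumably give the signal-to-noise ratio (SNR) at which the named noise clip was mixed into the reference speech. This page does not describe the mixing procedure, so the following is only a minimal sketch under that assumption. The function names `rms`, `mix_at_snr`, and `measured_snr_db` are hypothetical and are not taken from the authors' code.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise ratio equals snr_db.

    Assumption (not from the page): SNR is defined on RMS amplitude over the
    whole utterance, and a noise clip shorter than the speech is looped.
    """
    speech = list(speech)
    noise = list(noise)
    if not speech or not noise:
        raise ValueError("speech and noise must not be empty")
    # Loop the noise if needed, then trim it to the speech length.
    repeats = len(speech) // len(noise) + 1
    noise = (noise * repeats)[: len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise is silent; SNR is undefined")
    # Gain that makes 20*log10(rms(speech) / rms(gain*noise)) equal snr_db.
    gain = rms(speech) / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR in dB of a mixture, given the clean speech it was built from."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 20 * math.log10(rms(speech) / rms(residual))


if __name__ == "__main__":
    # Synthetic stand-ins for a speech signal and a noise clip (16 kHz).
    speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
    noise = [math.sin(2 * math.pi * 3000 * t / 16000 + 0.5) for t in range(4000)]
    for snr in (10, -5):
        mixed = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, mixed), 2))
```

Under this definition, the noise at -5 dB has a higher RMS amplitude than the speech itself, which makes it the harder of the two noisy conditions for the reference speech.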