Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Anonymous submission to Interspeech 2024

Samples

  • HiFi-GAN was used for all speech sample generation.
  • Fine-tuning HiFi-GAN for each TTS model would improve naturalness and speaker similarity, but we did not do so, so that the comparison reflects the TTS models themselves rather than the vocoder.
  • All speakers on this page were chosen from the test data.
Performance comparison on zero-shot TTS

  • Small, Medium, Large: the baseline models, which differ in the number of parameters.
  • MoA(dense), MoA(sparse): the proposed methods. Mixture of Adapters (MoA) modules are inserted into the Small model.
  • "Original" in the table is the target speaker's own recording of the same content as the generated speech. Note that it is not the reference speech.
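
As a rough illustration of how the dense and sparse variants listed above could differ, the following is a minimal sketch, not the implementation used for these samples. It assumes bottleneck adapters (down-projection, ReLU, up-projection) added residually to a hidden vector, and a softmax gate driven by a speaker embedding. The function names, dimensions, and the top-k rule for the sparse case are all hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def adapter(x, down, up):
    """Bottleneck adapter (assumed form): up(ReLU(down(x)))."""
    hidden = [max(0.0, v) for v in matvec(down, x)]
    return matvec(up, hidden)


def mixture_of_adapters(x, spk, adapters, gate, top_k=None):
    """Return x plus a gated sum of adapter outputs, and the gate weights.

    top_k=None -> dense: every adapter contributes.
    top_k=k    -> sparse: only the k highest-weighted adapters are
                  evaluated, and their weights are renormalised.
    """
    weights = softmax(matvec(gate, spk))  # one weight per adapter
    if top_k is not None:
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        keep = set(order[:top_k])
        kept_total = sum(weights[i] for i in keep)
        weights = [w / kept_total if i in keep else 0.0
                   for i, w in enumerate(weights)]
    out = list(x)  # residual connection
    for w, (down, up) in zip(weights, adapters):
        if w == 0.0:
            continue  # adapters dropped by the sparse gate are skipped
        y = adapter(x, down, up)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, weights


def rand_matrix(rows, cols):
    """Small random matrix standing in for trained parameters."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen only for illustration.
random.seed(0)
dim, bottleneck, spk_dim, n_adapters = 8, 2, 4, 4
adapters = [(rand_matrix(bottleneck, dim), rand_matrix(dim, bottleneck))
            for _ in range(n_adapters)]
gate = rand_matrix(n_adapters, spk_dim)
x = [random.uniform(-1.0, 1.0) for _ in range(dim)]
spk = [random.uniform(-1.0, 1.0) for _ in range(spk_dim)]

dense_out, dense_w = mixture_of_adapters(x, spk, adapters, gate)
sparse_out, sparse_w = mixture_of_adapters(x, spk, adapters, gate, top_k=2)
```

In this sketch the dense variant evaluates all adapters, while the sparse variant evaluates only the selected ones, so the computation per input is smaller even though the total number of adapter parameters is the same.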
Samples

    Models Reference speech Small Medium Large Proposed(dense) Proposed(sparse) Original
    Female(nonprofessional)
    Male(nonprofessional)
    Female(professional)
    Male(professional)