Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Samples

HiFi-GAN was used for all speech sample generation.

Fine-tuning HiFi-GAN for each TTS model would improve naturalness and speaker similarity, but we did not use fine-tuned HiFi-GANs so that the TTS models themselves can be compared under the same vocoder.

All speakers on this page are chosen from the test data.

Performance comparison on zero-shot TTS

Small, Medium, Large: baseline models with different numbers of parameters.

MoA(dense), MoA(sparse): the proposed methods, in which mixture-of-adapters (MoA) modules are inserted into the Small model. They appear as Proposed(dense) and Proposed(sparse) in the table.

"Original" in the table is the target speaker's own speech with the same content as the generated speech. Note that this is not the reference speech.

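For readers unfamiliar with the dense/sparse distinction between the two MoA variants, the following is a minimal sketch and not the implementation used for these samples. It assumes bottleneck adapters (down-projection, ReLU, up-projection), a softmax gate driven by a conditioning vector such as a speaker embedding, and top-k selection for the sparse case; none of these details are stated on this page, and all names (`make_adapter`, `mixture_of_adapters`, `gate_W`, `cond`) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def make_adapter(dim, bottleneck, rng):
    """Hypothetical bottleneck adapter: (down-projection, up-projection)."""
    down = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(bottleneck)]
    up = [[rng.gauss(0, 0.1) for _ in range(bottleneck)] for _ in range(dim)]
    return down, up


def apply_adapter(adapter, h):
    """Down-project, apply ReLU, then up-project back to the hidden size."""
    down, up = adapter
    z = [max(0.0, v) for v in matvec(down, h)]
    return matvec(up, z)


def mixture_of_adapters(h, adapters, gate_W, cond, top_k=None):
    """Return (h + gated sum of adapter outputs, gate weights).

    cond is an assumed conditioning vector (e.g. a speaker embedding).
    top_k=None mixes every adapter (dense); top_k=k keeps only the k
    highest-weighted adapters and renormalizes their weights (sparse).
    """
    weights = softmax(matvec(gate_W, cond))
    if top_k is not None:
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        keep = order[:top_k]
        total = sum(weights[i] for i in keep)
        weights = [weights[i] / total if i in keep else 0.0
                   for i in range(len(weights))]
    out = list(h)  # residual connection around the adapters
    for w, adapter in zip(weights, adapters):
        if w == 0.0:
            continue  # sparse case: adapters that were not selected are skipped
        delta = apply_adapter(adapter, h)
        out = [o + w * d for o, d in zip(out, delta)]
    return out, weights


# Toy dimensions chosen only for illustration.
rng = random.Random(0)
dim, bottleneck, n_adapters, cond_dim = 8, 2, 4, 3
adapters = [make_adapter(dim, bottleneck, rng) for _ in range(n_adapters)]
gate_W = [[rng.gauss(0, 1) for _ in range(cond_dim)] for _ in range(n_adapters)]
h = [rng.gauss(0, 1) for _ in range(dim)]
cond = [rng.gauss(0, 1) for _ in range(cond_dim)]

dense_out, dense_w = mixture_of_adapters(h, adapters, gate_W, cond)
sparse_out, sparse_w = mixture_of_adapters(h, adapters, gate_W, cond, top_k=2)
```

In this sketch the dense variant evaluates every adapter, while the sparse variant evaluates only the selected ones, so its cost per step depends on `top_k` rather than on the total number of adapters.
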
Samples

Models	Reference speech	Small	Medium	Large	Proposed(dense)	Proposed(sparse)	Original
Female(nonprofessional)
Male(nonprofessional)
Female(professional)
Male(professional)