VC-T: Streaming voice conversion based on neural transducer

Authors

Hiroki Kanagawa*, Takafumi Moriya*, Yusuke Ijima (NTT Corporation, Japan)
* equal contribution

Paper

ISCA Interspeech 2023, 20th-24th August 2023

Note

If you are having trouble playing the audio samples, try refreshing the page.

Contents

Comparison methods

We trained three types of one-to-one VC models. Each model corresponds to one conversion pair: intra-gender conversion (Female-to-Female: F2F) or cross-gender conversion (Male-to-Female: M2F, and Female-to-Male: F2M).

In our experiments, we compared audio obtained by the following methods: GT, Resyn, BNE-S2SMoL-VC, ConvS2S-VC (offline and streaming), and VC-T (offline and streaming).

Audio samples

Female-to-Female conversion

Utterance | Source speaker | Target speaker (oracle) | GT | Resyn | BNE-S2SMoL-VC | ConvS2S-VC (offline) | ConvS2S-VC (stream) | VC-T (offline) | VC-T (stream)
sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)

Female-to-Male conversion

Utterance | Source speaker | Target speaker (oracle) | GT | Resyn | BNE-S2SMoL-VC | ConvS2S-VC (offline) | ConvS2S-VC (stream) | VC-T (offline) | VC-T (stream)
sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)

Male-to-Female conversion

Utterance | Source speaker | Target speaker (oracle) | GT | Resyn | BNE-S2SMoL-VC | ConvS2S-VC (offline) | ConvS2S-VC (stream) | VC-T (offline) | VC-T (stream)
sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)
sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) | (audio)