We trained three types of one-to-one VC models.
Each model corresponds to intra-gender conversion (Female-to-Female: F2F) or cross-gender conversion (Male-to-Female: M2F, and Female-to-Male: F2M).
In our experiments, we compared audio obtained by the following methods:
[oracle] Ground truth (GT)
Recorded speech
[oracle] Analysis-resynthesis (Resyn)
Spectra extracted from the recorded speech are directly vocoded by HiFi-GAN (v1)
BNE-S2SMoL-VC [Liu+, '21] (Section 2.2 in paper)
A PPG-based encoder-decoder model, which employs a hybrid CTC-attention ASR model as the VC encoder; its bottleneck features are mapped to the target speaker's spectra by a decoder with mixture-of-logistics (MoL) attention. Note that the VC encoder must be well trained in advance.
ConvS2S-VC (offline) [Kameoka+, '20] (Section 2.1 in paper)
A fully convolutional seq2seq VC model. Conversion with this model can be performed directly, without any pretraining such as that required by BNE-S2SMoL-VC.
ConvS2S-VC (stream)
A streaming version of ConvS2S-VC. This model replaces the non-causal convolutions with causal ones to enable streaming operation.
Proposed VC-T (offline) (Section 3 in paper)
One of our proposed RNNT-based VC variants. Its encoder utilizes the source speaker's entire utterance; the encoded features, together with the encoded features of the previously generated target frames, are then fed to the joint network following the RNNT inference procedure.
Proposed VC-T (stream)
The streaming version of VC-T. The source speaker encoder sees only the current and past contexts to enable streaming operation. Note that we built it with a GRU-based module to keep the implementation simple.
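The causal replacement behind the streaming models above can be illustrated with a 1-D convolution that pads only on the left, so each output frame depends on current and past inputs only. This is a minimal NumPy sketch under our own naming, not the models' actual implementation:

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D convolution (cross-correlation form) that sees only current
    and past samples: the input is left-padded with len(w) - 1 zeros,
    so output[t] depends on x[t - len(w) + 1 .. t] and never on the future.
    """
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(w, xp[t:t + k]) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5])  # toy 2-tap averaging kernel
y = causal_conv1d(x, w)   # [0.5, 1.5, 2.5, 3.5]
```

Because of the left-only padding, changing a future input frame leaves all earlier outputs unchanged, which is exactly what allows the convolution stack to run in a streaming fashion.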
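The RNNT-style fusion used in VC-T can be sketched as a joint network that combines one source-encoder frame with a feature summarizing the previously generated target frames. All dimensions, weights, and names below are our own toy assumptions for illustration; in VC-T the joint network emits spectral frames rather than token logits:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (ours, not the paper's).
d_enc, d_pred, d_joint, d_out = 4, 3, 5, 2

# Randomly initialized toy weights standing in for trained parameters.
W_enc = rng.standard_normal((d_enc, d_joint))
W_pred = rng.standard_normal((d_pred, d_joint))
W_out = rng.standard_normal((d_joint, d_out))

def joint(enc_t, pred_u):
    """RNNT-style joint network: fuse a source-encoder frame with a
    target-side (prediction-network) feature, then project to an
    output frame (here, a stand-in for a spectral frame)."""
    h = np.tanh(enc_t @ W_enc + pred_u @ W_pred)
    return h @ W_out

enc_t = rng.standard_normal(d_enc)    # encoded source frame at time t
pred_u = rng.standard_normal(d_pred)  # encoding of past target frames
y = joint(enc_t, pred_u)              # one output frame of size d_out
```

The key point is that each output frame conditions jointly on the source encoding and on what has already been generated, which is the RNNT inference pattern the description above refers to.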