We trained three types of one-to-one VC models.
Each model corresponds to intra-gender conversion (Female-to-Female: F2F) or cross-gender conversion (Male-to-Female: M2F, and Female-to-Male: F2M).
In our experiments, we compared audio obtained by the following methods:
[oracle] Ground truth (GT)
Recorded speech
[oracle] Analysis-resynthesis (Resyn)
Spectra extracted from the recorded speech are directly vocoded by HiFi-GAN (v1)
BNE-S2SMoL-VC [Liu+, '21] (Section 2.2 in paper)
A PPG-based encoder-decoder model, which employs a hybrid CTC-attention ASR model as the VC encoder; its bottleneck features are mapped to the target speaker's spectra by a decoder with mixture-of-logistics (MoL) attention. Note that the VC encoder must be well trained in advance.
ConvS2S-VC (offline) [Kameoka+, '20] (Section 2.1 in paper)
A fully convolutional seq2seq VC model. Conversion with this model can be performed directly, without any pretraining such as that required by BNE-S2SMoL-VC.
ConvS2S-VC (stream)
A streaming version of ConvS2S-VC. This model replaces the non-causal convolutions with causal ones to enable streaming operation.
Proposed VC-T (offline) (Section 3 in paper)
One of our proposed RNNT-based VC variants. Its encoder utilizes the source speaker's entire utterance; the encoded features, together with the encoded features of the previously generated target frames, are then fed to the joint network following the RNNT inference procedure.
Proposed VC-T (stream)
The streaming version of VC-T. The source speaker encoder sees only the current and past contexts to enable streaming operation. Note that we built it with a GRU-based module to keep the implementation simple.
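The causal replacement behind the streaming models above can be illustrated with a 1-D convolution that pads only on the left, so each output frame depends on current and past inputs only. This is a minimal NumPy sketch under our own naming, not the models' actual implementation:

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D convolution (cross-correlation form) that sees only current
    and past samples: the input is left-padded with len(w) - 1 zeros,
    so output[t] depends on x[t - len(w) + 1 .. t] and never on the future.
    """
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(w, xp[t:t + k]) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5])  # toy 2-tap averaging kernel
y = causal_conv1d(x, w)   # [0.5, 1.5, 2.5, 3.5]
```

Because of the left-only padding, changing a future input frame leaves all earlier outputs unchanged, which is exactly what allows the convolution stack to run in a streaming fashion.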
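The RNNT-style fusion used in VC-T can be sketched as a joint network that combines one source-encoder frame with a feature summarizing the previously generated target frames. All dimensions, weights, and names below are our own toy assumptions for illustration; in VC-T the joint network emits spectral frames rather than token logits:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (ours, not the paper's).
d_enc, d_pred, d_joint, d_out = 4, 3, 5, 2

# Randomly initialized toy weights standing in for trained parameters.
W_enc = rng.standard_normal((d_enc, d_joint))
W_pred = rng.standard_normal((d_pred, d_joint))
W_out = rng.standard_normal((d_joint, d_out))

def joint(enc_t, pred_u):
    """RNNT-style joint network: fuse a source-encoder frame with a
    target-side (prediction-network) feature, then project to an
    output frame (here, a stand-in for a spectral frame)."""
    h = np.tanh(enc_t @ W_enc + pred_u @ W_pred)
    return h @ W_out

enc_t = rng.standard_normal(d_enc)    # encoded source frame at time t
pred_u = rng.standard_normal(d_pred)  # encoding of past target frames
y = joint(enc_t, pred_u)              # one output frame of size d_out
```

The key point is that each output frame conditions jointly on the source encoding and on what has already been generated, which is the RNNT inference pattern the description above refers to.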