Voice Impression Control in Zero-Shot TTS

Single dimension impression modulation on zero-shot TTS

Speech samples (female)

Speech samples (male)

Zero-shot impression control with automatic mapping from freely described impressions using an LLM

Prompt example

Speech samples

Samples

All speech samples were generated with a HiFi-GAN vocoder.

All speakers on this page were chosen from the test data, i.e., their speech was generated under the zero-shot condition.

Single dimension impression modulation on zero-shot TTS

Altering specific dimensions of the speech impression vector
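
The single-dimension modulation described above can be pictured with a small sketch. It assumes that the impression vector has one element per dimension listed on this page (A to K) and that an offset from −3 to +3 is simply added to the chosen element; the page does not state the internal scale of the vector or whether values are clipped, so `modulate`, `sweep`, and the all-4.0 reference vector are hypothetical illustrations rather than the authors' implementation.

```python
from typing import Dict, List, Sequence

# Dimension labels as listed on this page (A to K).
DIMENSIONS: List[str] = [
    "High – Low pitched",
    "Masculine – Feminine",
    "Clear – Hoarse",
    "Calm – Restless",
    "Powerful – Weak",
    "Youthful – Elderly",
    "Thick – Thin",
    "Tense – Relaxed",
    "Dark – Bright",
    "Cold – Warm",
    "Slow – Fast",
]

# Offsets shown in the sample tables.
OFFSETS: List[int] = [-3, -2, -1, 0, 1, 2, 3]


def modulate(vector: Sequence[float], dim: str, offset: float) -> List[float]:
    """Return a copy of `vector` with `offset` added to one dimension only."""
    if len(vector) != len(DIMENSIONS):
        raise ValueError("expected one value per impression dimension")
    index = DIMENSIONS.index(dim)  # raises ValueError for an unknown label
    out = list(vector)  # copy, so the reference vector is left untouched
    out[index] += offset  # assumption: plain addition, no clipping
    return out


def sweep(vector: Sequence[float], dim: str) -> Dict[int, List[float]]:
    """Build one modulated vector per offset for a single dimension."""
    return {offset: modulate(vector, dim, offset) for offset in OFFSETS}


# Hypothetical reference vector (mid-scale on every dimension).
reference = [4.0] * len(DIMENSIONS)
variants = sweep(reference, "Slow – Fast")
```

Each entry of `variants` differs from the reference in exactly one element, which mirrors one column of the sample tables below.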

Speaker1 (Female)

Reference Speech

Modulated Speech

	A) High – Low pitched	B) Masculine – Feminine	C) Clear – Hoarse	D) Calm – Restless	E) Powerful – Weak	F) Youthful – Elderly	G) Thick – Thin	H) Tense – Relaxed	I) Dark – Bright	J) Cold – Warm	K) Slow – Fast
−3
−2
−1
0
+1
+2
+3

Speaker2 (Male)

Reference Speech

Modulated Speech

	A) High – Low pitched	B) Masculine – Feminine	C) Clear – Hoarse	D) Calm – Restless	E) Powerful – Weak	F) Youthful – Elderly	G) Thick – Thin	H) Tense – Relaxed	I) Dark – Bright	J) Cold – Warm	K) Slow – Fast
−3
−2
−1
0
+1
+2
+3

Zero-shot impression control with automatic mapping from freely described impressions using an LLM

Leveraging an LLM (ChatGPT-4o) to automatically determine the impression vector for a freely described target speaking style

Prompt example (Japanese)

	# Task
	## 声の印象について、次の11個の印象を1から7の連続値で評価したデータがあります。
	## 各印象に対するスコアは，それぞれ平静時のスコアを表す，
	## 目標に示す話し方で話してもらった時のスコアを作ってください
	## 出力は，印象ごとのjson形式にしてください
	
	# 印象
	## 高い (1) – 低い (7): 4.68
	## 男性 (1) – 女性 (7): 1.45
	## かすれた (1) – 澄んだ (7): 4.16
	## 落ち着きのある (1) – 落ち着きのない (7): 3.54
	## 迫力ある (1) – 弱弱しい (7): 4.03
	## 若い (1) – 老けた (7): 3.99
	## 太い (1) – 細い (7): 3.33
	## 張りのある (1) – 張りのない (7): 4.10
	## 暗い (1) – 明るい (7): 3.84
	## 冷たい (1) – 温かい (7): 3.88
	## 遅い (1) – 速い (7): 4.20

	# 目標
	## 眠そうに

Prompt example (English, translated from Japanese)

	# Task
	## We have data evaluating the impression of a voice based on the following 11 impressions, using a continuous scale from 1 to 7.
	## The score for each impression represents the score in a neutral state.
	## Generate scores for when the speaker speaks in the target speaking style.
	## The output should be in JSON format for each impression.
	
	# Impressions
	## High (1) – Low (7): 4.68
	## Masculine (1) – Feminine (7): 1.45
	## Clear (1) – Hoarse (7): 4.16
	## Calm (1) – Restless (7): 3.54
	## Powerful (1) – Weak (7): 4.03
	## Youthful (1) – Elderly (7): 3.99
	## Thick (1) – Thin (7): 3.33
	## Tense (1) – Relaxed (7): 4.10
	## Dark (1) – Bright (7): 3.84
	## Cold (1) – Warm (7): 3.88
	## Slow (1) – Fast (7): 4.20

	# Target
	## Sleepy
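
The prompt above follows a fixed layout, so it can be assembled from a speaker's neutral scores and a free-text target. The following is a minimal sketch under that assumption: `build_prompt` and `TASK_LINES` are hypothetical names, the English wording is copied from the translated prompt (the prompt actually used was in Japanese), and no LLM call is made here.

```python
from typing import Dict, List

# Task description, copied from the translated prompt on this page.
TASK_LINES: List[str] = [
    "# Task",
    "## We have data evaluating the impression of a voice based on the "
    "following 11 impressions, using a continuous scale from 1 to 7.",
    "## The score for each impression represents the score in a neutral state.",
    "## Generate scores for when the speaker speaks in the target speaking style.",
    "## The output should be in JSON format for each impression.",
]


def build_prompt(scores: Dict[str, float], target: str) -> str:
    """Assemble the prompt text from neutral scores and a target style."""
    lines = list(TASK_LINES)
    lines += ["", "# Impressions"]
    for label, score in scores.items():  # dicts keep insertion order
        lines.append(f"## {label}: {score:.2f}")
    lines += ["", "# Target", f"## {target}"]
    return "\n".join(lines)


# Neutral scores of the male speaker, as shown in the prompt example.
neutral_scores = {
    "High (1) – Low (7)": 4.68,
    "Masculine (1) – Feminine (7)": 1.45,
    "Clear (1) – Hoarse (7)": 4.16,
    "Calm (1) – Restless (7)": 3.54,
    "Powerful (1) – Weak (7)": 4.03,
    "Youthful (1) – Elderly (7)": 3.99,
    "Thick (1) – Thin (7)": 3.33,
    "Tense (1) – Relaxed (7)": 4.10,
    "Dark (1) – Bright (7)": 3.84,
    "Cold (1) – Warm (7)": 3.88,
    "Slow (1) – Fast (7)": 4.20,
}

prompt = build_prompt(neutral_scores, "Sleepy")
```

Changing only the `target` argument (for example to "Urgent, attention-grabbing") would yield the prompt for another speaking style with the same neutral scores.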

Output from LLM (English, translated from Japanese)

no modulation (female)

	# Impressions
	## High (1) – Low (7): 2.10
	## Masculine (1) – Feminine (7): 5.31
	## Clear (1) – Hoarse (7): 5.26
	## Calm (1) – Restless (7): 4.83
	## Powerful (1) – Weak (7): 3.17
	## Youthful (1) – Elderly (7): 1.93
	## Thick (1) – Thin (7): 5.23
	## Tense (1) – Relaxed (7): 2.70
	## Dark (1) – Bright (7): 5.37
	## Cold (1) – Warm (7): 4.17
	## Slow (1) – Fast (7): 4.35

Sleepy (眠そうに)

	{
		"High (1) – Low (7)": {
			"score": 2.6
		},
		"Masculine (1) – Feminine (7)": {
			"score": 5.31
		},
		"Clear (1) – Hoarse (7)": {
			"score": 4.26
		},
		"Calm (1) – Restless (7)": {
			"score": 3.83
		},
		"Powerful (1) – Weak (7)": {
			"score": 4.17
		},
		"Youthful (1) – Elderly (7)": {
			"score": 2.43
		},
		"Thick (1) – Thin (7)": {
			"score": 5.23
		},
		"Tense (1) – Relaxed (7)": {
			"score": 4.2
		},
		"Dark (1) – Bright (7)": {
			"score": 4.37
		},
		"Cold (1) – Warm (7)": {
			"score": 4.17
		},
		"Slow (1) – Fast (7)": {
			"score": 3.35
		}
	}

Urgent, attention-grabbing (緊急で注意喚起するように)

	{
		"High (1) – Low (7)": {
			"score": 2.50
		},
		"Masculine (1) – Feminine (7)": {
			"score": 5.00
		},
		"Clear (1) – Hoarse (7)": {
			"score": 5.80
		},
		"Calm (1) – Restless (7)": {
			"score": 6.30
		},
		"Powerful (1) – Weak (7)": {
			"score": 2.50
		},
		"Youthful (1) – Elderly (7)": {
			"score": 2.50
		},
		"Thick (1) – Thin (7)": {
			"score": 5.00
		},
		"Tense (1) – Relaxed (7)": {
			"score": 3.00
		},
		"Dark (1) – Bright (7)": {
			"score": 6.00
		},
		"Cold (1) – Warm (7)": {
			"score": 4.00
		},
		"Slow (1) – Fast (7)": {
			"score": 6.00
		}
	}

Output from LLM (English, translated from Japanese)

no modulation (male)

	# Impressions
	## High (1) – Low (7): 4.68
	## Masculine (1) – Feminine (7): 1.45
	## Clear (1) – Hoarse (7): 4.16
	## Calm (1) – Restless (7): 3.54
	## Powerful (1) – Weak (7): 4.03
	## Youthful (1) – Elderly (7): 3.99
	## Thick (1) – Thin (7): 3.33
	## Tense (1) – Relaxed (7): 4.10
	## Dark (1) – Bright (7): 3.84
	## Cold (1) – Warm (7): 3.88
	## Slow (1) – Fast (7): 4.20

Sleepy (眠そうに)

	{
		"High (1) – Low (7)": {
			"score": 5.20
		},
		"Masculine (1) – Feminine (7)": {
			"score": 1.60
		},
		"Clear (1) – Hoarse (7)": {
			"score": 3.80
		},
		"Calm (1) – Restless (7)": {
			"score": 2.80
		},
		"Powerful (1) – Weak (7)": {
			"score": 5.00
		},
		"Youthful (1) – Elderly (7)": {
			"score": 4.20
		},
		"Thick (1) – Thin (7)": {
			"score": 3.20
		},
		"Tense (1) – Relaxed (7)": {
			"score": 5.00
		},
		"Dark (1) – Bright (7)": {
			"score": 3.50
		},
		"Cold (1) – Warm (7)": {
			"score": 4.00
		},
		"Slow (1) – Fast (7)": {
			"score": 3.50
		}
	}

Urgent, attention-grabbing (緊急で注意喚起するように)

	{
		"High (1) – Low (7)": {
			"score": 3.10
		},
		"Masculine (1) – Feminine (7)": {
			"score": 1.80
		},
		"Clear (1) – Hoarse (7)": {
			"score": 4.80
		},
		"Calm (1) – Restless (7)": {
			"score": 6.20
		},
		"Powerful (1) – Weak (7)": {
			"score": 2.50
		},
		"Youthful (1) – Elderly (7)": {
			"score": 3.20
		},
		"Thick (1) – Thin (7)": {
			"score": 3.50
		},
		"Tense (1) – Relaxed (7)": {
			"score": 3.00
		},
		"Dark (1) – Bright (7)": {
			"score": 3.80
		},
		"Cold (1) – Warm (7)": {
			"score": 3.40
		},
		"Slow (1) – Fast (7)": {
			"score": 5.80
		}
	}
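
To use such an output as an impression vector, the JSON has to be parsed back into an ordered list of scores. The sketch below is one way this could be done and is not taken from the authors' code: `parse_scores` is a hypothetical helper, and because LLM output is not guaranteed to be strict JSON, it strips commas that directly precede a closing brace before parsing (a simple heuristic that would also alter such a sequence inside a string value). The sample values are the male "Sleepy" output and the male no-modulation scores shown on this page.

```python
import json
import re
from typing import List

LABELS: List[str] = [
    "High (1) – Low (7)",
    "Masculine (1) – Feminine (7)",
    "Clear (1) – Hoarse (7)",
    "Calm (1) – Restless (7)",
    "Powerful (1) – Weak (7)",
    "Youthful (1) – Elderly (7)",
    "Thick (1) – Thin (7)",
    "Tense (1) – Relaxed (7)",
    "Dark (1) – Bright (7)",
    "Cold (1) – Warm (7)",
    "Slow (1) – Fast (7)",
]


def parse_scores(raw: str) -> List[float]:
    """Parse LLM output into scores ordered like LABELS."""
    # Drop commas that come right before a closing brace or bracket.
    cleaned = re.sub(r",(\s*[}\]])", r"\1", raw)
    data = json.loads(cleaned)
    scores = []
    for label in LABELS:
        value = float(data[label]["score"])
        if not 1.0 <= value <= 7.0:  # the prompt asks for a 1 to 7 scale
            raise ValueError(f"score out of range for {label!r}: {value}")
        scores.append(value)
    return scores


# Male "Sleepy" output, written with trailing commas to exercise the cleanup.
RAW = """{
  "High (1) – Low (7)": {"score": 5.20,},
  "Masculine (1) – Feminine (7)": {"score": 1.60,},
  "Clear (1) – Hoarse (7)": {"score": 3.80,},
  "Calm (1) – Restless (7)": {"score": 2.80,},
  "Powerful (1) – Weak (7)": {"score": 5.00,},
  "Youthful (1) – Elderly (7)": {"score": 4.20,},
  "Thick (1) – Thin (7)": {"score": 3.20,},
  "Tense (1) – Relaxed (7)": {"score": 5.00,},
  "Dark (1) – Bright (7)": {"score": 3.50,},
  "Cold (1) – Warm (7)": {"score": 4.00,},
  "Slow (1) – Fast (7)": {"score": 3.50,}
}"""

# Male no-modulation scores, in the same order as LABELS.
neutral = [4.68, 1.45, 4.16, 3.54, 4.03, 3.99, 3.33, 4.10, 3.84, 3.88, 4.20]

sleepy = parse_scores(RAW)
# Per-dimension shift suggested by the LLM relative to the neutral scores.
deltas = [round(s - n, 2) for s, n in zip(sleepy, neutral)]
```

Inspecting `deltas` makes the direction of each change explicit; for this example the "Slow – Fast" score moves toward "Slow" and the "Powerful – Weak" score moves toward "Weak".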

Speech Samples

Female

condition	speech
reference
no modulation
target: "sleepy"
target: "urgent"

Male

condition	speech
reference
no modulation
target: "sleepy"
target: "urgent"

Kenichi Fujita, Shota Horiguchi, Yusuke Ijima

NTT, Inc., Japan
