If you have trouble playing the audio samples, try refreshing the page.
Contents
Single dimension impression modulation on zero-shot TTS
Speech samples (female)
Speech samples (male)
Zero-shot impression control with automatic mapping from freely-described impressions using LLM
Prompt example
Speech samples
Samples
HiFi-GAN was used for all speech sample generation.
All speakers on this page are chosen from the test data, i.e., all samples are generated under the zero-shot condition.
Single dimension impression modulation on zero-shot TTS
Altering specific dimensions of the speech impression vector
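As a rough illustration of what altering one dimension could look like, the sketch below shifts a single entry of an 11-dimensional impression vector and leaves the others untouched. The function name `modulate`, the additive offset on the 1 to 7 rating scale, and the clipping to that range are assumptions made for this example; the page does not state how the levels −3 to +3 are applied internally.

```python
from typing import List

# The 11 impression dimensions listed on this page, in order A to K.
DIMENSIONS = [
    "High – Low pitched", "Masculine – Feminine", "Clear – Hoarse",
    "Calm – Restless", "Powerful – Weak", "Youthful – Elderly",
    "Thick – Thin", "Tense – Relaxed", "Dark – Bright",
    "Cold – Warm", "Slow – Fast",
]


def modulate(vector: List[float], dimension: str, offset: float,
             low: float = 1.0, high: float = 7.0) -> List[float]:
    """Return a copy of `vector` with one dimension shifted by `offset`.

    Assumption: the offset is added on the 1-7 rating scale and the
    result is clipped to that range. This is illustrative only.
    """
    if len(vector) != len(DIMENSIONS):
        raise ValueError("expected one score per impression dimension")
    index = DIMENSIONS.index(dimension)
    shifted = list(vector)  # copy so the input vector is not modified
    shifted[index] = min(high, max(low, shifted[index] + offset))
    return shifted


# Neutral scores of the male speaker, taken from the prompt example on this page.
neutral = [4.68, 1.45, 4.16, 3.54, 4.03, 3.99, 3.33, 4.10, 3.84, 3.88, 4.20]
for offset in (-3, -2, -1, 0, 1, 2, 3):
    print(offset, modulate(neutral, "Slow – Fast", offset)[-1])
```

Only the selected dimension changes, which mirrors the one-row-per-dimension layout of the samples in this section.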
Speaker1 (Female)
Reference Speech
Modulated Speech
A) High – Low pitched
B) Masculine – Feminine
C) Clear – Hoarse
D) Calm – Restless
E) Powerful – Weak
F) Youthful – Elderly
G) Thick – Thin
H) Tense – Relaxed
I) Dark – Bright
J) Cold – Warm
K) Slow – Fast
Modulation level: −3, −2, −1, 0, +1, +2, +3
Speaker2 (Male)
Reference Speech
Modulated Speech
A) High – Low pitched
B) Masculine – Feminine
C) Clear – Hoarse
D) Calm – Restless
E) Powerful – Weak
F) Youthful – Elderly
G) Thick – Thin
H) Tense – Relaxed
I) Dark – Bright
J) Cold – Warm
K) Slow – Fast
Modulation level: −3, −2, −1, 0, +1, +2, +3
Zero-shot impression control with automatic mapping from freely-described impressions using LLM
Leveraging an LLM (ChatGPT-4o) to automatically determine suitable impression vectors
Prompt example (Japanese)
# Task
## 声の印象について、次の11個の印象を1から7の連続値で評価したデータがあります。
## 各印象に対するスコアは,それぞれ平静時のスコアを表す,
## 目標に示す話し方で話してもらった時のスコアを作ってください
## 出力は,印象ごとのjson形式にしてください
# 印象
## 高い (1) – 低い (7): 4.68
## 男性 (1) – 女性 (7): 1.45
## かすれた (1) – 澄んだ (7): 4.16
## 落ち着きのある (1) – 落ち着きのない (7): 3.54
## 迫力ある (1) – 弱弱しい (7): 4.03
## 若い (1) – 老けた (7): 3.99
## 太い (1) – 細い (7): 3.33
## 張りのある (1) – 張りのない (7): 4.10
## 暗い (1) – 明るい (7): 3.84
## 冷たい (1) – 温かい (7): 3.88
## 遅い (1) – 速い (7): 4.20
# 目標
## 眠そうに
Prompt example (English translated from Japanese)
# Task
## We have data evaluating the impression of a voice based on the following 11 impressions, using a continuous scale from 1 to 7.
## The score for each impression represents the score in a neutral state.
## Generate scores for when the speaker speaks in the target speaking style.
## The output should be in JSON format for each impression.
# Impressions
## High (1) – Low (7): 4.68
## Masculine (1) – Feminine (7): 1.45
## Clear (1) – Hoarse (7): 4.16
## Calm (1) – Restless (7): 3.54
## Powerful (1) – Weak (7): 4.03
## Youthful (1) – Elderly (7): 3.99
## Thick (1) – Thin (7): 3.33
## Tense (1) – Relaxed (7): 4.10
## Dark (1) – Bright (7): 3.84
## Cold (1) – Warm (7): 3.88
## Slow (1) – Fast (7): 4.20
# Target
## Sleepy
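The prompt above follows a fixed template: a task description, the speaker's neutral scores, and a free-form target style. A minimal sketch of assembling such a prompt programmatically is shown below. The helper name `build_prompt` and the constant names are hypothetical, and the text is the English translation, whereas the page indicates the Japanese version was the one given to the LLM.

```python
from typing import Dict

# Task description, copied from the English prompt example.
TASK_LINES = [
    "## We have data evaluating the impression of a voice based on the "
    "following 11 impressions, using a continuous scale from 1 to 7.",
    "## The score for each impression represents the score in a neutral state.",
    "## Generate scores for when the speaker speaks in the target speaking style.",
    "## The output should be in JSON format for each impression.",
]

# Neutral scores of the speaker used in the prompt example.
NEUTRAL_SCORES: Dict[str, float] = {
    "High (1) – Low (7)": 4.68,
    "Masculine (1) – Feminine (7)": 1.45,
    "Clear (1) – Hoarse (7)": 4.16,
    "Calm (1) – Restless (7)": 3.54,
    "Powerful (1) – Weak (7)": 4.03,
    "Youthful (1) – Elderly (7)": 3.99,
    "Thick (1) – Thin (7)": 3.33,
    "Tense (1) – Relaxed (7)": 4.10,
    "Dark (1) – Bright (7)": 3.84,
    "Cold (1) – Warm (7)": 3.88,
    "Slow (1) – Fast (7)": 4.20,
}


def build_prompt(scores: Dict[str, float], target: str) -> str:
    """Assemble the prompt text from neutral scores and a target style."""
    lines = ["# Task", *TASK_LINES, "# Impressions"]
    # One line per impression, scores printed with two decimals.
    lines += [f"## {label}: {value:.2f}" for label, value in scores.items()]
    lines += ["# Target", f"## {target}"]
    return "\n".join(lines)


print(build_prompt(NEUTRAL_SCORES, "Sleepy"))
```

Changing only the `target` argument (for example to "Urgent, attention grabbing") reuses the same speaker scores with a different style description.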
Output from LLM (English translated from Japanese)
no modulation (female)
# Impressions
## High (1) – Low (7): 2.10
## Masculine (1) – Feminine (7): 5.31
## Clear (1) – Hoarse (7): 5.26
## Calm (1) – Restless (7): 4.83
## Powerful (1) – Weak (7): 3.17
## Youthful (1) – Elderly (7): 1.93
## Thick (1) – Thin (7): 5.23
## Tense (1) – Relaxed (7): 2.70
## Dark (1) – Bright (7): 5.37
## Cold (1) – Warm (7): 4.17
## Slow (1) – Fast (7): 4.35
Sleepy (眠そうに)
{
  "High (1) – Low (7)": {"score": 2.6},
  "Masculine (1) – Feminine (7)": {"score": 5.31},
  "Clear (1) – Hoarse (7)": {"score": 4.26},
  "Calm (1) – Restless (7)": {"score": 3.83},
  "Powerful (1) – Weak (7)": {"score": 4.17},
  "Youthful (1) – Elderly (7)": {"score": 2.43},
  "Thick (1) – Thin (7)": {"score": 5.23},
  "Tense (1) – Relaxed (7)": {"score": 4.2},
  "Dark (1) – Bright (7)": {"score": 4.37},
  "Cold (1) – Warm (7)": {"score": 4.17},
  "Slow (1) – Fast (7)": {"score": 3.35}
}
Urgent, attention grabbing (緊急で注意喚起するように)
{
  "High (1) – Low (7)": {"score": 2.50},
  "Masculine (1) – Feminine (7)": {"score": 5.00},
  "Clear (1) – Hoarse (7)": {"score": 5.80},
  "Calm (1) – Restless (7)": {"score": 6.30},
  "Powerful (1) – Weak (7)": {"score": 2.50},
  "Youthful (1) – Elderly (7)": {"score": 2.50},
  "Thick (1) – Thin (7)": {"score": 5.00},
  "Tense (1) – Relaxed (7)": {"score": 3.00},
  "Dark (1) – Bright (7)": {"score": 6.00},
  "Cold (1) – Warm (7)": {"score": 4.00},
  "Slow (1) – Fast (7)": {"score": 6.00}
}
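To see which dimensions the LLM actually moved, the unmodulated female scores can be compared with the "Sleepy" output listed above. The short sketch below computes the per-dimension difference; the variable names are illustrative, and the numbers are copied from this page.

```python
# Dimension labels in the order used on this page.
labels = [
    "High – Low", "Masculine – Feminine", "Clear – Hoarse",
    "Calm – Restless", "Powerful – Weak", "Youthful – Elderly",
    "Thick – Thin", "Tense – Relaxed", "Dark – Bright",
    "Cold – Warm", "Slow – Fast",
]

# Female speaker: scores without modulation and LLM output for "Sleepy".
neutral = [2.10, 5.31, 5.26, 4.83, 3.17, 1.93, 5.23, 2.70, 5.37, 4.17, 4.35]
sleepy = [2.6, 5.31, 4.26, 3.83, 4.17, 2.43, 5.23, 4.2, 4.37, 4.17, 3.35]

# Difference (target minus neutral), rounded to hide floating-point noise.
deltas = {
    label: round(target - base, 2)
    for label, base, target in zip(labels, neutral, sleepy)
}

# Report only the dimensions that changed.
for label, delta in deltas.items():
    if delta != 0:
        print(f"{label}: {delta:+.2f}")
```

A positive value moves the score toward the right-hand label of the pair (for example, toward "Relaxed"), and a negative value toward the left-hand label.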
Output from LLM (English translated from Japanese)
no modulation (male)
# Impressions
## High (1) – Low (7): 4.68
## Masculine (1) – Feminine (7): 1.45
## Clear (1) – Hoarse (7): 4.16
## Calm (1) – Restless (7): 3.54
## Powerful (1) – Weak (7): 4.03
## Youthful (1) – Elderly (7): 3.99
## Thick (1) – Thin (7): 3.33
## Tense (1) – Relaxed (7): 4.10
## Dark (1) – Bright (7): 3.84
## Cold (1) – Warm (7): 3.88
## Slow (1) – Fast (7): 4.20
Sleepy (眠そうに)
{
  "High (1) – Low (7)": {"score": 5.20},
  "Masculine (1) – Feminine (7)": {"score": 1.60},
  "Clear (1) – Hoarse (7)": {"score": 3.80},
  "Calm (1) – Restless (7)": {"score": 2.80},
  "Powerful (1) – Weak (7)": {"score": 5.00},
  "Youthful (1) – Elderly (7)": {"score": 4.20},
  "Thick (1) – Thin (7)": {"score": 3.20},
  "Tense (1) – Relaxed (7)": {"score": 5.00},
  "Dark (1) – Bright (7)": {"score": 3.50},
  "Cold (1) – Warm (7)": {"score": 4.00},
  "Slow (1) – Fast (7)": {"score": 3.50}
}
Urgent, attention grabbing (緊急で注意喚起するように)
{
  "High (1) – Low (7)": {"score": 3.10},
  "Masculine (1) – Feminine (7)": {"score": 1.80},
  "Clear (1) – Hoarse (7)": {"score": 4.80},
  "Calm (1) – Restless (7)": {"score": 6.20},
  "Powerful (1) – Weak (7)": {"score": 2.50},
  "Youthful (1) – Elderly (7)": {"score": 3.20},
  "Thick (1) – Thin (7)": {"score": 3.50},
  "Tense (1) – Relaxed (7)": {"score": 3.00},
  "Dark (1) – Bright (7)": {"score": 3.80},
  "Cold (1) – Warm (7)": {"score": 3.40},
  "Slow (1) – Fast (7)": {"score": 5.80}
}
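Before scores like these can be used as an impression vector, the LLM reply has to be parsed and checked. The sketch below is one possible way to do that with the Python standard library. The function name `parse_scores` is hypothetical, and the trailing-comma cleanup is an assumption that LLM replies are not always strict JSON; the simple regular expression would also alter a comma that sits directly before a closing brace inside a string value, which does not occur in replies of this shape.

```python
import json
import re
from typing import Dict


def parse_scores(reply: str, low: float = 1.0, high: float = 7.0) -> Dict[str, float]:
    """Parse an LLM reply of the form {"label": {"score": x}, ...}."""
    # Drop any comma that directly precedes a closing brace or bracket,
    # since strict JSON does not allow trailing commas.
    cleaned = re.sub(r",\s*([}\]])", r"\1", reply)
    data = json.loads(cleaned)
    scores = {label: float(entry["score"]) for label, entry in data.items()}
    # Reject values outside the 1-7 rating scale used in the prompt.
    for label, value in scores.items():
        if not low <= value <= high:
            raise ValueError(f"score out of range for {label!r}: {value}")
    return scores


# Two entries in the style of the outputs above, with trailing commas.
reply = """
{
  "Calm (1) – Restless (7)": {
    "score": 6.20,
  },
  "Slow (1) – Fast (7)": {
    "score": 5.80,
  }
}
"""
scores = parse_scores(reply)
print(scores)
```

The resulting dictionary preserves the order of the keys in the reply, so its values can be read off in order to form the vector.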
Speech Samples
Female
condition
speech
reference
no modulation
target: "sleepy"
target: "urgent"
Male
condition
speech
reference
no modulation
target: "sleepy"
target: "urgent"