Voice Impression Control in Zero-Shot TTS

Kenichi Fujita, Shota Horiguchi, Yusuke Ijima

NTT, Inc., Japan

  • Contents

  • Single dimension impression modulation on zero-shot TTS
  • Speech samples (female)
  • Speech samples (male)
  • Zero-shot impression control with automatic mapping from freely-described impressions using LLM
  • Prompt example
  • Speech samples
  • Samples

  • All speech samples were generated with HiFi-GAN.
  • All speakers on this page were chosen from the test data, i.e., every sample was generated under the zero-shot condition.
  • Single dimension impression modulation on zero-shot TTS

  • Altering specific dimensions of the speech impression vector
  • Speaker1 (Female)

  • Reference Speech
  • Modulated Speech
  • A) High – Low pitched B) Masculine – Feminine C) Clear – Hoarse D) Calm – Restless E) Powerful – Weak F) Youthful – Elderly G) Thick – Thin H) Tense – Relaxed I) Dark – Bright J) Cold – Warm K) Slow – Fast
    Modulation levels: −3, −2, −1, 0, +1, +2, +3

  • Speaker2 (Male)

  • Reference Speech
  • Modulated Speech
  • A) High – Low pitched B) Masculine – Feminine C) Clear – Hoarse D) Calm – Restless E) Powerful – Weak F) Youthful – Elderly G) Thick – Thin H) Tense – Relaxed I) Dark – Bright J) Cold – Warm K) Slow – Fast
    Modulation levels: −3, −2, −1, 0, +1, +2, +3
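The single-dimension modulation above can be sketched as changing one entry of an 11-dimensional impression vector while leaving the others fixed. This is a minimal illustration, not the authors' implementation: the function name `modulate` is hypothetical, the offset is assumed to be additive, and the page does not state the unit of the −3 to +3 levels. The example vector is the neutral male vector quoted in the prompt example on this page.

```python
from typing import Dict, List

# The 11 impression dimensions listed on this page (A to K).
DIMENSIONS = [
    "High – Low pitched", "Masculine – Feminine", "Clear – Hoarse",
    "Calm – Restless", "Powerful – Weak", "Youthful – Elderly",
    "Thick – Thin", "Tense – Relaxed", "Dark – Bright",
    "Cold – Warm", "Slow – Fast",
]


def modulate(vector: List[float], dimension: str, offset: float) -> List[float]:
    """Return a copy of `vector` with `offset` added to a single dimension.

    Assumption: modulation is a plain additive shift; the actual system
    may scale or normalize the offset differently.
    """
    if len(vector) != len(DIMENSIONS):
        raise ValueError(f"expected {len(DIMENSIONS)} values, got {len(vector)}")
    index = DIMENSIONS.index(dimension)  # raises ValueError for unknown names
    out = list(vector)
    out[index] += offset
    return out


# Neutral male vector from the prompt example on this page.
neutral = [4.68, 1.45, 4.16, 3.54, 4.03, 3.99, 3.33, 4.10, 3.84, 3.88, 4.20]

# Sweep the "Slow – Fast" dimension over the seven levels used in the samples.
sweep: Dict[int, List[float]] = {
    level: modulate(neutral, "Slow – Fast", level) for level in range(-3, 4)
}
```

Each entry of `sweep` would then be passed to the TTS model as the impression vector for one sample; level 0 reproduces the unmodified vector.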

  • Zero-shot impression control with automatic mapping from freely-described impressions using LLM

  • An LLM (ChatGPT-4o) is used to automatically determine the optimal impression vector.
  • Prompt example (Japanese)

    	# Task
    	## 声の印象について、次の11個の印象を1から7の連続値で評価したデータがあります。
    	## 各印象に対するスコアは,それぞれ平静時のスコアを表す,
    	## 目標に示す話し方で話してもらった時のスコアを作ってください
    	## 出力は,印象ごとのjson形式にしてください
    	
    	# 印象
    	## 高い (1) – 低い (7): 4.68
    	## 男性 (1) – 女性 (7): 1.45
    	## かすれた (1) – 澄んだ (7): 4.16
    	## 落ち着きのある (1) – 落ち着きのない (7): 3.54
    	## 迫力ある (1) – 弱弱しい (7): 4.03
    	## 若い (1) – 老けた (7): 3.99
    	## 太い (1) – 細い (7): 3.33
    	## 張りのある (1) – 張りのない (7): 4.10
    	## 暗い (1) – 明るい (7): 3.84
    	## 冷たい (1) – 温かい (7): 3.88
    	## 遅い (1) – 速い (7): 4.20
    
    	# 目標
    	## 眠そうに
    
    

  • Prompt example (English, translated from Japanese)

    	# Task
    	## We have data evaluating the impression of a voice based on the following 11 impressions, using a continuous scale from 1 to 7.
    	## The score for each impression represents the score in a neutral state.
    	## Generate scores for when the speaker speaks in the target speaking style.
    	## The output should be in JSON format for each impression.
    	
    	# Impressions
    	## High (1) – Low (7): 4.68
    	## Masculine (1) – Feminine (7): 1.45
    	## Clear (1) – Hoarse (7): 4.16
    	## Calm (1) – Restless (7): 3.54
    	## Powerful (1) – Weak (7): 4.03
    	## Youthful (1) – Elderly (7): 3.99
    	## Thick (1) – Thin (7): 3.33
    	## Tense (1) – Relaxed (7): 4.10
    	## Dark (1) – Bright (7): 3.84
    	## Cold (1) – Warm (7): 3.88
    	## Slow (1) – Fast (7): 4.20
    
    	# Target
    	## Sleepy
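
The prompt above has a fixed task description plus two variable parts: the speaker's neutral-state scores and the target speaking style. A minimal sketch of assembling it programmatically follows. The names `TEMPLATE` and `build_prompt` and the `{impressions}` / `{target}` placeholders are hypothetical, introduced only for illustration; the page does not show how the authors built or sent the prompt, so no LLM call is included.

```python
from typing import Dict

# English rendering of the prompt shown above. `{impressions}` and `{target}`
# are hypothetical placeholders introduced for this sketch.
TEMPLATE = """# Task
## We have data evaluating the impression of a voice based on the following 11 impressions, using a continuous scale from 1 to 7.
## The score for each impression represents the score in a neutral state.
## Generate scores for when the speaker speaks in the target speaking style.
## The output should be in JSON format for each impression.

# Impressions
{impressions}

# Target
## {target}
"""


def build_prompt(neutral_scores: Dict[str, float], target: str) -> str:
    """Fill the template with neutral-state scores and a target style."""
    lines = [f"## {label}: {score:.2f}" for label, score in neutral_scores.items()]
    return TEMPLATE.format(impressions="\n".join(lines), target=target)


# Neutral scores of the male speaker, as listed in the prompt example.
neutral_male = {
    "High (1) – Low (7)": 4.68,
    "Masculine (1) – Feminine (7)": 1.45,
    "Clear (1) – Hoarse (7)": 4.16,
    "Calm (1) – Restless (7)": 3.54,
    "Powerful (1) – Weak (7)": 4.03,
    "Youthful (1) – Elderly (7)": 3.99,
    "Thick (1) – Thin (7)": 3.33,
    "Tense (1) – Relaxed (7)": 4.10,
    "Dark (1) – Bright (7)": 3.84,
    "Cold (1) – Warm (7)": 3.88,
    "Slow (1) – Fast (7)": 4.20,
}

prompt = build_prompt(neutral_male, "Sleepy")
```

Only the `# Impressions` and `# Target` sections change between speakers and styles, so the same template can be reused for every speaker and target style.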
    
    

  • Output from LLM (English, translated from Japanese)

  • no modulation (female)
  • 	# Impressions
    	## High (1) – Low (7): 2.10
    	## Masculine (1) – Feminine (7): 5.31
    	## Clear (1) – Hoarse (7): 5.26
    	## Calm (1) – Restless (7): 4.83
    	## Powerful (1) – Weak (7): 3.17
    	## Youthful (1) – Elderly (7): 1.93
    	## Thick (1) – Thin (7): 5.23
    	## Tense (1) – Relaxed (7): 2.70
    	## Dark (1) – Bright (7): 5.37
    	## Cold (1) – Warm (7): 4.17
    	## Slow (1) – Fast (7): 4.35
    
  • Sleepy (眠そうに)
  • 	{
    		"High (1) – Low (7)": {
    			"score": 2.6
    		},
    		"Masculine (1) – Feminine (7)": {
    			"score": 5.31
    		},
    		"Clear (1) – Hoarse (7)": {
    			"score": 4.26
    		},
    		"Calm (1) – Restless (7)": {
    			"score": 3.83
    		},
    		"Powerful (1) – Weak (7)": {
    			"score": 4.17
    		},
    		"Youthful (1) – Elderly (7)": {
    			"score": 2.43
    		},
    		"Thick (1) – Thin (7)": {
    			"score": 5.23
    		},
    		"Tense (1) – Relaxed (7)": {
    			"score": 4.2
    		},
    		"Dark (1) – Bright (7)": {
    			"score": 4.37
    		},
    		"Cold (1) – Warm (7)": {
    			"score": 4.17
    		},
    		"Slow (1) – Fast (7)": {
    			"score": 3.35
    		}
    	}
    
  • Urgent, attention-grabbing (緊急で注意喚起するように)
  • 	{
    		"High (1) – Low (7)": {
    			"score": 2.50
    		},
    		"Masculine (1) – Feminine (7)": {
    			"score": 5.00
    		},
    		"Clear (1) – Hoarse (7)": {
    			"score": 5.80
    		},
    		"Calm (1) – Restless (7)": {
    			"score": 6.30
    		},
    		"Powerful (1) – Weak (7)": {
    			"score": 2.50
    		},
    		"Youthful (1) – Elderly (7)": {
    			"score": 2.50
    		},
    		"Thick (1) – Thin (7)": {
    			"score": 5.00
    		},
    		"Tense (1) – Relaxed (7)": {
    			"score": 3.00
    		},
    		"Dark (1) – Bright (7)": {
    			"score": 6.00
    		},
    		"Cold (1) – Warm (7)": {
    			"score": 4.00
    		},
    		"Slow (1) – Fast (7)": {
    			"score": 6.00
    		}
    	}
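
To see what the LLM actually changed, it can help to subtract the unmodulated vector from the returned one. The sketch below does this for the female speaker's "sleepy" target, using only the scores listed above; the variable names are illustrative, and the page does not say whether the authors performed such a comparison.

```python
# Scores for the female speaker, copied from the listings above.
neutral = {
    "High (1) – Low (7)": 2.10,
    "Masculine (1) – Feminine (7)": 5.31,
    "Clear (1) – Hoarse (7)": 5.26,
    "Calm (1) – Restless (7)": 4.83,
    "Powerful (1) – Weak (7)": 3.17,
    "Youthful (1) – Elderly (7)": 1.93,
    "Thick (1) – Thin (7)": 5.23,
    "Tense (1) – Relaxed (7)": 2.70,
    "Dark (1) – Bright (7)": 5.37,
    "Cold (1) – Warm (7)": 4.17,
    "Slow (1) – Fast (7)": 4.35,
}
sleepy = {
    "High (1) – Low (7)": 2.6,
    "Masculine (1) – Feminine (7)": 5.31,
    "Clear (1) – Hoarse (7)": 4.26,
    "Calm (1) – Restless (7)": 3.83,
    "Powerful (1) – Weak (7)": 4.17,
    "Youthful (1) – Elderly (7)": 2.43,
    "Thick (1) – Thin (7)": 5.23,
    "Tense (1) – Relaxed (7)": 4.2,
    "Dark (1) – Bright (7)": 4.37,
    "Cold (1) – Warm (7)": 4.17,
    "Slow (1) – Fast (7)": 3.35,
}

# Per-dimension shift proposed by the LLM, rounded to hide float noise.
delta = {label: round(sleepy[label] - neutral[label], 2) for label in neutral}

# Keep only the dimensions the LLM actually moved.
changed = {label: shift for label, shift in delta.items() if shift != 0}

for label, shift in changed.items():
    print(f"{label}: {shift:+.2f}")
```

On these numbers, eight of the eleven dimensions move (for example, Tense – Relaxed shifts by +1.50 and Slow – Fast by −1.00), while Masculine – Feminine, Thick – Thin, and Cold – Warm are left unchanged.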
    
    

  • Output from LLM (English, translated from Japanese)

  • no modulation (male)
  • 	# Impressions
    	## High (1) – Low (7): 4.68
    	## Masculine (1) – Feminine (7): 1.45
    	## Clear (1) – Hoarse (7): 4.16
    	## Calm (1) – Restless (7): 3.54
    	## Powerful (1) – Weak (7): 4.03
    	## Youthful (1) – Elderly (7): 3.99
    	## Thick (1) – Thin (7): 3.33
    	## Tense (1) – Relaxed (7): 4.10
    	## Dark (1) – Bright (7): 3.84
    	## Cold (1) – Warm (7): 3.88
    	## Slow (1) – Fast (7): 4.20
    
  • Sleepy (眠そうに)
  • 	{
    		"High (1) – Low (7)": {
    			"score": 5.20
    		},
    		"Masculine (1) – Feminine (7)": {
    			"score": 1.60
    		},
    		"Clear (1) – Hoarse (7)": {
    			"score": 3.80
    		},
    		"Calm (1) – Restless (7)": {
    			"score": 2.80
    		},
    		"Powerful (1) – Weak (7)": {
    			"score": 5.00
    		},
    		"Youthful (1) – Elderly (7)": {
    			"score": 4.20
    		},
    		"Thick (1) – Thin (7)": {
    			"score": 3.20
    		},
    		"Tense (1) – Relaxed (7)": {
    			"score": 5.00
    		},
    		"Dark (1) – Bright (7)": {
    			"score": 3.50
    		},
    		"Cold (1) – Warm (7)": {
    			"score": 4.00
    		},
    		"Slow (1) – Fast (7)": {
    			"score": 3.50
    		}
    	}
    
  • Urgent, attention-grabbing (緊急で注意喚起するように)
  • 	{
    		"High (1) – Low (7)": {
    			"score": 3.10
    		},
    		"Masculine (1) – Feminine (7)": {
    			"score": 1.80
    		},
    		"Clear (1) – Hoarse (7)": {
    			"score": 4.80
    		},
    		"Calm (1) – Restless (7)": {
    			"score": 6.20
    		},
    		"Powerful (1) – Weak (7)": {
    			"score": 2.50
    		},
    		"Youthful (1) – Elderly (7)": {
    			"score": 3.20
    		},
    		"Thick (1) – Thin (7)": {
    			"score": 3.50
    		},
    		"Tense (1) – Relaxed (7)": {
    			"score": 3.00
    		},
    		"Dark (1) – Bright (7)": {
    			"score": 3.80
    		},
    		"Cold (1) – Warm (7)": {
    			"score": 3.40
    		},
    		"Slow (1) – Fast (7)": {
    			"score": 5.80
    		}
    	}
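
Before an LLM reply of this shape can drive the TTS model, it has to be turned back into an ordered 11-dimensional vector. The following is a minimal sketch under stated assumptions: `parse_scores` and `ORDER` are hypothetical names, the dimension order is assumed to follow the listing order on this page, and the trailing-comma cleanup is a defensive step (LLM replies are not always strict JSON), not something the page describes. The sample reply reuses the male "urgent" scores from above.

```python
import json
import re
from typing import List

# Assumed vector order: the order in which the dimensions are listed on this page.
ORDER = [
    "High (1) – Low (7)", "Masculine (1) – Feminine (7)",
    "Clear (1) – Hoarse (7)", "Calm (1) – Restless (7)",
    "Powerful (1) – Weak (7)", "Youthful (1) – Elderly (7)",
    "Thick (1) – Thin (7)", "Tense (1) – Relaxed (7)",
    "Dark (1) – Bright (7)", "Cold (1) – Warm (7)",
    "Slow (1) – Fast (7)",
]


def parse_scores(raw: str) -> List[float]:
    """Parse {"label": {"score": x}, ...} into a list ordered like ORDER."""
    # Remove commas that directly precede a closing brace or bracket.
    # This simple regex is safe here because no label contains ",}" or ",]".
    cleaned = re.sub(r",\s*([}\]])", r"\1", raw)
    data = json.loads(cleaned)
    scores = []
    for label in ORDER:
        score = float(data[label]["score"])  # KeyError if a dimension is missing
        if not 1.0 <= score <= 7.0:  # the prompt specifies a 1-to-7 scale
            raise ValueError(f"{label}: score {score} is outside 1 to 7")
        scores.append(score)
    return scores


# Sample reply (male speaker, "urgent" target), written with trailing commas
# to exercise the cleanup step.
reply = """{
    "High (1) – Low (7)": {"score": 3.10,},
    "Masculine (1) – Feminine (7)": {"score": 1.80,},
    "Clear (1) – Hoarse (7)": {"score": 4.80,},
    "Calm (1) – Restless (7)": {"score": 6.20,},
    "Powerful (1) – Weak (7)": {"score": 2.50,},
    "Youthful (1) – Elderly (7)": {"score": 3.20,},
    "Thick (1) – Thin (7)": {"score": 3.50,},
    "Tense (1) – Relaxed (7)": {"score": 3.00,},
    "Dark (1) – Bright (7)": {"score": 3.80,},
    "Cold (1) – Warm (7)": {"score": 3.40,},
    "Slow (1) – Fast (7)": {"score": 5.80}
}"""

vector = parse_scores(reply)
```

Validating the range and requiring every label up front means a malformed or incomplete reply fails loudly instead of silently producing a vector with missing or out-of-scale dimensions.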
    
    

  • Speech samples

  • Female
  • Conditions (one speech sample each)
    reference
    no modulation
    target: "sleepy"
    target: "urgent"
  • Male
  • Conditions (one speech sample each)
    reference
    no modulation
    target: "sleepy"
    target: "urgent"