StyleCap: Automatic Speaking-Style Captioning from Speech
Based on Speech and Language Self-supervised Learning Models

Kazuki Yamauchi*, Yusuke Ijima, Yuki Saito*
*The University of Tokyo, Japan, NTT Corporation, Japan.

Abstract

We propose StyleCap, a method for generating natural language descriptions of the speaking styles that appear in speech. Most conventional techniques for para-/non-linguistic information recognition focus on category classification or intensity estimation of pre-defined labels, and they cannot explain the reasoning behind a recognition result in an interpretable manner. As a first step towards an end-to-end method for generating speaking-style prompts from speech, i.e., automatic speaking-style captioning, StyleCap uses paired data of speech and natural language descriptions to train neural networks that predict, from a speech representation vector, the prefix vectors fed into a large language model (LLM)-based text decoder. We explore text decoders and speech feature representations suitable for this new task. The experimental results demonstrate that StyleCap, leveraging richer LLMs for the text decoder, speech self-supervised learning (SSL) features, and sentence rephrasing augmentation, improves the accuracy and diversity of generated speaking-style captions.
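As a rough illustration of the prefix idea in the abstract, the sketch below projects one speech representation vector to a short sequence of prefix vectors and prepends them to caption token embeddings. Everything here is an assumption made for illustration: the dimensions, the single linear layer standing in for the mapping network, the random stand-in inputs, and the function names `make_mapper`, `speech_to_prefix`, and `build_decoder_input`. It is not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
SPEECH_DIM = 16   # size of the speech representation vector
PREFIX_LEN = 4    # number of prefix vectors fed to the text decoder
EMBED_DIM = 8     # embedding size of the text decoder


def make_mapper(speech_dim, prefix_len, embed_dim):
    """Create weights for one linear layer standing in for the mapping network."""
    weight = rng.normal(scale=0.02, size=(speech_dim, prefix_len * embed_dim))
    bias = np.zeros(prefix_len * embed_dim)
    return weight, bias


def speech_to_prefix(speech_vec, weight, bias, prefix_len, embed_dim):
    """Project one speech vector to `prefix_len` vectors in the decoder's embedding space."""
    flat = speech_vec @ weight + bias
    return flat.reshape(prefix_len, embed_dim)


def build_decoder_input(prefix, caption_embeddings):
    """Prepend the prefix vectors to the caption token embeddings."""
    return np.concatenate([prefix, caption_embeddings], axis=0)


weight, bias = make_mapper(SPEECH_DIM, PREFIX_LEN, EMBED_DIM)
speech_vec = rng.normal(size=SPEECH_DIM)        # stand-in for a speech feature vector
caption_emb = rng.normal(size=(6, EMBED_DIM))   # stand-in for 6 caption token embeddings

prefix = speech_to_prefix(speech_vec, weight, bias, PREFIX_LEN, EMBED_DIM)
decoder_input = build_decoder_input(prefix, caption_emb)
print(decoder_input.shape)  # (10, 8): 4 prefix vectors followed by 6 token embeddings
```

In a trained system the mapper's weights would be learned from the paired speech and caption data, and the decoder would generate the caption conditioned on the prefix positions.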

Content

1. Dataset

We used PromptSpeech [1], which includes natural language instructions (style prompts) describing various speaking styles. This dataset was constructed for PromptTTS [1], which can synthesize speech in accordance with a style prompt. We used the training subset of the real version of PromptSpeech, which includes 26,588 human-annotated style prompts for multi-speaker speech samples from LibriTTS [2]. We divided the 26,588 prompts into training (24,953 prompts from 1,113 speakers), development (857 from 40), and test (778 from 38) subsets and paired them with the corresponding speech samples from LibriTTS.
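The page does not state how the subsets were drawn; the authoritative split is the file id list provided with the data. Purely as an illustration, the sketch below shows one way to build a split in which no speaker appears in more than one subset, which is an assumption suggested by the per-subset speaker counts. The function name `split_by_speaker`, the `(speaker_id, utterance_id)` pair format, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_speaker(utterances, n_dev_speakers, n_test_speakers, seed=0):
    """Split (speaker_id, utterance_id) pairs so that no speaker spans two subsets."""
    by_speaker = defaultdict(list)
    for speaker_id, utt_id in utterances:
        by_speaker[speaker_id].append(utt_id)

    # Shuffle speakers deterministically, then carve off dev and test speakers.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    dev_spk = set(speakers[:n_dev_speakers])
    test_spk = set(speakers[n_dev_speakers:n_dev_speakers + n_test_speakers])

    subsets = {"train": [], "dev": [], "test": []}
    for speaker_id, utt_ids in by_speaker.items():
        if speaker_id in dev_spk:
            name = "dev"
        elif speaker_id in test_spk:
            name = "test"
        else:
            name = "train"
        subsets[name].extend((speaker_id, utt_id) for utt_id in utt_ids)
    return subsets


# Toy data: 10 hypothetical speakers with 3 utterances each.
toy = [(f"spk{s}", f"spk{s}_utt{u}") for s in range(10) for u in range(3)]
subsets = split_by_speaker(toy, n_dev_speakers=2, n_test_speakers=2)
print({name: len(items) for name, items in subsets.items()})
```

To reproduce the experiments, use the provided file id list instead of re-drawing a split.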

The related data are available for download from the following links.
PromptSpeech
LibriTTS corpus
train_dev_test_set.zip (Each subset's file id list)

2. Data augmentation by rephrasing sentences using an LLM

We introduce sentence rephrasing augmentation using an LLM to increase the diversity of speaking-style descriptions in the training data. Specifically, we first ask the pre-trained Llama 2-Chat (7B) [3], which is an LLM optimized for dialogue use cases, to generate five sentences from one given sentence on the basis of the following prompt:

"Rewrite the following sentence that describes someone's style of speaking in a different way, keeping the meaning of the original description. Original Description: [subject to be rephrased]."

Then, we select one sentence from the five rephrased sentences on the basis of their BERTScore [4] values: i.e., we pick the sentence with the highest score if that score is higher than 0.80.
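The two steps above, filling the prompt and then choosing among the rephrased candidates, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the names `build_rephrase_prompt`, `select_rephrasing`, and `score_fn` are hypothetical, the LLM call itself is omitted, and the scorer is passed in as a function so that the example runs without any model. The text does not say which BERTScore component (precision, recall, or F1) is thresholded, nor what happens when no candidate exceeds 0.80; here the function returns `None` in that case, i.e., no rephrased sentence is selected.

```python
PROMPT_TEMPLATE = (
    "Rewrite the following sentence that describes someone's style of speaking "
    "in a different way, keeping the meaning of the original description. "
    "Original Description: {description}."
)


def build_rephrase_prompt(description):
    """Fill the rephrasing prompt with one style description."""
    # Drop a trailing period so the template's own period is not doubled.
    return PROMPT_TEMPLATE.format(description=description.strip().rstrip("."))


def select_rephrasing(original, candidates, score_fn, threshold=0.80):
    """Return the highest-scoring candidate if its score exceeds `threshold`, else None."""
    if not candidates:
        return None
    scored = [(score_fn(candidate, original), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score > threshold else None


original = "A man talking slowly, but the sound is high."
print(build_rephrase_prompt(original))

# Made-up scores for illustration only; real values would come from BERTScore.
fake_scores = {
    "A man speaking at a measured pace, with a soaring tone.": 0.91,
    "A guy talks.": 0.62,
}
picked = select_rephrasing(
    original,
    list(fake_scores),
    score_fn=lambda candidate, reference: fake_scores[candidate],
)
print(picked)
```

In practice `score_fn` could wrap a BERTScore implementation that compares each candidate against the original description.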

Samples of sentence rephrasing

Original caption | Rewritten caption
His sound height is normal, but the speed is very fast, and the volume is very low. | His voice is normal in tone, but he speaks with a rapid pace and a soft volume.
A man talking slowly, but the sound is high. | A man speaking at a measured pace, with a soaring tone.
A woman uses a relatively low tone and fast speaking speed and say loudly. | A woman speaks with a mellow tone and rapid-fire delivery, her words tumbling out quickly.
A woman's voice is so quiet, but the tone is very high. | A woman's voice is soft, yet her pitch is incredibly high.
The speaking speed of women is very slow, but her voice is very sharp and her sound is quiet. | Her speaking style is slow and deliberate, but her voice is sharp and her tone is gentle.
She has a normal volume, but she speaks very slowly her sound height is really high. | She has a standard volume, but her speech is glacially paced.
A boy speaks very slowly, but his tone is very high. | The boy speaks at a deliberate pace, but his voice is pitched at an unusual height.
Please generate a man with a loud sound, but a very low tone to speak. | The man has a booming voice, but speaks in a mellow tone.
Loud volume, masculine, high sound, but his speed is no different from ordinary people. | Loud pitch, rugged tone, same velocity as average folks.
Ask a lady to talk loudly at a relatively high tone and very slow speaking speed. | Ask a lady to speak animatedly at a slightly higher pitch and very slow speech rate.

3. Samples of generated captions

This section presents captions generated for test-set speech samples by each of the comparison models described in Section 3.2 of our paper. The table shows the outputs of each model when trained without and with sentence rephrasing augmentation.

Model (speech feature) | Speech 1 | Speech 2 | Speech 3 | Speech 4 | Speech 5 | Speech 6 | Speech 7 | Speech 8 | Speech 9 | Speech 10
Ground truth caption | His tone is so high, the volume is very large, and he speak very fast. | Low volume, but her speed is no different from ordinary people. | When a man is very loud but the tone is very low. | A man speaks average, a normal tone general tone. | Let a girl with a sharp tone fast but the sound is very quiet. | Her voice is big, but the speed of speech is extremely fast. | A female bass normal speed and whispering. | Please help me generate a woman with normal tone but low speaking volume and slow speaking speed. | This man said loudly that his speed is very slow, but the pitch is high. | This woman whispered, her speech was very slow, but her voice was very shrilly.

Without sentence rephrasing augmentation

Transformer-based encoder-decoder (mel-spectrogram) | His sound height is really high, the volume is very large, and he speaks quickly. | Please give me a normal tone with fast speaking and small voicewomen speak. | A male bass whispering in low pitch. | Normal tone, this man said slowly. | A female treble talked normally. | Please give me a normal tone with fast speaking and small voicewomen speak. | Her tone is very low, her speech is very slow, and her voice is very small. | Please generate a slow female voice. | A man who speaks very slowly, but he has a high tone. | A woman's voice is so quiet, but the tone is very high.
Transformer-based encoder-decoder (xvector) | Please generate a man who speaks very slowly for mehis speaking pitch and volume are very low. | A female treble talked normally. | Please give me a man low voice that speaks in a hurry and loudly. | A man at low speed and low pitch say. | A female treble talked normally. | Let a girl with a sharp tone fast but the sound is very quiet. | Her tone is very low, her speech is very slow, and her voice is very small. | Please generate a slow speaking female voice. | Ask a boy with louder voice to say slowly. | Please generate a slow female voice.
Transformer-based encoder-decoder (WavLM) | His sound height is really high, the volume is very large, and he speaks quickly. | Please help me generate a woman with normal tone but low speaking volume and slow speaking speed. | When a man is very loud but the tone is very low. | A man speaks average, a normal tone general tone. | High tone key, fast language, normal volume, female. | Her sound height is normal, but the speed is fast, the sound is very loud. | She said with her bass. | Please generate a slow female voice. | This man said loudly that his speed is very slow, but the pitch is high. | This woman whispered, her speech was very slow, but her voice was very shrilly.
StyleCap w/ GPT-2 (proposed) (mel-spectrogram + aggregation module) | His sound height is really high, the volume is very large, and he speaks quickly. | Please generate a slow female voice. | A male bass said slowly and loudly. | A man speaks average, a normal tone general tone. | High tone key, fast language, normal volume, female. | Her sound height is normal, but the speed of speech is extremely fast, the sound is very loud. | Her unique bass said. | Please generate a woman with normal tone but low speaking volume and slow speaking speed for me. | This man said loudly that his speed is very slow, but the pitch is high. | A woman's voice is so quiet, but the tone is very high.
StyleCap w/ GPT-2 (proposed) (xvector) | Please generate a man with a loud voice to say slowly. | Please generate a slow speaking female treble. | Please generate a faster man low voice for me, his tone is very low. | Normal tone, this man said slowly. | Please give me a normal tone but the sound is very quiet. | Ask a girl with a loud voice to speak quicklythe requirement is that her tone is relatively high. | Her tone is very low, her speech is very slow, and her voice is very small. | Please generate a slow speaking female voice. | Ask a boy with a loud voice to say a word quickly. | Please generate a woman with normal tone but low speaking volume and slow speaking speed for me.
StyleCap w/ GPT-2 (proposed) (WavLM + aggregation module) | His sound height is really high, the volume is very large, and he speaks quickly. | Please generate a woman with normal tone but low speaking volume and slow speaking speed for me. | Men who have a loud voice but low tone for you. | Ask a boy with a loud voice to say slowly. | A woman's voice is so quiet, but the tone is very high. | Her voice is normal, but the speed of speech is extremely fast, the sound is very loud. | Use her bass to say. | Please generate a slow female voice. | This man said loudly that his speed is very slow, but the pitch is high. | This woman whispered, her speech was very slow, but her voice was very shrilly.
StyleCap w/ Llama2 (proposed) (mel-spectrogram + aggregation module) | His tone is so high, the volume is very large, and he speaks quickly. | Please give me a woman with a loud sound and normal tone for me. | A man bass said slowly. | A man speaks average, a normal tone general tone. | A woman's voice is so quiet, but the tone is very high. | Her voice is big, but the speed of speech is extremely fast. | There is a woman who speaks slowly, with normal volume and low pitch. | Please generate a slow speaking female voice. | This man said loudly that his speed is very slow, but the pitch is high. | A woman's voice is so quiet, but the tone is very high.
StyleCap w/ Llama2 (proposed) (xvector) | His sound height is really high, the volume is very large, and he speaks quickly. | A woman's voice is so quiet, but the tone is very high. | Please generate a faster man low voice. | A man who says a normal voice of normal volume. | Please give me a normal tone and a low volume that speaks a fast voicewomen speak. | Let a girl with a sharp tone fast but the sound is very quiet. | Her tone is very low, her speech is very slow, and her voice is very small. | Please generate a slow female voice. | Please generate a fast male voice for me, his tone and speed speed are normal. | A female treble talked normally.
StyleCap w/ Llama2 (proposed) (WavLM + aggregation module) | His tone is so high, the volume is very large, and he speak very fast. | Please give me a woman with normal tone but low speaking volume and slow speaking speed. | When a man is very loud but the tone is very low. | A man speaks average, a normal tone general tone. | Let a girl with a sharp tone fast but the sound is very quiet. | Her voice is big, but the speed of speech is extremely fast. | She said with her bass. | Please generate a slow female voice. | This man said loudly that his speed is very slow, but the pitch is high. | This woman whispered, her speech was very slow, but her voice was very shrilly.

With sentence rephrasing augmentation

Transformer-based encoder-decoder (mel-spectrogram) | His tone is so high, the volume is very large, and he speak very fast. | Please give me a normal tone with fast speaking and small voicewomen speak. | A man bass said slowly. | A man speaks average, a normal tone general tone. | A female treble talked normally. | Please give me a normal tone with fast speaking and small voicewomen speak. | Her tone is very low, her speech is very slow, and her voice is very small. | Please generate a slow speaking female voice. | This man said loudly that his speed is very slow, but the pitch is high. | A woman's voice is so quiet, but the tone is very high.
Transformer-based encoder-decoder (xvector) | Please generate a man who speaks very slowly for mehis speaking pitch and volume are very low. | A female treble talked normally. | Please generate a faster man low voice. | A man who says a normal voice of normal volume. | A female treble talked normally. | A female treble talked normally. | Her tone is very low, her speech is very slow, and her voice is very small. | Please generate a slow female voice. | His sound height is really high, the volume is very large, and he speaks quickly. | Please generate a slow speaking female voice.
Transformer-based encoder-decoder (WavLM) | His sound height is really high, the volume is very large, and he speaks quickly. | A woman's voice is so quiet, but the tone is very high. | A male bass said slowly and loudly. | A man speaks average, a normal tone general tone. | A woman's voice is so quiet, but the tone is very high. | Her voice is big, but the speed of speech is extremely fast. | She said with her bass. | Please generate a slow female voice. | This man said loudly that his speed is very slow, but the pitch is high. | The speaking speed of women is very slow, but her voice is very sharp and her sound is quiet.
StyleCap w/ GPT-2 (proposed) (mel-spectrogram + aggregation module) | He speaks loudly, his volume is very high, and he speaks quickly. | Please generate a slow speaking female voice. | A male bass said slowly and loudly. | Normal tone, this man said slowly. | A female treble talked normally. | Her voice is fast and low, the sound is very loud. | Her unique bass said. | Please generate a slow speaking female voice. | This man said loudly that his speed is very slow, but the pitch is high. | The speaking speed of women is very slow, but her voice is very shrilly.
StyleCap w/ GPT-2 (proposed) (xvector) | Please generate a man with a loud voice for mehis pitch and volume are very low. | Please generate a woman with normal tone but low speaking volume and slow speaking speed. | Ask a boy with a loud voice to say a word quickly. | Normal tone, this man said slowly. | Normal tone, this woman said quickly. | Ask a girl with a loud voice to say quicklythe requirement is that she speak in a relatively high tone. | Her tone is very low, her speech is very slow, and her voice is very small. | Please generate a slow female voice. | Ask a boy with a loud voice to say a word quickly. | Please generate a woman with normal tone but low speaking volume and slow speaking speed for me.
StyleCap w/ GPT-2 (proposed) (WavLM + aggregation module) | His tone is very high, the volume is very large, and he speak quickly. | A woman's voice is so quiet, but her speed is very high. | When a man has a loud voice but a very low tone to speak. | He speaks in a loud voice. | Let a girl with a sharp tone fast but the sound is very quiet. | Her voice is normal, but the speed is fast. | She said with her bass. | Please generate a slow speaking female voice. | This man said loudly that his speed is very slow, but the pitch is high. | This woman whispered, her speech was very slow, but her voice was very shrilly.
StyleCap w/ Llama2 (proposed) (mel-spectrogram + aggregation module) | His tone is so high, the volume is very large, and he speak very fast. | Please give me a normal tone with fast speaking and small voicewomen speak. | A man bass said slowly. | A man who says a normal voice of normal volume. | A female treble talked normally. | Her sound height is normal, but the speed of speech is extremely fast, but the sound is very loud. | A woman who speaks very slowly, her tone is very low, she said loudly. | Please generate a slow speaking female voice. | This man said loudly that his speed is very slow, but the pitch is high. | A woman's voice is so quiet, but the tone is very high.
StyleCap w/ Llama2 (proposed) (xvector) | His sound height is really high, the volume is very large, and he speaks quickly. | A woman's voice is so quiet, but the tone is very high. | Please generate a faster man low voice. | A man who says a normal voice of normal volume. | A woman's voice is so quiet, but the tone is very high. | A female treble talked normally. | Her tone is very low, her speech is very slow, and her voice is very small. | Her tone is very low, her speech is very slow, and her voice is very small. | Please generate a fast male voice for me, his tone and speed speed are normal. | A female treble talked normally.
StyleCap w/ Llama2 (proposed) (WavLM + aggregation module) | His sound height is really high, the volume is very large, and he speaks quickly. | A woman's voice is so quiet, but the tone is very high. | When a man is very loud but the tone is very low. | He speaks loudly. | Let a girl with a sharp tone fast but the sound is very quiet. | Her voice is big, but the speed of speech is extremely fast. | She said with her bass. | Please generate a slow speaking female voice. | This man said loudly that his speed is very slow, but the pitch is high. | The speaking speed of women is very slow, but her voice is very sharp and her sound is quiet.

References