EF-TTS:
High-fidelity Text-to-Speech through Conditional Flow Matching and Prosody Modeling with Large Speech Language Models
Haoyu Wang, Jiale Chen, Jiaxun Li, Sizhe Shan, Yuehai Wang
Department of Information and Electronic Engineering, Zhejiang University, China
Submitted to APSIPA 2025
Abstract.
Text-to-speech (TTS) technology has made significant advancements.
However, generating high-quality, controllable emotional speech remains challenging due to the complex nature of emotions and speaker variability.
In this paper, we propose EFTTS, a controllable, zero-shot emotional TTS model.
EFTTS extracts emotional features using a self-supervised emotion representation model and utilizes conditional flow matching to model emotion features with text inputs.
This approach allows for precise emotional control and decouples emotion from other paralinguistic features.
Additionally, we introduce zero-shot emotion transfer to generate emotionally appropriate speech that is closely aligned with the input text and reference speech.
Experimental results show that EFTTS outperforms existing methods in terms of emotional expressiveness, naturalness, and synthesis quality,
offering a promising solution for high-quality, controllable emotional speech synthesis.
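To make the conditional flow matching idea from the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the straight-line (optimal-transport) probability path, the `SIGMA_MIN` value, and the names `cfm_pair`, `cfm_loss`, `euler_sample`, and `predict` are all assumptions made for illustration. `predict` stands in for a hypothetical vector-field network, and `cond` stands in for the text and emotion conditioning.

```python
import random

SIGMA_MIN = 1e-4  # assumed small noise floor; not taken from the paper


def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) for the straight-line probability path.

    x0: noise sample, x1: target feature frame, t: time in [0, 1].
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predict, x1, cond, rng):
    """One-sample Monte Carlo estimate of the flow matching loss.

    predict(x_t, t, cond) is a hypothetical vector-field model.
    """
    t = rng.random()
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t, u_t = cfm_pair(x0, x1, t)
    v = predict(x_t, t, cond)
    return sum((a - b) ** 2 for a, b in zip(v, u_t)) / len(x1)


def euler_sample(predict, x0, cond, steps=10):
    """Integrate dx/dt = predict(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict(x, i * dt, cond)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

During training, the model would be fitted so that `predict` matches the target velocity `u_t`. At inference, `euler_sample` starts from noise and follows the learned field toward a feature frame consistent with the conditioning.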
Overview
Figure 1: Training diagram of our model. The pitch extractor, PL-BERT, and ASR modules are frozen during training. The emotion encoder extracts utterance-level features from speech.
Figure 2: Inference diagram of EFTTS.
Comparison with EmoDiff
Text | Emotion | EmoDiff | Ours |
---|---|---|---|
Her shoes were like fishes. | Angry | ||
They found a cow grazing in a field. | Happy | ||
And what are doves? And what are doves? | Sad | ||
Suppose I take grandmother a fresh vegetable? | Surprise |
Comparison with EmoSphere-TTS
Text | Emotion | EmoSphere | Ours |
---|---|---|---|
She may mind ye of her. | Angry | ||
She is now choosing skirt to wear. | Happy | ||
Story twenty nine a boy and a monkey. | Sad | ||
The nastiest things they saw were the cobwebs. | Surprise |
Controllable Emotional Speech Synthesis
Text | Emotion | Ground Truth | Synthesized |
---|---|---|---|
Poor Tom now is dead! | Neutral | ||
Then sadly it is much farther. | Angry | ||
Then there was a report. | Happy | ||
Andy what's the gyre and to gimble. | Sad | ||
But what about this thing, sticky! | Surprise |
Zero-shot Emotion Transfer
Text | Emotion | Ground Truth | Synthesized | Reference Audio |
---|---|---|---|---|
Ask god to help you. | Angry | |||
it looks much better. | ||||
Why should i care though David's lips were twitching? | ||||
He searched through the box. | Happy | |||
Fear neither root nor sprout! | ||||
Clear than clear water! | ||||
All smile were real and the happier the more sincere . | Neutral | |||
This speech roused dame ilse to anger. | ||||
What are you waiting for? man. | ||||
I say neither yea nor nay. | Sad | |||
I don't painted tiger. | ||||
It must come sometimes to jam a day. | ||||
To buy two pork chops. | Surprise | |||
The first year they sowed rye. | ||||
An hour out of Guildford town. |