EF-TTS: High-fidelity Text-to-Speech through Conditional Flow Matching and Prosody Modeling with Large Speech Language Models

Haoyu Wang, Jiale Chen, Jiaxun Li, Sizhe Shan, Yuehai Wang

Department of Information and Electronic Engineering, Zhejiang University, China

Submitted to APSIPA 2025

Abstract. Text-to-speech (TTS) technology has made significant advancements. However, generating high-quality, controllable emotional speech remains challenging due to the complex nature of emotions and speaker variability. In this paper, we propose EFTTS, a controllable, zero-shot emotional TTS model. EFTTS extracts emotional features using a self-supervised emotion representation model and utilizes conditional flow matching to model emotion features with text inputs. This approach allows for precise emotional control and decouples emotion from other paralinguistic features. Additionally, we introduce zero-shot emotion transfer to generate emotionally appropriate speech that is closely aligned with the input text and reference speech. Experimental results show that EFTTS outperforms existing methods in terms of emotional expressiveness, naturalness, and synthesis quality, offering a promising solution for high-quality, controllable emotional speech synthesis.
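As a rough illustration of the conditional flow matching idea mentioned in the abstract, the sketch below builds one training pair and integrates a vector field with Euler steps. It assumes the commonly used optimal-transport path between Gaussian noise and the target features; the abstract does not state which path, solver, or conditioning scheme EFTTS uses, so the function names (`cfm_training_pair`, `cfm_loss`, `euler_sample`), the `sigma_min` value, and the `cond` argument are all hypothetical stand-ins rather than the authors' implementation.

```python
import numpy as np


def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Build one conditional flow matching training example (assumed OT path).

    x1 is a target feature array (for example a mel-spectrogram).
    Returns the sampled time t, the interpolated point x_t, and the
    target velocity u_t that a network would be trained to predict.
    """
    x0 = rng.standard_normal(x1.shape)  # noise sample at t = 0
    t = rng.uniform()                   # time drawn from [0, 1)
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return t, x_t, u_t


def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))


def euler_sample(vector_field, x0, cond, n_steps=10):
    """Integrate dx/dt = vector_field(x, t, cond) from t = 0 to t = 1.

    cond stands for whatever conditioning the model receives
    (text and emotion features in an emotional TTS setting).
    """
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * vector_field(x, t, cond)
    return x
```

In a real system `vector_field` would be a trained neural network; here any callable with the signature `(x, t, cond)` works, which keeps the sketch independent of a particular deep learning framework.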

Overview

Figure 1: Training diagram of our model. The pitch extractor, PL-BERT, and ASR modules are frozen during training. The emotion encoder extracts utterance-level features from the speech.

Figure 2: Inference diagram of EFTTS.

Comparison with EmoDiff

Text | Emotion | EmoDiff | Ours
Her shoes were like fishes. | Angry | [audio] | [audio]
They found a cow grazing in a field. | Happy | [audio] | [audio]
And what are doves? And what are doves? | Sad | [audio] | [audio]
Suppose I take grandmother a fresh vegetable? | Surprise | [audio] | [audio]

Comparison with EmoSphere-TTS

Text | Emotion | EmoSphere | Ours
She may mind ye of her. | Angry | [audio] | [audio]
She is now choosing skirt to wear. | Happy | [audio] | [audio]
Story twenty nine a boy and a monkey. | Sad | [audio] | [audio]
The nastiest things they saw were the cobwebs. | Surprise | [audio] | [audio]

Controllable Emotional Speech Synthesis

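Text | Ground Truth | Synthesized | Emotion
Poor Tom now is dead! | [audio] | [audio] | Neutral
Then sadly it is much farther. | [audio] | [audio] | Angry
Then there was a report. | [audio] | [audio] | Happy
Andy what's the gyre and to gimble. | [audio] | [audio] | Sad
But what about this thing, sticky! | [audio] | [audio] | Surprise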

Zero-shot Emotion Transfer

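Text | Ground Truth | Synthesized | Emotion | Reference Audio
Ask god to help you. | [audio] | [audio] | Angry | [audio]
it looks much better. | [audio] | [audio] | Angry | [audio]
Why should i care though David's lips were twitching? | [audio] | [audio] | Angry | [audio]
He searched through the box. | [audio] | [audio] | Happy | [audio]
Fear neither root nor sprout! | [audio] | [audio] | Happy | [audio]
Clear than clear water! | [audio] | [audio] | Happy | [audio]
All smile were real and the happier the more sincere. | [audio] | [audio] | Neutral | [audio]
This speech roused dame ilse to anger. | [audio] | [audio] | Neutral | [audio]
What are you waiting for? man. | [audio] | [audio] | Neutral | [audio]
I say neither yea nor nay. | [audio] | [audio] | Sad | [audio]
I don't painted tiger. | [audio] | [audio] | Sad | [audio]
It must come sometimes to jam a day. | [audio] | [audio] | Sad | [audio]
To buy two pork chops. | [audio] | [audio] | Surprise | [audio]
The first year they sowed rye. | [audio] | [audio] | Surprise | [audio]
An hour out of Guildford town. | [audio] | [audio] | Surprise | [audio]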