EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim and Seong-Whan Lee
Korea University

Abstract

Emotional text-to-speech (TTS) technology has achieved significant progress in recent years; however, challenges remain owing to the inherent complexity of emotions and the limitations of available emotional speech datasets and models. Previous studies typically relied on limited emotional speech datasets or required extensive manual annotations, restricting their ability to generalize across different speakers and emotional styles. In this paper, we present EmoSphere++, an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. Moreover, we propose a multi-level style encoder that ensures effective generalization for both seen and unseen speakers. We also introduce additional loss functions to enhance emotion transfer performance in zero-shot scenarios. We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps. Experimental results demonstrate the effectiveness of the proposed framework.
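The conditional flow matching decoder mentioned above admits a compact illustration. Below is a minimal sketch in PyTorch of the standard OT-CFM training objective and a few-step Euler sampler; the decoder signature, tensor shapes, and the `cond` conditioning vector are hypothetical stand-ins for illustration, not the paper's exact implementation.

```python
# Minimal OT-CFM sketch (hypothetical decoder interface, illustrative only).
import torch

def cfm_loss(decoder, x1, cond, sigma_min=1e-4):
    """Regress the decoder's vector field onto the straight-line flow
    from noise x0 to data x1 (the standard OT-CFM objective).
    x1: mel-spectrogram batch of shape (batch, n_mels, frames)."""
    x0 = torch.randn_like(x1)                            # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)   # per-example time
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1         # point on the path
    ut = x1 - (1 - sigma_min) * x0                       # target velocity
    vt = decoder(xt, t.view(-1), cond)                   # predicted field
    return torch.mean((vt - ut) ** 2)

@torch.no_grad()
def sample(decoder, cond, shape, n_steps=10):
    """Few-step Euler ODE solve from noise to a mel-spectrogram,
    mirroring the few-sampling-steps claim in the abstract."""
    x = torch.randn(shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt)
        x = x + decoder(x, t, cond) * dt
    return x
```

Because CFM learns a near-straight probability path, a handful of Euler steps (e.g., `n_steps=10`) is typically enough, which is what makes fast, high-quality synthesis feasible.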



Seen Non-Parallel Style Transfer

[1] H.-S. Oh et al., “DurFlex-EVC: Duration-flexible emotional voice conversion leveraging discrete representations without text alignment,” IEEE Trans. Affect. Comput., 2025.

[2] G. Zhang et al., “iEmoTTS: Toward robust cross-speaker emotion transfer and control for speech synthesis based on disentanglement between prosody and timbre,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 1693–1705, 2023.

[3] T. Li et al., “Cross-speaker emotion disentangling and transfer for end-to-end speech synthesis,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 1448–1460, 2022.

Unseen Non-Parallel Style Transfer

[1] H.-S. Oh et al., “DurFlex-EVC: Duration-flexible emotional voice conversion leveraging discrete representations without text alignment,” IEEE Trans. Affect. Comput., 2025.

[2] G. Zhang et al., “iEmoTTS: Toward robust cross-speaker emotion transfer and control for speech synthesis based on disentanglement between prosody and timbre,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 1693–1705, 2023.

[3] T. Li et al., “Cross-speaker emotion disentangling and transfer for end-to-end speech synthesis,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 1448–1460, 2022.

Seen Speaker Emotion Intensity Controllability

Unseen Speaker Emotion Intensity Controllability

Seen Speaker Emotional Style Shift

*Valence: positivity or negativity of emotion

*Arousal: the level of excitement or energy

*Dominance: the level of control within an emotional state

Unseen Speaker Emotional Style Shift

*Valence: positivity or negativity of emotion

*Arousal: the level of excitement or energy

*Dominance: the level of control within an emotional state

Comparison of Coordinate Transformation
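As a companion to this comparison, the sketch below shows one way the Cartesian-to-spherical transformation can work, assuming VAD pseudo-labels re-centered on a neutral-emotion centroid: the radius then reads as emotion intensity, and the two angles locate the emotional style on the sphere. The neutral center, axis conventions, and example values are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative Cartesian-to-spherical transform for VAD coordinates.
import numpy as np

def to_emotion_sphere(vad, neutral_center):
    """Map a (valence, arousal, dominance) point to (r, theta, phi)
    relative to the neutral-emotion center: r acts as intensity,
    (theta, phi) locate the emotional style on the sphere."""
    d = np.asarray(vad, dtype=float) - np.asarray(neutral_center, dtype=float)
    r = np.linalg.norm(d)                                    # intensity
    theta = np.arccos(np.clip(d[2] / r, -1.0, 1.0)) if r > 0 else 0.0
    phi = np.arctan2(d[1], d[0])                             # azimuth angle
    return r, theta, phi

# Example (hypothetical values): a point with higher valence and arousal
# than the neutral center yields a nonzero intensity r and angles that
# place its style in one octant of the sphere.
r, theta, phi = to_emotion_sphere([0.8, 0.7, 0.5], [0.5, 0.5, 0.5])
```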