Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations

NV-augmented Data Preprocessing

Explore how Affectron builds NV-augmented training data: splitting recordings into verbal and nonverbal (NV) segments, retrieving the top-3 emotion-aligned NV candidates, and creating the final augmented ground truth (GT).

1️⃣ Splitting Examples Preprocessing

2️⃣ Emotion-Driven Top-3 Matching

3️⃣ Final Augmented GT Examples
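The top-3 retrieval step above can be sketched as a cosine-similarity ranking in an emotion-embedding space. This is an illustrative assumption, not the page's actual implementation: the embedding model, the function name `top_k_nv_candidates`, and the scoring are all hypothetical.

```python
import numpy as np

def top_k_nv_candidates(utterance_emb, nv_embs, k=3):
    """Rank NV clips by emotional similarity to a target utterance.

    utterance_emb: (d,) emotion embedding of the verbal utterance
                   (hypothetical; source of embeddings is assumed).
    nv_embs:       (n, d) emotion embeddings of candidate NV clips.
    Returns indices of the top-k most emotion-aligned NV candidates.
    """
    # Normalize so the dot product equals cosine similarity.
    u = utterance_emb / np.linalg.norm(utterance_emb)
    nv = nv_embs / np.linalg.norm(nv_embs, axis=1, keepdims=True)
    sims = nv @ u
    # Sort descending by similarity and keep the k best candidates.
    return np.argsort(-sims)[:k]
```

Under this sketch, the top-3 candidates for each utterance would then be inserted into the transcript and waveform to form the augmented GT.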

NV Augmentation Comparison

Each sample provides four side-by-side audio examples generated with different NV augmentation strategies, enabling direct preference comparison on the same ground-truth utterance.

[1] H. Wang et al., "CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech," arXiv preprint arXiv:2506.02863, 2025.

Seen Speaker Synthesis

*EDNM: emotion-driven top-K NV matching
*EAR: emotion-aware top-K routing
*NSM: NV structural masking
*Augmented GT applies our NV augmentation to the ground truth
P. Peng et al., "VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.

Unseen Speaker Synthesis

*EDNM: emotion-driven top-K NV matching
*EAR: emotion-aware top-K routing
*NSM: NV structural masking
*Augmented GT applies our NV augmentation to the ground truth
P. Peng et al., "VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.

Zero-Shot NV Model Comparison

All synthesized audio is uniformly downsampled to a 16 kHz sampling rate to ensure fair comparison across models.
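The downsampling step can be sketched as follows. This is a minimal linear-interpolation stand-in for a proper polyphase resampler (e.g. `scipy.signal.resample_poly` or `librosa.resample`); the function name and approach are assumptions, not the page's actual pipeline.

```python
import numpy as np

def resample_to_16k(audio, sr):
    """Resample a mono waveform to 16 kHz via linear interpolation.

    audio: 1-D float array of samples at the original rate `sr`.
    Returns (resampled_audio, 16000). A production pipeline would
    use a proper anti-aliased resampler instead.
    """
    target_sr = 16000
    n_out = int(round(len(audio) * target_sr / sr))
    # Map both signals onto a common normalized time axis.
    t_in = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, audio), target_sr
```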
"Pre-trained" denotes a model from the official implementation without any training or fine-tuning on the EARS dataset.
"Fine-tuned" denotes a model fine-tuned on the EARS dataset starting from the official pre-trained checkpoint.
Nari-labs. "Dia," https://github.com/nari-labs/dia, 2025.
Z. Du et al., "CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models," arXiv preprint arXiv:2412.10117, 2024.
Z. Du et al., "CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training," arXiv preprint arXiv:2505.17589, 2025.
P. Peng et al., "VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.