NV-augmented Data Preprocessing
Explore how Affectron builds NV-augmented training data in three steps:
(1) split recordings into verbal and non-verbal (NV) segments, (2) retrieve the top-3 emotion-aligned NV candidates, and (3) create the final augmented ground truth (GT).
1️⃣ Splitting Examples Preprocessing
2️⃣ Emotion-Driven Top-3 Matching
3️⃣ Final Augmented GT Examples
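The retrieval step above can be sketched as a nearest-neighbor search in an emotion-embedding space. This is a minimal illustration, not Affectron's actual implementation: the function name `top3_nv_candidates`, the use of cosine similarity, and the toy embeddings are all assumptions.

```python
import numpy as np

def top3_nv_candidates(utterance_emb: np.ndarray,
                       nv_embs: np.ndarray) -> np.ndarray:
    """Return indices of the 3 NV clips whose (assumed) emotion
    embeddings are most cosine-similar to the utterance's embedding."""
    # Normalize so a plain dot product equals cosine similarity.
    u = utterance_emb / np.linalg.norm(utterance_emb)
    nv = nv_embs / np.linalg.norm(nv_embs, axis=1, keepdims=True)
    sims = nv @ u
    # argsort is ascending; take the last 3 and reverse for best-first order.
    return np.argsort(sims)[-3:][::-1]

# Toy example: a bank of 4 NV clips in a 3-dim emotion space.
nv_bank = np.array([[1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])
print(top3_nv_candidates(query, nv_bank))  # → [0 1 2]
```

The top-3 candidates would then be spliced back into the verbal track to form the augmented GT shown in the examples above.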
NV Augmentation Comparison
Each sample provides four side-by-side audio examples generated with different NV augmentation strategies,
enabling direct preference comparison on the same ground-truth utterance.
[1] H. Wang et al., "CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech,"
arXiv preprint arXiv:2506.02863, 2025.
Zero-Shot NV Model Comparison
All synthesized audio is uniformly downsampled to 16 kHz to ensure a fair comparison across models.
♠ denotes a pre-trained model from the official implementation without any training or fine-tuning on the EARS dataset.
♣ denotes a model fine-tuned on the EARS dataset based on the official pre-trained checkpoint.
Nari Labs, "Dia," https://github.com/nari-labs/dia, 2025.
Z. Du et al., "CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models," arXiv preprint arXiv:2412.10117, 2024.
Z. Du et al., "CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training," arXiv preprint arXiv:2505.17589, 2025.
P. Peng et al., "VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild," In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.