Figure 1: EmoDiffTalk supports fine-grained speech-driven 3DGS generation and expansive text-based emotion editing.
Recent photo-realistic 3D talking heads built on 3D Gaussian Splatting still fall short in emotional expression manipulation, especially for fine-grained and expansive dynamic emotion editing under multi-modal control. This paper introduces a new editable 3D Gaussian talking head, EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which comprises an action unit (AU) prompt Gaussian diffusion process serving as a fine-grained facial animator, together with an accurate text-to-AU emotion controller that enables precise and expansive dynamic emotion editing from text input. Experiments on the public EmoTalk3D and RenderMe-360 datasets demonstrate the superior emotional subtlety, lip-sync fidelity, and controllability of EmoDiffTalk over previous works.
This video demonstrates our pipeline, comparisons with SOTA methods (Hallo3), and text-driven emotion editing capabilities.
Figure 2: The Overall Pipeline. Canonical Gaussian Rig Reconstruction → AU-prompt Gaussian Diffusion → Text-to-AU Emotion Controller.
Our model encodes audio signals into Action Unit (AU) codes using a Speech-to-AU Encoder. These codes guide a Transformer-based diffusion model to predict fine-grained facial motion and Gaussian attributes.
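Below is a minimal sketch of how such AU-prompt conditioning of a Transformer diffusion denoiser could look. It is not the released implementation: the module names, dimensions, the AU count (17), and the additive conditioning scheme are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class AUPromptDenoiser(nn.Module):
    """Toy Transformer denoiser conditioned on per-frame AU codes (illustrative only)."""

    def __init__(self, au_dim=17, motion_dim=256, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)    # noisy per-frame motion latents
        self.au_proj = nn.Linear(au_dim, d_model)          # AU codes from the speech encoder
        self.t_embed = nn.Sequential(                      # diffusion timestep embedding
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.motion_out = nn.Linear(d_model, motion_dim)   # predicted clean motion / noise

    def forward(self, noisy_motion, au_codes, t):
        # noisy_motion: (B, T, motion_dim), au_codes: (B, T, au_dim), t: (B,) diffusion step
        h = self.motion_in(noisy_motion) + self.au_proj(au_codes)
        h = h + self.t_embed(t.float().view(-1, 1, 1))
        h = self.backbone(h)
        return self.motion_out(h)

# Toy forward pass with random tensors
model = AUPromptDenoiser()
x_t = torch.randn(2, 100, 256)    # noisy per-frame motion latents
au = torch.rand(2, 100, 17)       # AU activations predicted from speech
t = torch.randint(0, 1000, (2,))
pred = model(x_t, au, t)          # (2, 100, 256)
```

In practice the predicted motion would be decoded into per-Gaussian attribute offsets before rendering; the exact decoding and noise schedule follow the paper rather than this sketch.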
This module establishes a direct mapping from textual emotion prompts (e.g., "The person is smiling") to AU activation vectors. It allows for precise, semantic-level emotional editing without retraining the core rendering network.
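As a rough illustration of such a mapping, the sketch below feeds the prompt through a generic frozen text encoder and regresses AU activations in [0, 1] with a small MLP head. The encoder choice (distilbert-base-uncased), head size, and AU count are assumptions for the example, not the paper's controller.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class TextToAU(nn.Module):
    """Toy text-to-AU controller: frozen text encoder + small regression head (illustrative)."""

    def __init__(self, encoder_name="distilbert-base-uncased", au_dim=17):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        for p in self.encoder.parameters():
            p.requires_grad = False                 # keep the text encoder frozen
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, au_dim))

    def forward(self, prompts):
        batch = self.tokenizer(prompts, padding=True, return_tensors="pt")
        tokens = self.encoder(**batch).last_hidden_state          # (B, L, hidden)
        mask = batch["attention_mask"].unsqueeze(-1).float()       # mean-pool over valid tokens
        pooled = (tokens * mask).sum(1) / mask.sum(1)
        return torch.sigmoid(self.head(pooled))                   # (B, au_dim) AU activations

controller = TextToAU()
au_vec = controller(["The person is smiling", "A sad person with a down-turned mouth"])
```

The regression head here is untrained; in a real system it would be fit on paired (prompt, AU) data so that the predicted AU vector can directly drive the AU-prompt Gaussian diffusion without touching the rendering network.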
Results showing diverse expressions generated from text prompts.
Figure 3: Visualization results of text-guided emotion editing. The model handles prompts like "A sad person with a down-turned mouth" or "A confused person with a deep frown".
Figure 4: Handling nuanced prompts such as "An excited person with a radiant smile" or "A shy person with evasive eyes".
Figure 5: Demonstrating robustness with intense emotions like "A fearful person filled with fear" and "An angry man, fierce and imposing".
Comprehensive quantitative and qualitative evaluation against SOTA methods.
Table 1: Quantitative comparison on the EmoTalk3D dataset.

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | LMD ↓ | CPBD ↑ |
|---|---|---|---|---|---|
| EAMM [24] | 9.82 | 0.57 | 0.40 | 20.25 | 0.12 |
| SadTalker [57] | 9.90 | 0.64 | 0.47 | 43.49 | 0.11 |
| Real3D-Portrait [54] | 13.53 | 0.72 | 0.30 | 24.11 | 0.26 |
| EmoTalk3D [19] | 21.22 | 0.83 | 0.12 | 3.62 | 0.30 |
| Hallo3 [11] | 18.30 | 0.78 | 0.24 | 18.31 | 0.31 |
| EchoMimic [7] | 13.80 | 0.73 | 0.35 | 26.06 | 0.21 |
| Ours | 25.78 | 0.86 | 0.12 | 3.56 | 0.36 |

Table 2: Quantitative comparison on the RenderMe-360 dataset.

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | LMD ↓ | CPBD ↑ |
|---|---|---|---|---|---|
| EAMM [24] | 10.96 | 0.57 | 0.48 | 24.79 | 0.09 |
| SadTalker [57] | 12.20 | 0.65 | 0.41 | 26.80 | 0.19 |
| Real3D-Portrait [54] | 16.42 | 0.74 | 0.26 | 15.35 | 0.18 |
| EmoTalk3D [19] | 18.44 | 0.79 | 0.19 | 9.98 | 0.19 |
| Hallo3 [11] | 20.13 | 0.83 | 0.14 | 9.33 | 0.30 |
| EchoMimic [7] | 17.13 | 0.71 | 0.33 | 18.57 | 0.26 |
| Ours | 21.41 | 0.83 | 0.15 | 6.59 | 0.26 |
This project is funded by the Beijing Municipal Natural Science Foundation Undergraduate Student Research Program (25QY0304) and assisted by Prof. Hao Zhu's team from Nanjing University.

@misc{liu2025emodifftalkemotionawarediffusioneditable3d,
title={EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head},
author={Chang Liu and Tianjiao Jing and Chengcheng Ma and Xuanqi Zhou and Zhengxuan Lian and Qin Jin and Hongliang Yuan and Shi-Sheng Huang},
year={2025},
eprint={2512.05991},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.05991},
}