EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head

1 Beijing Normal University
2 Renmin University of China
3 Tencent AI Lab
EmoDiffTalk Teaser Figure

Figure 1: EmoDiffTalk supports fine-grained, speech-driven 3DGS generation and expansive text-based emotion editing.

Abstract

Recent photo-realistic 3D talking heads built on 3D Gaussian Splatting still fall short in emotional expression manipulation, especially for fine-grained and expansive dynamic emotion editing under multi-modal control. This paper introduces EmoDiffTalk, a new editable 3D Gaussian talking head. Our key idea is a novel Emotion-aware Gaussian Diffusion, which combines an action unit (AU)-prompted Gaussian diffusion process for fine-grained facial animation with an accurate text-to-AU emotion controller that enables expansive dynamic emotion editing from text input. Experiments on the public EmoTalk3D and RenderMe-360 datasets demonstrate the superior emotional subtlety, lip-sync fidelity, and controllability of EmoDiffTalk over previous works.

Demo Video

This video demonstrates our pipeline, comparisons with SOTA methods (e.g., Hallo3), and our text-driven emotion editing capabilities.

Methodology

Overall Pipeline

Figure 2: The Overall Pipeline. Canonical Gaussian Rig Reconstruction → AU-prompt Gaussian Diffusion → Text-to-AU Emotion Controller.

AU-prompt Gaussian Diffusion

Our model encodes audio signals into Action Unit (AU) codes using a Speech-to-AU Encoder. These codes guide a Transformer-based diffusion model to predict fine-grained facial motion and Gaussian attributes.
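
Below is a minimal PyTorch sketch of this idea: a Transformer denoiser conditioned on per-frame AU codes and a diffusion timestep, trained with standard DDPM noise prediction over sequences of Gaussian attributes. The module names, dimensions, and training step are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of an AU-conditioned Transformer diffusion denoiser for Gaussian attributes.
# All names and sizes (gauss_dim, au_dim, AUPromptDenoiser, ...) are assumptions.
import torch
import torch.nn as nn


class AUPromptDenoiser(nn.Module):
    """Transformer that denoises per-frame Gaussian offsets, conditioned on AU codes."""

    def __init__(self, gauss_dim=64, au_dim=32, d_model=256, n_layers=4, n_steps=1000):
        super().__init__()
        self.in_proj = nn.Linear(gauss_dim, d_model)
        self.au_proj = nn.Linear(au_dim, d_model)      # AU codes from the Speech-to-AU encoder
        self.t_embed = nn.Embedding(n_steps, d_model)  # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, gauss_dim)  # predicted noise on Gaussian attributes

    def forward(self, noisy_gauss, au_codes, t):
        # noisy_gauss: (B, T, gauss_dim), au_codes: (B, T, au_dim), t: (B,)
        h = self.in_proj(noisy_gauss) + self.au_proj(au_codes)
        h = h + self.t_embed(t)[:, None, :]            # broadcast timestep over frames
        return self.out_proj(self.backbone(h))


def ddpm_training_step(model, gauss_seq, au_codes, alphas_cumprod):
    """One standard DDPM noise-prediction step on a Gaussian-attribute sequence."""
    B = gauss_seq.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=gauss_seq.device)
    noise = torch.randn_like(gauss_seq)
    a = alphas_cumprod[t].view(B, 1, 1)
    noisy = a.sqrt() * gauss_seq + (1 - a).sqrt() * noise
    pred = model(noisy, au_codes, t)                   # predict the added noise
    return nn.functional.mse_loss(pred, noise)
```

At inference time the same denoiser would be run through the reverse diffusion steps, with the AU codes (from speech or from the text-to-AU controller below) held fixed as the conditioning signal.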

Text-to-AU Emotion Controller

This module establishes a direct mapping from textual emotion prompts (e.g., "The person is smiling") to AU activation vectors. It allows for precise, semantic-level emotional editing without retraining the core rendering network.
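
A minimal sketch of such a controller is shown below, assuming a frozen CLIP text encoder (via Hugging Face transformers) and a small trainable head that maps the prompt embedding to AU activations in [0, 1]. The head architecture and the blending rule used for editing are illustrative assumptions, not the paper's exact design.

```python
# Sketch of a text-to-AU emotion controller: frozen CLIP text encoder + MLP head.
# The head, the AU count, and edit_au_sequence are hypothetical, for illustration only.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel


class TextToAUController(nn.Module):
    def __init__(self, num_aus=32, clip_name="openai/clip-vit-base-patch32"):
        super().__init__()
        self.tokenizer = CLIPTokenizer.from_pretrained(clip_name)
        self.text_encoder = CLIPTextModel.from_pretrained(clip_name).eval()
        for p in self.text_encoder.parameters():       # keep the text encoder frozen
            p.requires_grad_(False)
        hidden = self.text_encoder.config.hidden_size
        self.head = nn.Sequential(                     # trainable embedding -> AU head
            nn.Linear(hidden, 256), nn.GELU(), nn.Linear(256, num_aus), nn.Sigmoid()
        )

    @torch.no_grad()
    def encode_text(self, prompts):
        tokens = self.tokenizer(prompts, padding=True, return_tensors="pt")
        return self.text_encoder(**tokens).pooler_output   # (B, hidden)

    def forward(self, prompts):
        return self.head(self.encode_text(prompts))        # (B, num_aus) AU activations


def edit_au_sequence(speech_aus, emotion_aus, strength=0.7):
    """Blend speech-driven AU codes with a text-derived emotion AU vector.
    speech_aus: (B, T, num_aus); emotion_aus: (B, num_aus)."""
    return (1 - strength) * speech_aus + strength * emotion_aus[:, None, :]


# Usage: derive an AU target from a prompt, then apply it to a speech-driven sequence.
# controller = TextToAUController()
# emo = controller(["The person is smiling"])
# edited = edit_au_sequence(speech_aus, emo)
```

Because only the small head and the AU conditioning change, the 3DGS rendering network itself stays fixed, which is what makes the editing possible without retraining it.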

Comparative Results

Comprehensive quantitative and qualitative evaluation against SOTA methods.

Qualitative Comparison

Identity 1 (Comparison Sample 1)

Identity 2 (Comparison Sample 2)

Identity 3 (Comparison Sample 3)

Quantitative Comparison on EmoTalk3D & RenderMe-360

EmoTalk3D

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | LMD ↓ | CPBD ↑ |
|---|---|---|---|---|---|
| EAMM [24] | 9.82 | 0.57 | 0.40 | 20.25 | 0.12 |
| SadTalker [57] | 9.90 | 0.64 | 0.47 | 43.49 | 0.11 |
| Real3D-Portrait [54] | 13.53 | 0.72 | 0.30 | 24.11 | 0.26 |
| EmoTalk3D [19] | 21.22 | 0.83 | 0.12 | 3.62 | 0.30 |
| Hallo3 [11] | 18.30 | 0.78 | 0.24 | 18.31 | 0.31 |
| EchoMimic [7] | 13.80 | 0.73 | 0.35 | 26.06 | 0.21 |
| Ours | 25.78 | 0.86 | 0.12 | 3.56 | 0.36 |

RenderMe-360

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | LMD ↓ | CPBD ↑ |
|---|---|---|---|---|---|
| EAMM [24] | 10.96 | 0.57 | 0.48 | 24.79 | 0.09 |
| SadTalker [57] | 12.20 | 0.65 | 0.41 | 26.80 | 0.19 |
| Real3D-Portrait [54] | 16.42 | 0.74 | 0.26 | 15.35 | 0.18 |
| EmoTalk3D [19] | 18.44 | 0.79 | 0.19 | 9.98 | 0.19 |
| Hallo3 [11] | 20.13 | 0.83 | 0.14 | 9.33 | 0.30 |
| EchoMimic [7] | 17.13 | 0.71 | 0.33 | 18.57 | 0.26 |
| Ours | 21.41 | 0.83 | 0.15 | 6.59 | 0.26 |

Acknowledgments

This project is funded by the Beijing Municipal Natural Science Foundation Undergraduate Student Research Program (25QY0304) and assisted by Prof. Hao Zhu's team from Nanjing University.

Citation

@misc{liu2025emodifftalkemotionawarediffusioneditable3d,
      title={EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head},
      author={Chang Liu and Tianjiao Jing and Chengcheng Ma and Xuanqi Zhou and Zhengxuan Lian and Qin Jin and Hongliang Yuan and Shi-Sheng Huang},
      year={2025},
      eprint={2512.05991},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.05991}, 
}