Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

Ulsan National Institute of Science and Technology (UNIST)
CVPR 2026
C-MET Teaser

Comparison between our method and baseline approaches. Identity, lip, and pose are taken from a neutral video, while the emotion source is taken from MELD (dialogue 5, utterance 8). From top to bottom: ours (C-MET), the label-based method (EAT), the existing audio-based method (FLOAT), and the image-based method (EDTalk). Our results better reflect the target emotional speech (sarcastic), exhibiting a more pronounced widening of the lip corners than the baselines.

Abstract

Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals—and even benefit from expressive text-to-speech (TTS) synthesis—but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these challenges, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two emotional embeddings within each modality. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos—even for unseen extended emotions.

Method Overview

C-MET Method Overview

Overview of the proposed Cross-Modal Emotion Transfer (C-MET). (a) We extract input and target embeddings using pretrained audio and visual encoders, and compute the semantic vectors by subtracting the input embeddings from the target embeddings. (b) During training, we apply contrastive learning between multimodal tokens—both from visual to audio and from audio to visual—to align the representation spaces. (c) A multimodal transformer encoder regresses the target expression vectors, guided by the speech vectors. The predicted vectors are then added to the input visual embeddings, which are decoded by a pretrained visual decoder to reconstruct the target video from the neutral video.
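The two core ingredients above—additive semantic-vector transfer in embedding space, and a symmetric contrastive objective across modalities—can be sketched in a few lines. This is a minimal NumPy illustration, not the released implementation: the embeddings are random toy stand-ins for the pretrained encoders' outputs, `symmetric_infonce` is a generic two-direction InfoNCE loss assumed here as the contrastive objective, and the multimodal transformer that regresses visual vectors from audio vectors is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension; the real encoders are much larger

# Hypothetical stand-ins for pretrained encoder outputs.
a_input  = rng.normal(size=D)   # audio embedding of the neutral input speech
a_target = rng.normal(size=D)   # audio embedding of the emotional target speech
v_input  = rng.normal(size=D)   # expression embedding of the neutral video
v_target = rng.normal(size=D)   # expression embedding of the emotional video

# (a) Emotion semantic vectors: target minus input within each modality,
# so that input + vector recovers the target embedding.
s_audio  = a_target - a_input
s_visual = v_target - v_input

# (c) At inference a multimodal transformer predicts s_visual from s_audio;
# here we only check the additive transfer step itself.
v_edited = v_input + s_visual
assert np.allclose(v_edited, v_target)

def symmetric_infonce(A, V, tau=0.07):
    """(b) Two-direction InfoNCE between L2-normalized token batches
    A, V of shape (N, D); matched pairs sit on the diagonal."""
    logits = A @ V.T / tau
    labels = np.arange(len(A))

    def ce(l):  # cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (ce(logits) + ce(logits.T))
```

Minimizing the symmetric loss pulls each audio token toward its paired visual token (and vice versa), which is what allows an audio-space semantic vector to steer the visual expression space.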

Qualitative Results

Extended Emotion — Sarcastic

Qualitative comparison for the sarcastic extended emotion. All methods share the same neutral input video from HDTF. Baselines fail to capture the nuanced sarcastic expression, while C-MET successfully conveys it by leveraging cross-modal semantic vectors.

Input Ours (audio) EAT (label) FLOAT (audio) EDTalk (images)

Basic Emotion Editing

Emotion editing results across all basic emotions using a ChatGPT-4o generated identity image. C-MET produces faithful and expressive facial expressions for each emotion category.

Neutral

Angry

Contempt

Disgusted

Fear

Happy

Sad

Surprised


Extended Emotion Editing

Unlike label- or image-based methods, C-MET can generate unseen extended emotions by leveraging expressive TTS-derived speech (generated with Gemini 2.5). Six extended emotions are demonstrated on a ChatGPT-4o generated identity image.

Neutral

Charismatic

Desirous

Empathetic

Envious

Romantic

Sarcastic

BibTeX

@inproceedings{choi2026cross,
  title={Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video},
  author={Choi, Chanhyuk and Kim, Taesoo and Lee, Donggyu and Jung, Siyeol and Kim, Taehwan},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

Acknowledgements

This work was supported by the Ulsan National Institute of Science and Technology (UNIST). We thank the authors of emotion2vec, PD-FGC, FOMM, IP-LAP, EDTalk, and EmoKnob for releasing their code and models. This website is adapted from the Nerfies project page.