Qualitative comparison on the extended emotion "sarcastic". All methods receive the same neutral input video from HDTF. The baselines fail to capture the nuanced sarcastic expression, whereas C-MET conveys it by leveraging cross-modal semantic vectors.