Computer Science > Computer Vision and Pattern Recognition

arXiv:2511.12449 (cs)

[Submitted on 16 Nov 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

Authors: Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng

Abstract: Recent Multimodal Large Language Models (MLLMs) have significantly advanced e-commerce product understanding. However, they still face three challenges: (i) the modality imbalance induced by modality-mixed training; (ii) underutilization of the intrinsic alignment relationships between visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced MultimOdal representation learning framework for e-commerce prOduct uNderstanding. It comprises: (1) a Modality-driven Mixture-of-Experts (MoE) that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further release MBE2.0, a co-augmented Multimodal representation Benchmark for E-commerce representation learning and evaluation at this https URL. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of MOON2.0's improved multimodal alignment.

Comments:	11 pages, 7 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:2511.12449 [cs.CV]
	(or arXiv:2511.12449v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2511.12449

Submission history

From: Zhanheng Nie
[v1] Sun, 16 Nov 2025 04:29:35 UTC (32,824 KB)
[v2] Tue, 24 Mar 2026 02:51:35 UTC (24,659 KB)
