
Showing 1–1 of 1 results for author: Huaqiu, L

Searching in archive cs.
  1. arXiv:2601.22664

    cs.AI

    Real-Time Aligned Reward Model beyond Semantics

    Authors: Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Li Huaqiu, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

    Abstract: Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model and exploit spurious reward patterns instead of faithfully capturing human intent. Prior mitigations rely primarily on surface semantic information and fail to…

    Submitted 9 March, 2026; v1 submitted 30 January, 2026; originally announced January 2026.