-
Incentivizing Time-Aware Fairness in Data Sharing
Authors:
Jiangwei Chen,
Kieu Thao Nguyen Pham,
Rachael Hwee Ling Sim,
Arun Verma,
Zhaoxuan Wu,
Chuan-Sheng Foo,
Bryan Kian Hsiang Low
Abstract:
In collaborative data sharing and machine learning, multiple parties aggregate their data resources to train a machine learning model with better model performance. However, as the parties incur data collection costs, they are only willing to do so when guaranteed incentives, such as fairness and individual rationality. Existing frameworks assume that all parties join the collaboration simultaneously, which does not hold in many real-world scenarios. Due to the long processing time for data cleaning, difficulty in overcoming legal barriers, or unawareness, the parties may join the collaboration at different times. In this work, we propose the following perspective: As a party who joins earlier incurs higher risk and encourages the contribution from other wait-and-see parties, that party should receive a reward of higher value for sharing data earlier. To this end, we propose a fair and time-aware data sharing framework, including novel time-aware incentives. We develop new methods for deciding reward values to satisfy these incentives. We further illustrate how to generate model rewards that realize the reward values and empirically demonstrate the properties of our methods on synthetic and real-world datasets.
Submitted 22 October, 2025; v1 submitted 10 October, 2025;
originally announced October 2025.
-
Uncovering Scaling Laws for Large Language Models via Inverse Problems
Authors:
Arun Verma,
Zhaoxuan Wu,
Zijian Zhou,
Xiaoqiang Lin,
Zhiliang Chen,
Rachael Hwee Ling Sim,
Rui Qiao,
Jingtan Wang,
Nhung Bui,
Xinyuan Niu,
Wenyang Hu,
Gregory Kang Ruey Lau,
Zi-Yu Khoo,
Zitong Zhao,
Xinyi Xu,
Apivich Hemachandra,
See-Kiong Ng,
Bryan Kian Hsiang Low
Abstract:
Large Language Models (LLMs) are large-scale pretrained models that have achieved remarkable success across diverse domains. These successes have been driven by unprecedented complexity and scale in both data and computations. However, due to the high costs of training such models, brute-force trial-and-error approaches to improve LLMs are not feasible. Inspired by the success of inverse problems in uncovering fundamental scientific laws, this position paper advocates that inverse problems can also efficiently uncover scaling laws that guide the building of LLMs to achieve the desirable performance with significantly better cost-effectiveness.
Submitted 9 September, 2025;
originally announced September 2025.
-
WaterDrum: Watermarking for Data-centric Unlearning Metric
Authors:
Xinyang Lu,
Xinyuan Niu,
Gregory Kang Ruey Lau,
Bui Thi Cam Nhung,
Rachael Hwee Ling Sim,
John Russell Himawan,
Fanyu Wen,
Chuan-Sheng Foo,
See-Kiong Ng,
Bryan Kian Hsiang Low
Abstract:
Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. Existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when the forget and retain sets have semantically similar content and/or retraining the model from scratch on the retain set is impractical. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking to overcome these limitations. We introduce new benchmark datasets (with different levels of data similarity) for LLM unlearning that can be used to rigorously evaluate unlearning algorithms via WaterDrum. Our code is available at https://github.com/lululu008/WaterDrum and our new benchmark datasets are released at https://huggingface.co/datasets/Glow-AI/WaterDrum-Ax.
Submitted 2 February, 2026; v1 submitted 8 May, 2025;
originally announced May 2025.
-
DUPRE: Data Utility Prediction for Efficient Data Valuation
Authors:
Kieu Thao Nguyen Pham,
Rachael Hwee Ling Sim,
Quoc Phong Nguyen,
See Kiong Ng,
Bryan Kian Hsiang Low
Abstract:
Data valuation is increasingly used in machine learning (ML) to decide the fair compensation for data owners and identify valuable or harmful data for improving ML models. Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility (e.g., validation accuracy) and retraining the ML model for multiple data subsets. While most existing works on efficient estimation of the Shapley values have focused on reducing the number of subsets to evaluate, our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset. Our key contribution lies in the design of our GP kernel based on the sliced Wasserstein distance between empirical data distributions. In particular, we show that the kernel is valid and positive semi-definite, encodes prior knowledge of similarities between different data subsets, and can be efficiently computed. We empirically verify that DUPRE introduces low prediction error and speeds up data valuation for various ML models, datasets, and utility functions.
Submitted 22 February, 2025;
originally announced February 2025.
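The DUPRE abstract above describes predicting subset utilities with a GP whose kernel is built on the sliced Wasserstein distance. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: the function names, the quantile-based distance estimate, the exponentiated-distance kernel form, the zero-mean GP, and all default parameters are illustrative choices that do not come from the listing.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=64, n_quantiles=100, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance between two
    empirical distributions given as arrays of shape (n, d) and (m, d)."""
    rng = np.random.default_rng(seed)  # fixed seed: same directions for every pair
    dirs = rng.normal(size=(n_projections, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # random unit directions
    # Compare the 1-D projections through their quantiles, which also
    # handles subsets of different sizes.
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    px = np.quantile(X @ dirs.T, qs, axis=0)  # (n_quantiles, n_projections)
    py = np.quantile(Y @ dirs.T, qs, axis=0)
    return float(np.sqrt(np.mean((px - py) ** 2)))

def sw_kernel(A, B, lengthscale=1.0):
    """Assumed kernel form: exp(-SW^2 / (2 * lengthscale^2)). With fixed
    directions and quantiles this is an RBF kernel on quantile embeddings."""
    return np.exp(-sliced_wasserstein(A, B) ** 2 / (2 * lengthscale ** 2))

def gp_predict_utility(train_subsets, train_utils, query_subset, noise=1e-6):
    """Zero-mean GP posterior mean of the utility of query_subset."""
    K = np.array([[sw_kernel(a, b) for b in train_subsets] for a in train_subsets])
    k_star = np.array([sw_kernel(a, query_subset) for a in train_subsets])
    y = np.asarray(train_utils, dtype=float)
    alpha = np.linalg.solve(K + noise * np.eye(len(train_subsets)), y)
    return float(k_star @ alpha)
```

In this sketch, each training subset is a feature matrix whose utility was already measured by retraining, and `gp_predict_utility` replaces a retraining run for an unseen subset with one linear solve.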
-
On Newton's Method to Unlearn Neural Networks
Authors:
Nhung Bui,
Xinyang Lu,
Rachael Hwee Ling Sim,
See-Kiong Ng,
Bryan Kian Hsiang Low
Abstract:
With the widespread applications of neural networks (NNs) trained on personal data, machine unlearning has become increasingly important for enabling individuals to exercise their personal data ownership, particularly the "right to be forgotten" from trained NNs. Since retraining is computationally expensive, we seek approximate unlearning algorithms for NNs that return identical models to the retrained oracle. While Newton's method has been successfully used to approximately unlearn linear models, we observe that adapting it for NNs is challenging due to degenerate Hessians that make computing Newton's update impossible. Additionally, we show that when coupled with popular techniques to resolve the degeneracy, Newton's method often incurs excessively large norm updates and empirically degrades model performance post-unlearning. To address these challenges, we propose CureNewton's method, a principled approach that leverages cubic regularization to handle the Hessian degeneracy effectively. The added regularizer eliminates the need for manual finetuning and affords a natural interpretation within the unlearning context. Experiments across different models and datasets show that our method can achieve competitive unlearning performance to the state-of-the-art algorithm in practical unlearning settings, while being theoretically justified and efficient in running time.
Submitted 27 August, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
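To make the role of cubic regularization in the abstract above concrete, here is a minimal sketch of a generic cubic-regularized Newton step for a positive semi-definite, possibly singular Hessian. It is not the CureNewton algorithm itself: the function name, the fixed regularization constant `M`, and the bisection solver are assumptions made for illustration. The step minimizes g^T d + 0.5 d^T H d + (M/6)||d||^3, whose stationarity condition is (H + (M/2)||d|| I) d = -g.

```python
import numpy as np

def cubic_newton_step(grad, hess, M=1.0, iters=200):
    """Minimize g^T d + 0.5 d^T H d + (M/6)||d||^3 for a PSD (possibly
    singular) Hessian H by bisecting on the step norm r = ||d||."""
    g = np.asarray(grad, dtype=float)
    H = np.asarray(hess, dtype=float)
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0:
        return np.zeros_like(g)
    eye = np.eye(len(g))

    def step(r):
        # For r > 0 the shifted matrix is positive definite, so the solve
        # succeeds even when H itself is singular.
        return np.linalg.solve(H + 0.5 * M * r * eye, -g)

    # ||step(r)|| - r decreases in r and is non-positive at r = sqrt(2||g||/M).
    lo, hi = 0.0, np.sqrt(2.0 * g_norm / M)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return step(hi)
```

With `hess = diag(2, 0)` a plain Newton update has no solution because the Hessian cannot be inverted, whereas the cubic term keeps the step finite and bounds its norm by `sqrt(2 * ||g|| / M)`.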
-
Data-Centric AI in the Age of Large Language Models
Authors:
Xinyi Xu,
Zhaoxuan Wu,
Rui Qiao,
Arun Verma,
Yao Shu,
Jingtan Wang,
Xinyuan Niu,
Zhenfeng He,
Jiangwei Chen,
Zijian Zhou,
Gregory Kang Ruey Lau,
Hieu Dao,
Lucas Agussurja,
Rachael Hwee Ling Sim,
Xiaoqiang Lin,
Wenyang Hu,
Zhongxiang Dai,
Pang Wei Koh,
Bryan Kian Hsiang Low
Abstract:
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs, and yet it receives disproportionately low attention from the research community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and document research efforts and results, which can help promote openness and transparency in AI and LLM research.
Submitted 20 June, 2024;
originally announced June 2024.
-
Incentives in Private Collaborative Machine Learning
Authors:
Rachael Hwee Ling Sim,
Yehong Zhang,
Trong Nghia Hoang,
Xinyi Xu,
Bryan Kian Hsiang Low,
Patrick Jaillet
Abstract:
Collaborative machine learning involves training models on data from multiple parties but must incentivize their participation. Existing data valuation methods fairly value and reward each party based on shared data or model parameters but neglect the privacy risks involved. To address this, we introduce differential privacy (DP) as an incentive. Each party can select its required DP guarantee and perturb its sufficient statistic (SS) accordingly. The mediator values the perturbed SS by the Bayesian surprise it elicits about the model parameters. As our valuation function enforces a privacy-valuation trade-off, parties are deterred from selecting excessive DP guarantees that reduce the utility of the grand coalition's model. Finally, the mediator rewards each party with different posterior samples of the model parameters. Such rewards still satisfy existing incentives like fairness but additionally preserve DP and a high similarity to the grand coalition's posterior. We empirically demonstrate the effectiveness and practicality of our approach on synthetic and real-world datasets.
Submitted 2 April, 2024;
originally announced April 2024.
-
DeRDaVa: Deletion-Robust Data Valuation for Machine Learning
Authors:
Xiao Tian,
Rachael Hwee Ling Sim,
Jue Fan,
Bryan Kian Hsiang Low
Abstract:
Data valuation is concerned with determining a fair valuation of data from data sources to compensate them or to identify training examples that are the most or least useful for predictions. With the rising interest in personal data ownership and data protection regulations, model owners will likely have to fulfil more data deletion requests. This raises issues that have not been addressed by existing works: Are the data valuation scores still fair with deletions? Must the scores be expensively recomputed? The answer is no. To avoid recomputations, we propose using our data valuation framework DeRDaVa upfront for valuing each data source's contribution to preserving robust model performance after anticipated data deletions. DeRDaVa can be efficiently approximated and will assign higher values to data that are more useful or less likely to be deleted. We further generalize DeRDaVa to Risk-DeRDaVa to cater to risk-averse/seeking model owners who are concerned with the worst-/best-case model utility. We also empirically demonstrate the practicality of our solutions.
Submitted 21 January, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Probably Approximate Shapley Fairness with Applications in Machine Learning
Authors:
Zijian Zhou,
Xinyi Xu,
Rachael Hwee Ling Sim,
Chuan Sheng Foo,
Kian Hsiang Low
Abstract:
The Shapley value (SV) is adopted in various scenarios in machine learning (ML), including data valuation, agent valuation, and feature attribution, as it satisfies their fairness requirements. However, as exact SVs are infeasible to compute in practice, SVs are approximated instead. This approximation step raises an important question: do the SV estimates preserve the fairness guarantees of exact SVs? We observe that the fairness guarantees of exact SVs are too restrictive for SV estimates. Thus, we generalise Shapley fairness to probably approximate Shapley fairness and propose fidelity score, a metric to measure the variation of SV estimates, that determines how likely the fairness guarantees are to hold. Our last theoretical contribution is a novel greedy active estimation (GAE) algorithm that will maximise the lowest fidelity score and achieve a better fairness guarantee than the de facto Monte-Carlo estimation. We empirically verify that GAE outperforms several existing methods in guaranteeing fairness while remaining competitive in estimation accuracy in various ML scenarios using real-world datasets.
Submitted 1 December, 2022;
originally announced December 2022.
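The abstract above takes Monte-Carlo estimation as the de facto baseline for approximating Shapley values. A minimal sketch of that baseline (permutation sampling) is given here for orientation; it is not the paper's GAE algorithm, and the function name, the `utility` callback signature, and the default sample count are assumptions made for illustration.

```python
import random

def monte_carlo_shapley(players, utility, n_permutations=200, seed=0):
    """Estimate Shapley values by averaging each player's marginal
    contribution over randomly sampled orderings of the players.

    `utility` maps a frozenset of players to a real-valued coalition value."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(n_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = utility(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = utility(frozenset(coalition))
            totals[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: totals[p] / n_permutations for p in players}
```

For an additive utility every ordering gives the same marginal contributions, so the estimate is exact. For general utilities the estimates vary from run to run, which is the kind of variation the fidelity score in the abstract is meant to measure.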
-
Collaborative Machine Learning with Incentive-Aware Model Rewards
Authors:
Rachael Hwee Ling Sim,
Yehong Zhang,
Mun Choon Chan,
Bryan Kian Hsiang Low
Abstract:
Collaborative machine learning (ML) is an appealing paradigm to build high-quality ML models by training on the aggregated data from many parties. However, these parties are only willing to share their data when given enough incentives, such as a guaranteed fair reward based on their contributions. This motivates the need for measuring a party's contribution and designing an incentive-aware reward scheme accordingly. This paper proposes to value a party's reward based on Shapley value and information gain on model parameters given its data. Subsequently, we give each party a model as a reward. To formally incentivize the collaboration, we define some desirable properties (e.g., fairness and stability) which are inspired by cooperative game theory but adapted for our model reward that is uniquely freely replicable. Then, we propose a novel model reward scheme to satisfy fairness and trade off between the desirable properties via an adjustable parameter. The value of each party's model reward determined by our scheme is attained by injecting Gaussian noise into the aggregated training data with an optimized noise variance. We empirically demonstrate interesting properties of our scheme and evaluate its performance using synthetic and real-world datasets.
Submitted 24 October, 2020;
originally announced October 2020.