-
Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale
Authors:
Ansh Nagwekar
Abstract:
Neural network optimization remains one of the most consequential yet poorly understood challenges in modern AI research, where improvements in training algorithms can lead to enhanced feature learning in foundation models, order-of-magnitude reductions in training time, and improved interpretability of how networks learn. While stochastic gradient descent (SGD) and its variants have become the de facto standard for training deep networks, their success in the over-parameterized regimes typical of modern models often appears more empirical than principled. This thesis investigates this apparent paradox by tracing the evolution of optimization algorithms from classical first-order methods to modern higher-order techniques, revealing how principled algorithmic design can demystify the training process. Starting from first principles with SGD and adaptive gradient methods, the analysis progressively uncovers the limitations of these conventional approaches when confronted with the anisotropy that is representative of real-world data. These breakdowns motivate the exploration of sophisticated alternatives rooted in curvature information: second-order approximation techniques, layer-wise preconditioning, adaptive learning rates, and more. The interplay between these optimization algorithms and the broader neural network training toolkit, which includes prior and recent developments such as maximal update parametrization, learning rate schedules, and exponential moving averages, emerges as equally essential to empirical success. To bridge the gap between theoretical understanding and practical deployment, the thesis offers practical prescriptions and implementation strategies for integrating these methods into modern deep learning workflows.
Submitted 20 December, 2025;
originally announced December 2025.
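As a concrete illustration of the curvature-aware, layer-wise methods surveyed above, the following is a minimal sketch of a Shampoo-style preconditioned update for a single weight matrix on a toy regression task. The function names, hyperparameters, and toy data are illustrative assumptions, not an implementation from the thesis.

```python
# A minimal sketch of a layer-wise (Shampoo-style) preconditioned update for one
# weight matrix. NumPy only; the toy regression task, step sizes, and statistic
# updates are illustrative assumptions.
import numpy as np

def shampoo_step(W, grad, L_stat, R_stat, lr=0.1, eps=1e-6):
    """One update: accumulate per-axis gradient statistics, then precondition the
    gradient by their inverse fourth roots (left for rows, right for columns)."""
    L_stat += grad @ grad.T          # row-space statistic (out_dim x out_dim)
    R_stat += grad.T @ grad          # column-space statistic (in_dim x in_dim)
    def inv_quarter(M):
        vals, vecs = np.linalg.eigh(M + eps * np.eye(len(M)))
        return vecs @ np.diag(vals ** -0.25) @ vecs.T
    return W - lr * inv_quarter(L_stat) @ grad @ inv_quarter(R_stat), L_stat, R_stat

# Toy linear regression with anisotropic inputs to mimic ill-conditioned data.
rng = np.random.default_rng(0)
out_dim, in_dim, n = 4, 8, 256
X = rng.standard_normal((n, in_dim)) * np.array([5.0, 5.0] + [0.2] * 6)
Y = X @ rng.standard_normal((in_dim, out_dim))
W = 0.1 * rng.standard_normal((out_dim, in_dim))
L_stat, R_stat = np.zeros((out_dim, out_dim)), np.zeros((in_dim, in_dim))

loss = lambda W: 0.5 * np.mean((X @ W.T - Y) ** 2)
print("initial loss:", loss(W))
for _ in range(300):
    grad = (W @ X.T - Y.T) @ X / n   # gradient of 0.5/n * ||X W^T - Y||_F^2
    W, L_stat, R_stat = shampoo_step(W, grad, L_stat, R_stat)
print("final loss:  ", loss(W))
```

Each axis of the weight matrix receives its own preconditioner built from accumulated gradient statistics, which is what distinguishes layer-wise schemes from entry-wise ("diagonal") methods such as Adam(W).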
-
On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning
Authors:
Thomas T. Zhang,
Behrad Moniri,
Ansh Nagwekar,
Faraz Rahman,
Anton Xue,
Hamed Hassani,
Nikolai Matni
Abstract:
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show that SGD is a suboptimal feature learner when one moves beyond the ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and the well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
Submitted 3 February, 2025;
originally announced February 2025.
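To make the anisotropy point tangible, here is a small numerical sketch (not the paper's experiments) contrasting plain gradient descent with a preconditioned update on linear regression with ill-conditioned inputs. The covariance spectrum, learning rates, and inverse-covariance preconditioner are illustrative assumptions.

```python
# A small numerical sketch: plain gradient descent vs. a preconditioned update on
# linear regression with anisotropic inputs x ~ N(0, Sigma). The inverse-covariance
# preconditioner is an illustrative stand-in for a per-axis layer-wise preconditioner.
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 2000
scales = np.logspace(-1, 2, d)                  # eigenvalues of Sigma span 3 decades
X = rng.standard_normal((n, d)) * np.sqrt(scales)
w_star = rng.standard_normal(d)
y = X @ w_star

Sigma_inv = np.linalg.inv(X.T @ X / n)          # estimated input covariance, inverted

def train(precondition, lr, steps=300):
    w = np.zeros(d)
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n               # full-batch gradient of 0.5/n * ||Xw - y||^2
        w -= lr * (Sigma_inv @ g if precondition else g)
    return np.mean((X @ w - y) ** 2)

print("plain GD loss:         ", train(precondition=False, lr=1e-2))
print("preconditioned GD loss:", train(precondition=True, lr=0.5))
```

Preconditioning by the inverse input covariance equalizes curvature across input directions, loosely mirroring the role the per-axis preconditioners play in the layer-wise methods studied in the paper.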
-
Using Game Theory to maximize the chance of victory in two-player sports
Authors:
Ambareesh Ravi,
Atharva Gokhale,
Anchit Nagwekar
Abstract:
Game Theory concepts have been successfully applied in a wide variety of domains over the past decade. Sports and games are among the most popular areas of game theory application, owing to its usefulness in resolving complex strategic scenarios. With recent advancements in technology, the technical and analytical assistance available to players before the match, during game-play, and after the match in the form of post-match analysis has improved to a great extent for virtually every sport. In this paper, we propose three novel approaches towards the development of a tool that assists players by providing detailed analysis of optimal decisions, so that a player is well prepared with the strategy most likely to produce a favourable result against a given opponent's strategy. We also describe how the system changes when we consider real-time game-play, wherein the history of the opponent's strategies in the current rally is also taken into consideration while suggesting the next strategy.
Submitted 24 May, 2021;
originally announced May 2021.
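As a toy illustration of the kind of recommendation such a tool could produce (not one of the paper's three proposed approaches), the sketch below estimates an opponent's shot distribution from rally history and picks the best-response shot from a hypothetical payoff matrix; all shot names and win probabilities are made up for illustration.

```python
# Illustrative best-response sketch for a two-player sport. The payoff matrix,
# shot labels, and rally history are hypothetical.
import numpy as np

# Rows: our shots; columns: opponent's shots; entries: probability we win the point.
payoff = np.array([
    [0.55, 0.40, 0.65],   # cross-court
    [0.60, 0.50, 0.35],   # down-the-line
    [0.45, 0.70, 0.50],   # drop shot
])
our_shots = ["cross-court", "down-the-line", "drop shot"]
opp_shots = ["cross-court", "down-the-line", "drop shot"]

def estimate_opponent_mix(history):
    """Empirical distribution of the opponent's shots observed so far."""
    counts = np.array([history.count(s) for s in opp_shots], dtype=float)
    return counts / counts.sum() if counts.sum() else np.full(len(opp_shots), 1 / len(opp_shots))

def best_response(opponent_mix):
    """Shot maximizing expected win probability against the estimated mix."""
    expected = payoff @ opponent_mix
    return our_shots[int(np.argmax(expected))], expected

history = ["cross-court", "cross-court", "drop shot", "down-the-line", "cross-court"]
mix = estimate_opponent_mix(history)
shot, expected = best_response(mix)
print("estimated opponent mix:    ", dict(zip(opp_shots, np.round(mix, 2))))
print("expected win prob per shot:", dict(zip(our_shots, np.round(expected, 2))))
print("recommended shot:          ", shot)
```

Moving from a single best response to a full equilibrium mixed strategy would require solving the underlying zero-sum game, for example via linear programming.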