-
Deep Probabilistic Supervision for Image Classification
Authors:
Anton Adelöw,
Matteo Gamba,
Atsuto Maki
Abstract:
Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model's own predictions, but often remain dependent on hard targets without explicitly modeling predictive uncertainty. With this in mind, we propose Deep Probabilistic Supervision (DPS), a principled learning framework that constructs sample-specific target distributions via statistical inference on the model's own predictions and remains independent of hard targets after initialization. We show that DPS consistently yields higher test accuracy (e.g., +2.0% for DenseNet-264 on ImageNet) and significantly lower Expected Calibration Error (ECE) (-40% for ResNet-50 on CIFAR-100) than existing self-distillation methods. When combined with a contrastive loss, DPS achieves state-of-the-art robustness under label noise.
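The calibration claim above is measured with Expected Calibration Error. As a reference point, here is a minimal NumPy sketch of the standard binned ECE estimator (our illustration, not the paper's code):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: weighted average gap between confidence and accuracy.

    probs:  (N, C) predicted class probabilities
    labels: (N,)   true class indices
    """
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(accuracies[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # bin weight = fraction of samples
    return ece
```

A perfectly calibrated model (confidence matches accuracy in every bin) gets ECE near zero; an overconfident one, as hard-target training tends to produce, gets a large positive value.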
Submitted 5 February, 2026; v1 submitted 30 December, 2025;
originally announced December 2025.
-
On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations
Authors:
Amir Mehrpanah,
Matteo Gamba,
Kevin Smith,
Hossein Azizpour
Abstract:
ReLU networks, while prevalent for visual data, have sharp transitions, sometimes relying on individual pixels for predictions, making vanilla gradient-based explanations noisy and difficult to interpret. Existing methods, such as GradCAM, smooth these explanations by producing surrogate models at the cost of faithfulness. We introduce a unifying spectral framework to systematically analyze and quantify smoothness, faithfulness, and their trade-off in explanations. Using this framework, we quantify and regularize the contribution of ReLU networks to high-frequency information, providing a principled approach to identifying this trade-off. Our analysis characterizes how surrogate-based smoothing distorts explanations, leading to an "explanation gap" that we formally define and measure for different post-hoc methods. Finally, we validate our theoretical findings across different design choices, datasets, and ablations.
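The smoothness-faithfulness tension is easy to see with a standard gradient-averaging surrogate (a SmoothGrad-style construction; our illustration, unrelated to the paper's spectral framework): the exact input gradient of a ReLU network is piecewise constant and jumps whenever a unit flips, while averaging gradients over input noise produces a smoother map that is no longer the true local gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fixed one-hidden-layer ReLU network: f(x) = w2 . relu(W1 x).
W1 = rng.standard_normal((32, 8))
w2 = rng.standard_normal(32)

def f(x):
    """The scalar function being explained."""
    return w2 @ np.maximum(W1 @ x, 0.0)

def grad(x):
    """Exact input gradient: piecewise constant, jumps at ReLU flips."""
    active = (W1 @ x > 0).astype(float)
    return (w2 * active) @ W1

def smooth_grad(x, sigma=0.5, n=500):
    """Surrogate explanation: average exact gradients over Gaussian noise."""
    noise = rng.standard_normal((n, x.size)) * sigma
    return np.mean([grad(x + eps) for eps in noise], axis=0)

x = rng.standard_normal(8)
g, sg = grad(x), smooth_grad(x)
```

The averaged map `sg` varies smoothly with `x` but describes a blurred surrogate of `f`, not `f` itself, which is exactly the gap the abstract formalizes.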
Submitted 14 August, 2025;
originally announced August 2025.
-
Curvature Tuning: Provable Training-free Model Steering From a Single Parameter
Authors:
Leyang Hu,
Matteo Gamba,
Randall Balestriero
Abstract:
The scaling of model and data sizes has reshaped the AI landscape, establishing finetuning pretrained models as the standard paradigm for solving downstream tasks. However, dominant finetuning methods typically rely on weight adaptation, often lack interpretability, and depend on heuristically chosen hyperparameters. In this paper, we take a different perspective and shift the focus from weights to activation functions, viewing them through the lens of spline operators. We propose Curvature Tuning (CT), an interpretable and principled steering method that modulates a model's decision boundary by injecting a single hyperparameter into its activation functions. We show that CT provably adjusts model decision boundary curvature and, more fundamentally, projects a model onto a space of smooth functions, thereby complementing current finetuning methods, whose effect lies primarily in feature adaptation. Making this hyperparameter trainable gives rise to a novel and highly parameter-efficient finetuning method. Empirically, CT improves both generalization and robustness. For example, it boosts downstream accuracy of ResNet-50/152 by 8.59%/8.34% over linear probing and 4.64%/1.70% over LoRA across 12 datasets, and improves robust accuracy on the $\ell_\infty$ benchmark from RobustBench by 1032.64%/1494.46%. Our code is available at https://github.com/Leon-Leyang/curvature-tuning.
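The core idea of steering a model with a single activation-level knob can be illustrated with a generic one-parameter smooth relaxation of ReLU (softplus with a temperature). Note this is only an illustration of the principle: CT's actual spline-based parameterization is defined in the paper, not here.

```python
import numpy as np

def smooth_relu(x, beta=5.0):
    """One-parameter smooth relaxation of ReLU: softplus with temperature.

    Illustrative only; Curvature Tuning uses its own spline-based
    parameterization. One scalar `beta` controls boundary smoothness:
    as beta -> inf, smooth_relu converges to ReLU, and smaller beta
    yields a smoother (lower-curvature) function.
    """
    # Numerically stable softplus: log(1 + exp(beta * x)) / beta.
    z = beta * np.asarray(x, dtype=float)
    return (np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))) / beta
```

Swapping every ReLU in a frozen network for such a parameterized activation, and treating the single scalar as the only tunable quantity, is the flavor of training-free (or one-parameter) steering the abstract describes.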
Submitted 15 January, 2026; v1 submitted 11 February, 2025;
originally announced February 2025.
-
On the Lipschitz Constant of Deep Networks and Double Descent
Authors:
Matteo Gamba,
Hossein Azizpour,
Mårten Björkman
Abstract:
Existing bounds on the generalization error of deep networks assume some form of smooth or bounded dependence on the input variable, falling short of investigating the mechanisms controlling such factors in practice. In this work, we present an extensive experimental study of the empirical Lipschitz constant of deep networks undergoing double descent, and highlight non-monotonic trends strongly correlating with the test error. Building a connection between parameter-space and input-space gradients for SGD around a critical point, we isolate two important factors -- namely loss landscape curvature and distance of parameters from initialization -- respectively controlling optimization dynamics around a critical point and bounding model function complexity, even beyond the training data. Our study presents novel insights on implicit regularization via overparameterization, and effective model complexity for networks trained in practice.
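An "empirical Lipschitz constant" is commonly estimated as a lower bound: the largest input-Jacobian spectral norm observed over a finite set of samples. A minimal sketch of that proxy (our illustration, not the paper's protocol), sanity-checked on a linear model whose true Lipschitz constant is its spectral norm:

```python
import numpy as np

def empirical_lipschitz(jacobian_fn, samples):
    """Lower-bound the Lipschitz constant by the largest input-Jacobian
    spectral norm seen over a finite sample set (a common empirical proxy)."""
    return max(np.linalg.norm(jacobian_fn(x), 2) for x in samples)

# Sanity check on a linear map f(x) = A x: the Jacobian is A everywhere,
# so the estimate equals the exact Lipschitz constant ||A||_2.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
lip = empirical_lipschitz(lambda x: A,
                          [rng.standard_normal(6) for _ in range(8)])
```

For a trained network, `jacobian_fn` would return the input Jacobian at each data point (e.g. via autodiff), and the non-monotonic trends the abstract reports are trends in this kind of estimate across model sizes and training epochs.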
Submitted 23 July, 2025; v1 submitted 28 January, 2023;
originally announced January 2023.
-
Deep Double Descent via Smooth Interpolation
Authors:
Matteo Gamba,
Erik Englesson,
Mårten Björkman,
Hossein Azizpour
Abstract:
The ability of overparameterized deep networks to interpolate noisy data, while at the same time showing good generalization performance, has been recently characterized in terms of the double descent curve for the test error. Common intuition from polynomial regression suggests that overparameterized networks are able to sharply interpolate noisy data, without considerably deviating from the ground-truth signal, thus preserving generalization ability. At present, a precise characterization of the relationship between interpolation and generalization for deep networks is missing. In this work, we quantify sharpness of fit of the training data interpolated by neural network functions, by studying the loss landscape w.r.t. the input variable locally around each training point, over volumes around cleanly- and noisily-labelled training samples, as we systematically increase the number of model parameters and training epochs. Our findings show that loss sharpness in the input space follows both model- and epoch-wise double descent, with worse peaks observed around noisy labels. While small interpolating models sharply fit both clean and noisy data, large interpolating models express a smooth loss landscape, where noisy targets are predicted over large volumes around training data points, in contrast to existing intuition.
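One crude way to probe the loss landscape over a volume around a training point, in the spirit of the study above (our sketch, not the paper's exact measure), is to average the loss increase over random perturbations on a sphere of fixed radius around the point:

```python
import numpy as np

def input_sharpness(loss_fn, x, radius=0.1, n=256, seed=0):
    """Proxy for local input-space sharpness: mean loss increase over
    random perturbations of norm `radius` around a training point x."""
    rng = np.random.default_rng(seed)
    base = loss_fn(x)
    deltas = rng.standard_normal((n, x.size))
    deltas *= radius / np.linalg.norm(deltas, axis=1, keepdims=True)
    return np.mean([loss_fn(x + d) for d in deltas]) - base
```

A sharply interpolating model yields a large value at noisily-labelled points (the loss climbs quickly off the training point), while the smooth interpolation described in the abstract yields values near zero over the same volume.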
Submitted 8 April, 2023; v1 submitted 20 September, 2022;
originally announced September 2022.
-
Are All Linear Regions Created Equal?
Authors:
Matteo Gamba,
Adrian Chmielewski-Anders,
Josephine Sullivan,
Hossein Azizpour,
Mårten Björkman
Abstract:
The number of linear regions has been studied as a proxy of complexity for ReLU networks. However, the empirical success of network compression techniques such as pruning and knowledge distillation suggests that in the overparameterized setting, linear regions density might fail to capture the effective nonlinearity. In this work, we propose an efficient algorithm for discovering linear regions and use it to investigate the effectiveness of density in capturing the nonlinearity of trained VGGs and ResNets on CIFAR-10 and CIFAR-100. We contrast the results with a more principled nonlinearity measure based on function variation, highlighting the shortcomings of linear regions density. Furthermore, interestingly, our measure of nonlinearity clearly correlates with model-wise deep double descent, connecting reduced test error with reduced nonlinearity, and increased local similarity of linear regions.
Submitted 23 February, 2022;
originally announced February 2022.
-
Hyperplane Arrangements of Trained ConvNets Are Biased
Authors:
Matteo Gamba,
Stefan Carlsson,
Hossein Azizpour,
Mårten Björkman
Abstract:
We investigate the geometric properties of the functions learned by trained ConvNets in the preactivation space of their convolutional layers, by performing an empirical study of hyperplane arrangements induced by a convolutional layer. We introduce statistics over the weights of a trained network to study local arrangements and relate them to the training dynamics. We observe that trained ConvNets show a significant statistical bias towards regular hyperplane configurations. Furthermore, we find that layers showing biased configurations are critical to validation performance for the architectures considered, trained on CIFAR10, CIFAR100 and ImageNet.
Submitted 14 April, 2023; v1 submitted 17 March, 2020;
originally announced March 2020.
-
Galerkin Methods for Boltzmann-Poisson transport with reflection conditions on rough boundaries
Authors:
Jose A. Morales Escalante,
Irene M. Gamba
Abstract:
We consider in this paper the mathematical and numerical modelling of reflective boundary conditions (BC) associated with Boltzmann - Poisson systems, including diffusive reflection in addition to specularity, in the context of electron transport in semiconductor device modelling at nano scales, and their implementation in Discontinuous Galerkin (DG) schemes. We study these BC on the physical boundaries of the device and develop a numerical approximation to model an insulating boundary condition, or equivalently, a pointwise zero flux mathematical condition for the electron transport equation. Such condition balances the incident and reflective momentum flux at the microscopic level, pointwise at the boundary, in the case of a more general mixed reflection with momentum-dependent specularity probability $p(\vec{k})$. We compare the computational prediction of physical observables given by the numerical implementation of these different reflection conditions in our DG scheme for BP models, and observe that the diffusive condition influences the kinetic moments over the whole domain in position space.
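A schematic form of such a mixed specular-diffusive reflection condition, in notation chosen here for illustration (the paper's precise condition and normalization may differ): for an incoming direction $\vec{k}$ with $\vec{k} \cdot \vec{n} < 0$ at a boundary point $x$ with inward normal $\vec{n}$,

```latex
f(t, x, \vec{k}) \;=\; p(\vec{k})\, f\bigl(t, x, \vec{k}^{*}\bigr)
  \;+\; \bigl(1 - p(\vec{k})\bigr)\, C_{x}
    \int_{\vec{k}' \cdot \vec{n} > 0} f(t, x, \vec{k}')\,
      \lvert \vec{k}' \cdot \vec{n} \rvert \, \mathrm{d}\vec{k}',
\qquad
\vec{k}^{*} = \vec{k} - 2\, (\vec{k} \cdot \vec{n})\, \vec{n},
```

with the constant $C_x$ chosen so that the pointwise zero-flux balance $\int f(t,x,\vec{k})\,(\vec{k}\cdot\vec{n})\,\mathrm{d}\vec{k} = 0$ holds at the boundary point. Pure specular reflection corresponds to $p \equiv 1$ and pure diffusive reflection to $p \equiv 0$.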
Submitted 26 February, 2018; v1 submitted 30 December, 2015;
originally announced December 2015.
-
Discontinuous Galerkin Deterministic Solvers for a Boltzmann-Poisson Model of Hot Electron Transport by Averaged Empirical Pseudopotential Band Structures
Authors:
Jose Morales-Escalante,
Irene M. Gamba,
Yingda Cheng,
Armando Majorana,
Chi-Wang Shu,
James Chelikowsky
Abstract:
The purpose of this work is to incorporate numerically, in a discontinuous Galerkin (DG) solver of a Boltzmann-Poisson model for hot electron transport, an electronic conduction band whose values are obtained by spherically averaging the full band structure given by a local empirical pseudopotential method (EPM) around a local minimum of the conduction band for silicon. This averaged band sits midway between a radial band model and an anisotropic full band, and provides a more accurate physical description of the electron group velocity and conduction energy band structure in a semiconductor. This gives a better quantitative description of the transport and collision phenomena that fundamentally define the behaviour of the Boltzmann - Poisson model for electron transport used in this work. The numerical values of the derivatives of this conduction energy band, needed for the description of the electron group velocity, are obtained by means of a cubic spline interpolation. The EPM-Boltzmann-Poisson transport with this spherically averaged EPM calculated energy surface is numerically simulated and compared to the output of traditional analytic band models such as the parabolic and Kane bands, also numerically implemented, for the case of 1D $n^+-n-n^+$ silicon diodes with 400nm and 50nm channels. Quantitative differences are observed in the kinetic moments related to the conduction energy band used, such as mean velocity, average energy, and electric current (momentum).
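The two analytic reference bands mentioned above have closed forms for the energy and group velocity $v = \hbar^{-1}\, d\varepsilon/dk$; a sketch in normalized units ($\hbar = m^* = 1$, illustrative non-parabolicity $\alpha$), as a point of comparison with the spline-interpolated EPM band (which is tabulated, not analytic):

```python
import numpy as np

HBAR = 1.0    # normalized units, for illustration only
M_EFF = 1.0
ALPHA = 0.5   # Kane non-parabolicity parameter (illustrative value)

def parabolic(k):
    """Parabolic band: eps = (hbar k)^2 / (2 m*), v = hbar k / m*."""
    eps = (HBAR * k) ** 2 / (2.0 * M_EFF)
    return eps, HBAR * k / M_EFF

def kane(k, alpha=ALPHA):
    """Kane band: eps (1 + alpha eps) = (hbar k)^2 / (2 m*).

    Solving the quadratic for eps and differentiating gives
    v = hbar k / (m* sqrt(1 + 4 alpha gamma)).
    """
    gamma = (HBAR * k) ** 2 / (2.0 * M_EFF)
    s = np.sqrt(1.0 + 4.0 * alpha * gamma)
    eps = (s - 1.0) / (2.0 * alpha)
    v = HBAR * k / (M_EFF * s)
    return eps, v
```

As $\alpha \to 0$ the Kane band recovers the parabolic one, and for $\alpha > 0$ the Kane group velocity is strictly smaller at the same $k$, which is one source of the quantitative differences in mean velocity and current reported in the abstract.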
Submitted 17 January, 2018; v1 submitted 16 December, 2015;
originally announced December 2015.