-
Universal Spectral Tokenization via Self-Supervised Panchromatic Representation Learning
Authors:
Jeff Shen,
Francois Lanusse,
Liam Holden Parker,
Ollie Liu,
Tom Hehir,
Leopoldo Sarra,
Lucas Meyer,
Micah Bowles,
Sebastian Wagner-Carena,
Sebastian Wagner-Carena,
Helen Qu,
Siavash Golkar,
Alberto Bietti,
Hatim Bourfoune,
Nathan Cassereau,
Pierre Cornette,
Keiya Hirashima,
Geraud Krawezik,
Ruben Ohana,
Nicholas Lourie,
Michael McCabe,
Rudy Morel,
Payel Mukhopadhyay,
Mariel Pettee,
Bruno Régaldo-Saint Blancard
, et al. (3 additional authors not shown)
Abstract:
Sequential scientific data span many resolutions and domains, and unifying them into a common representation is a key step toward developing foundation models for the sciences. Astronomical spectra exemplify this challenge: massive surveys have collected millions of spectra across a wide range of wavelengths and resolutions, yet analyses remain fragmented across spectral domains (e.g., optical vs.…
▽ More
Sequential scientific data span many resolutions and domains, and unifying them into a common representation is a key step toward developing foundation models for the sciences. Astronomical spectra exemplify this challenge: massive surveys have collected millions of spectra across a wide range of wavelengths and resolutions, yet analyses remain fragmented across spectral domains (e.g., optical vs. infrared) and object types (e.g., stars vs. galaxies), limiting the ability to pool information across datasets. We present a deep learning model that jointly learns from heterogeneous spectra in a self-supervised manner. Our universal spectral tokenizer processes spectra from a variety of object types and resolutions directly on their native wavelength grids, producing intrinsically aligned, homogeneous, and physically meaningful representations that can be efficiently adapted to achieve competitive performance across a range of downstream tasks. For the first time, we demonstrate that a single model can unify spectral data across resolutions and domains, suggesting that our model can serve as a powerful building block for foundation models in astronomy -- and potentially extend to other scientific domains with heterogeneous sequential data, such as climate and healthcare.
△ Less
Submitted 10 November, 2025; v1 submitted 20 October, 2025;
originally announced October 2025.
-
The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning
Authors:
Ruben Ohana,
Michael McCabe,
Lucas Meyer,
Rudy Morel,
Fruzsina J. Agocs,
Miguel Beneitez,
Marsha Berger,
Blakesley Burkhart,
Keaton Burns,
Stuart B. Dalziel,
Drummond B. Fielding,
Daniel Fortunato,
Jared A. Goldberg,
Keiya Hirashima,
Yan-Fei Jiang,
Rich R. Kerswell,
Suryanarayana Maddu,
Jonah Miller,
Payel Mukhopadhyay,
Stefan S. Nixon,
Jeff Shen,
Romain Watteaux,
Bruno Régaldo-Saint Blancard,
François Rozet,
Liam H. Parker
, et al. (2 additional authors not shown)
Abstract:
Machine learning based surrogate models offer researchers powerful tools for accelerating simulation-based workflows. However, as standard datasets in this space often cover small classes of physical behavior, it can be difficult to evaluate the efficacy of new approaches. To address this gap, we introduce the Well: a large-scale collection of datasets containing numerical simulations of a wide va…
▽ More
Machine learning based surrogate models offer researchers powerful tools for accelerating simulation-based workflows. However, as standard datasets in this space often cover small classes of physical behavior, it can be difficult to evaluate the efficacy of new approaches. To address this gap, we introduce the Well: a large-scale collection of datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain experts and numerical software developers to provide 15TB of data across 16 datasets covering diverse domains such as biological systems, fluid dynamics, acoustic scattering, as well as magneto-hydrodynamic simulations of extra-galactic fluids or supernova explosions. These datasets can be used individually or as part of a broader benchmark suite. To facilitate usage of the Well, we provide a unified PyTorch interface for training and evaluating models. We demonstrate the function of this library by introducing example baselines that highlight the new challenges posed by the complex dynamics of the Well. The code and data is available at https://github.com/PolymathicAI/the_well.
△ Less
Submitted 21 February, 2025; v1 submitted 30 November, 2024;
originally announced December 2024.
-
Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task
Authors:
Siavash Golkar,
Alberto Bietti,
Mariel Pettee,
Michael Eickenberg,
Miles Cranmer,
Keiya Hirashima,
Geraud Krawezik,
Nicholas Lourie,
Michael McCabe,
Rudy Morel,
Ruben Ohana,
Liam Holden Parker,
Bruno Régaldo-Saint Blancard,
Kyunghyun Cho,
Shirley Ho
Abstract:
Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datas…
▽ More
Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that no positional embeddings lead to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out of distribution performance is tightly linked to which tokens it uses as a bias term.
△ Less
Submitted 30 May, 2024;
originally announced June 2024.
-
Multiple Physics Pretraining for Physical Surrogate Models
Authors:
Michael McCabe,
Bruno Régaldo-Saint Blancard,
Liam Holden Parker,
Ruben Ohana,
Miles Cranmer,
Alberto Bietti,
Michael Eickenberg,
Siavash Golkar,
Geraud Krawezik,
Francois Lanusse,
Mariel Pettee,
Tiberiu Tesileanu,
Kyunghyun Cho,
Shirley Ho
Abstract:
We introduce multiple physics pretraining (MPP), an autoregressive task-agnostic pretraining approach for physical surrogate modeling of spatiotemporal systems with transformers. In MPP, rather than training one model on a specific physical system, we train a backbone model to predict the dynamics of multiple heterogeneous physical systems simultaneously in order to learn features that are broadly…
▽ More
We introduce multiple physics pretraining (MPP), an autoregressive task-agnostic pretraining approach for physical surrogate modeling of spatiotemporal systems with transformers. In MPP, rather than training one model on a specific physical system, we train a backbone model to predict the dynamics of multiple heterogeneous physical systems simultaneously in order to learn features that are broadly useful across systems and facilitate transfer. In order to learn effectively in this setting, we introduce a shared embedding and normalization strategy that projects the fields of multiple systems into a shared embedding space. We validate the efficacy of our approach on both pretraining and downstream tasks over a broad fluid mechanics-oriented benchmark. We show that a single MPP-pretrained transformer is able to match or outperform task-specific baselines on all pretraining sub-tasks without the need for finetuning. For downstream tasks, we demonstrate that finetuning MPP-trained models results in more accurate predictions across multiple time-steps on systems with previously unseen physical components or higher dimensional systems compared to training from scratch or finetuning pretrained video foundation models. We open-source our code and model weights trained at multiple scales for reproducibility.
△ Less
Submitted 10 December, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.