-
Characterizing Delusional Spirals through Human-LLM Chat Logs
Authors:
Jared Moore,
Ashish Mehta,
William Agnew,
Jacy Reese Anthis,
Ryan Louie,
Yifan Mai,
Peggy Yin,
Myra Cheng,
Samuel J Paech,
Kevin Klyman,
Stevie Chancellor,
Eric Lin,
Nick Haber,
Desmond C. Ong
Abstract:
As large language models (LLMs) have proliferated, disturbing anecdotal reports of negative psychological effects, such as delusions, self-harm, and "AI psychosis," have emerged in global media and legal discourse. However, it remains unclear how users and chatbots interact over the course of lengthy delusional "spirals," limiting our ability to understand and mitigate the harm. In our work, we analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use. Many of our participants come from a support group for such chatbot users. We also include chat logs from participants covered by media outlets in widely-distributed stories about chatbot-reinforced delusions. In contrast to prior work that speculates on potential AI harms to mental health, to our knowledge we present the first in-depth study of such high-profile and veridically harmful cases. We develop an inventory of 28 codes and apply it to the 391,562 messages in the logs. Codes include whether a user demonstrates delusional thinking (15.5% of user messages), a user expresses suicidal thoughts (69 validated user messages), or a chatbot misrepresents itself as sentient (21.2% of chatbot messages). We analyze the co-occurrence of message codes. We find, for example, that messages that declare romantic interest and messages where the chatbot describes itself as sentient occur much more often in longer conversations, suggesting that these topics could promote or result from user over-engagement and that safeguards in these areas may degrade in multi-turn settings. We conclude with concrete recommendations for how policymakers, LLM chatbot developers, and users can use our inventory and conversation analysis tool to understand and mitigate harm from LLM chatbots.
Warning: This paper discusses self-harm, trauma, and violence.
Submitted 17 March, 2026;
originally announced March 2026.
-
Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis
Authors:
Penny Chong,
Harshavardhan Abichandani,
Jiyuan Shen,
Atin Ghosh,
Min Pyae Moe,
Yifan Mai,
Daniel Dahlmeier
Abstract:
Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups or regex matching, adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role or expertise in the interaction, providing incomplete insights into the agent's performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency, and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals, such as tool signatures and responses, as natural language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both turn efficiency and intermediate progress of the agent, complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes inconsistencies between the judge and agents, uncovers common errors, and provides actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance, with peaks of 8-10% on our proposed metrics, after incorporating the identified error remedies into the agent's design.
Submitted 16 March, 2026;
originally announced March 2026.
-
OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
Authors:
Jiyuan Shen,
Peiyue Yuan,
Atin Ghosh,
Yifan Mai,
Daniel Dahlmeier
Abstract:
Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline, while simpler, can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve performance comparable to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schemas, exemplars, and instructions can further enhance MLLM performance. We hope this work can offer practical guidance and valuable insights for advancing document information extraction.
Submitted 3 March, 2026;
originally announced March 2026.
-
Large-scale and local environmental drivers of quenching: tracing H$α$ concentration in X-ray and optical galaxy groups
Authors:
Stefania Barsanti,
Di Wang,
Matthew Colless,
Ang Liu,
Esra Bulbul,
Matt S. Owers,
Scott M. Croom,
Benedetta Vulcani,
Julia J. Bryant,
Yifan Mai,
Sree Oh,
Andrei Ristea,
Sarah M. Sweet,
Jesse van de Sande
Abstract:
To explore the environmental mechanisms causing quenching in nearby star-forming galaxies, we study the variation with local and large-scale environments of a star formation concentration index, C-index $\equiv\log(r_{50,{\rm H}α}/r_{50,\rm cont})$, that traces the spatially-resolved distribution of H$α$ emission. Our analysis combines (i) GAMA spectroscopic redshift survey data to optically select galaxy groups and reconstruct the cosmic web, (ii) eROSITA data to identify X-ray-emitting groups, and (iii) SAMI Galaxy Survey data to characterise spatially-resolved star formation. We find that galaxies in X-ray+optical groups exhibit the lowest median C-index and the highest fraction of centrally-concentrated star-forming galaxies relative to optical groups and the field (independently of group or stellar mass). Star-forming galaxies in more X-ray luminous groups at fixed dynamical mass show more concentrated star formation. At large scales, nodes show the lowest median C-index and the highest fraction of centrally-concentrated star-forming galaxies relative to filaments and voids, which have similar C-index distributions. C-index correlates most strongly with the distance to the closest node, leaving no significant role for other local or large-scale environment metrics. Finally, regular star-forming galaxies tend to have spins aligned parallel to filaments, consistent with smooth gas accretion, while centrally-concentrated galaxies tend to have spins aligned perpendicular to filaments, likely driven by mergers and associated with bulge growth. These results suggest that multi-scale environmental processes, i.e. locally and at large scale, act to concentrate star formation toward galaxy centres, via gas-related mechanisms in nodes and ram-pressure stripping in X-ray+optical groups.
Submitted 16 February, 2026;
originally announced February 2026.
-
The SAMI Galaxy Survey: Quenching of Star Formation in Clusters III. Ram-Pressure-Affected Galaxy Populations
Authors:
Oğuzhan Çakır,
Matt S. Owers,
Luca Cortese,
Mina Pak,
Gabriella Quattropani,
Stefania Barsanti,
Julia J. Bryant,
Warrick J. Couch,
Scott M. Croom,
Pratyush K. Das,
Jon S. Lawrence,
Yifan Mai,
Andrei Ristea,
Sebastian F. Sánchez,
Sarah Sweet,
Jesse van de Sande,
Glenn van de Ven,
Sukyoung K. Yi
Abstract:
Cluster environments influence galaxy evolution by curtailing star formation activity, notably through ram-pressure stripping (RPS). In this study, using spatially resolved spectroscopic data from the SAMI Galaxy Survey, we identify galaxies undergoing or recently affected by RPS in eight nearby clusters ($0.029 < z < 0.058$), through a visual classification scheme based on the ionised gas ($\rm Hα+ [NII]λ6584$) morphologies, split into unperturbed, asymmetric, and truncated. The projected phase-space analysis shows that asymmetric galaxies are found in a narrow region in cluster-centric distance ($\rm 0.1 < R/R_{200} < 0.6$) and have a larger dispersion in line-of-sight velocity ($σ(|v_{pec}|)_\mathrm{Asym} = 0.71^{+0.09}_{-0.07}\ σ_{200}$) compared to the truncated and unperturbed samples. In terms of star formation activity, RPS candidates yield a much steeper resolved star-forming main sequence (rSFMS; $Σ_\mathrm{SFR} - Σ_\ast$) relation compared to the unperturbed counterparts, primarily emerging from having lower $Σ_\mathrm{SFR}$ values for the low mass density regime, with the steepest gradient deriving from the truncated sample. Moreover, radial star formation profiles reveal that star formation in RPS candidates is suppressed in the outskirts relative to unperturbed galaxies and is more prominent for the truncated sample. In contrast, central ($\rm r/r_{eff}<0.5$) star formation activity in RPS candidates is comparable with that in their unperturbed and field counterparts, suggesting no elevated activity. Taken together, this suggests an evolutionary trend linked to the RPS stage, where unperturbed galaxies likely represent recently accreted systems (pre-RPS), while asymmetric and truncated galaxies may correspond to populations undergoing RPS and post-RPS phases, respectively, favouring outside-in quenching.
Submitted 2 February, 2026;
originally announced February 2026.
-
ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models
Authors:
Shir Ashury-Tahan,
Yifan Mai,
Elron Bandel,
Michal Shmueli-Scheuer,
Leshem Choshen
Abstract:
Large language model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model's unique "failure signature", clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models, we generate ErrorAtlas, a taxonomy of model errors, revealing recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omissions of required details in the output and question misinterpretation. By shifting focus from where models succeed to why they fail, ErrorMap and ErrorAtlas enable advanced evaluation: one that exposes hidden weaknesses and directs progress. Unlike success, typically measured by task-level metrics, our approach introduces a deeper evaluation layer that can be applied globally across models and tasks, offering richer insights into model behavior and limitations. We make the taxonomy and code publicly available with plans to periodically update ErrorAtlas as new benchmarks and models emerge.
Submitted 17 February, 2026; v1 submitted 22 January, 2026;
originally announced January 2026.
-
The MAGPI Survey: co-evolution of baryons and dark matter in star-forming disk-like galaxies at $0.1 \lesssim z \lesssim 0.85$
Authors:
Gauri Sharma,
Andrew J. Battisti,
Emily Wisnioski,
J. Trevor Mendel,
Sabine Bellstedt,
Claudia Del P. Lagos,
Caroline Foster,
Adriano Poci,
Katherine E. Harborne,
Ryan Bagge,
Stefania Barsanti,
Joss Bland-Hawthorn,
Iris Breda,
Scott M. Croom,
Karl Glazebrook,
Yifan Mai,
Sarah M. Sweet,
Sabine Thater,
Lucas M. Valenzuela,
Glenn van de Ven,
Sukyoung Yi,
Tayyaba Zafar,
Bodo Ziegler
Abstract:
We present a comprehensive analysis of the dark matter (DM) content and its structural dependence in star-forming disk-like galaxies at intermediate redshifts ($0.1 \lesssim z \lesssim 0.85$), utilizing spatially resolved kinematic data from the MAGPI survey. We report the following: (1) Low stellar mass galaxies ($M_{\rm star} < 10^{9.5}\, M_\odot$) are strongly DM dominated across all radii, with average $\langle f_{_{\rm DM}} \rangle \sim 0.85$, while high-mass ($M_{\rm star} > 10^{10.5}\, M_\odot$) systems exhibit relatively low DM fractions in their inner regions ($\langle f_{_{\rm DM}} \rangle \sim 0.47$), comparable to those of local massive disk galaxies (e.g., the Milky Way and Andromeda). This suggests a mass-dependent structural dichotomy, most likely governed by a combination of internal galactic processes and environmental influences. (2) A tight inverse correlation between $f_{_{\rm DM}}$ and baryon mass surface density ($Σ_{\rm bar}$), with intrinsic scatter of $\sim 0.11$ dex. This is consistent with an inside-out baryon assembly scenario and suggests that the fundamental structural correlations of galaxies were already established by $z\sim 0.85$. (3) No significant evolution in $f_{_{\rm DM}}$ with redshift across the MAGPI window, and when combined with higher-redshift ($0.6 \leq z \leq 1.5$) data from Sharma et al. 2025, we quantitatively show that the reported decline in $f_{_{\rm DM}}(z)$ is most likely due to observational biases against low-mass systems at $z > 1$. These results offer empirical evidence for a scenario in which disk-like galaxies evolve through a co-regulated build-up of baryonic and DM components, preserving internal structural regularities (such as the total mass distribution and rotation-curve shape) throughout cosmic time.
Submitted 18 January, 2026;
originally announced January 2026.
-
Deep-learning-enabled inverse design of large-scale metasurfaces with full-wave accuracy
Authors:
Borui Xu,
Jingzhu Shao,
Xiangyu Zhao,
Haishan Xu,
Yudong Tian,
Nanxi Chen,
Jielin Sun,
Han Lin,
Qiaoliang Bao,
Yiyong Mai,
Chongzhao Wu
Abstract:
Recent advances in meta-optics have enabled diverse functionalities in compact optical devices; however, conventional forward design approaches become inadequate as device complexity and scale grow. Inverse design offers a powerful alternative but often requires massive computational resources and neglects mutual coupling effects. Here, we propose and experimentally validate a deep-learning-enabled framework for rapid inverse design of large-scale, aperiodic metasurfaces with full-wave accuracy. The framework integrates an inverse design network that maps target near-field responses to metasurface geometries in a non-iterative and scalable manner. A lightweight forward prediction network, integrated as a full-wave solver surrogate within the framework, enables efficient end-to-end training of the inverse design network while capturing mutual coupling effects by considering both local and neighboring geometries. The framework's effectiveness is experimentally verified through a multi-foci metalens and a holographic metasurface. This framework enables inverse design from micrometer to centimeter scales (> 20kλ), with near-field response discrepancies of less than 3% compared to full-wave solvers at subwavelength (< λ/10) resolution. Moreover, it is generalizable to metasurfaces of arbitrary size and operates efficiently without high-performance resources, overcoming the computational bottlenecks of previous inverse design methods.
Submitted 14 December, 2025;
originally announced December 2025.
-
The MAGPI Survey: forward modelled gas-phase metallicity gradients in galaxies at $z\sim 0.3$
Authors:
Yifan Mai,
Scott M. Croom,
Emily Wisnioski,
Andrew J. Battisti,
J. Trevor Mendel,
Marcie Mun,
Caroline Foster,
Katherine E. Harborne,
Claudia D. P. Lagos,
Iris Breda,
Tianmu Gao,
Kathryn Grasha,
Tamal Mukherjee,
Adriano Poci,
Rhea-Silvia Remus,
Piyush Sharda,
Sarah M. Sweet,
Sabine Thater,
Lucas M. Valenzuela,
Glenn van de Ven,
Tayyaba Zafar,
Bodo Ziegler
Abstract:
We measure the seeing-deconvolved gas-phase metallicity gradients of 70 star-forming galaxies at $z\sim 0.3$ from the MAGPI survey and investigate their relationship with galaxy properties to understand the mechanisms that influence the distribution of metals and shape the evolution of the galaxies. We use a Bayesian modelling technique, Blobby3D, which accounts for seeing effects (beam smearing) and can model the substructures of the flux distribution. The median metallicity gradient of our sample is $\nabla \mathrm{[O/H]}=-0.013^{+0.059}_{-0.033}$ dex/kpc. Among the galaxies in our sample, 32.9% have negative metallicity gradients (2$σ$ significance), 10.0% have positive gradients and 57.1% have flat gradients. The $\nabla \mathrm{[O/H]}$-$M_*$ relation of the MAGPI galaxies generally agrees with theoretical predictions, where a combination of stellar feedback, gas transport, and accretion shapes the metallicity profile, with the dominant processes varying with galaxy mass. We find a positive correlation between $\nabla \mathrm{[O/H]}$ and gas velocity dispersion ($r=0.36$), indicating that stronger gas turbulence is associated with flatter or inverted metallicity gradients, likely due to enhanced gas mixing. Additionally, smaller galaxies tend to have flatter or positive gradients, suggesting that metal dilution by gas accretion or removal via feedback-driven winds may outweigh metal enrichment in small galaxies.
Submitted 8 December, 2025;
originally announced December 2025.
-
Structured Prompts Improve Evaluation of Language Models
Authors:
Asad Aali,
Muhammad Ahmed Mohsin,
Vasiliki Bikia,
Arnav Singhvi,
Richard Gaus,
Suhana Bedi,
Hejie Cui,
Miguel Fuentes,
Alyssa Unell,
Yifan Mai,
Jordan Cahoon,
Michael Pfeffer,
Roxana Daneshjou,
Sanmi Koyejo,
Emily Alsentzer,
Christopher Potts,
Nigam H. Shah,
Akshay S. Chaudhari
Abstract:
As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks are essential for guiding deployment decisions. In practice, however, frameworks such as Holistic Evaluation of Language Models (HELM) typically evaluate models under a single static prompt configuration, even though model behavior depends strongly on prompt choice. As a result, reported scores can reflect prompt choice as much as model capability. Declarative prompting frameworks such as DSPy offer a scalable way to evaluate models under a set of structured prompting strategies rather than a static prompt configuration. We present a reproducible DSPy+HELM framework for studying how prompt choice impacts reported benchmark outcomes. Using five prompting methods, we evaluate four frontier and two open-source LMs across seven benchmarks against existing HELM baseline scores. By evaluating LMs across a family of prompt configurations, we find that prompt choice can materially impact leaderboard outcomes. In particular, structured prompting improves performance (by 6% on average) and alters comparisons (leaderboard rankings shift on 5/7 benchmarks), with most gains coming from introducing chain-of-thought and little additional benefit from more advanced optimizers. To our knowledge, this is the first study to systematically integrate structured prompting into an established evaluation framework and quantify how prompt choice alone can impact benchmark conclusions. We open-source (i) DSPy+HELM Evaluation (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
Submitted 1 April, 2026; v1 submitted 25 November, 2025;
originally announced November 2025.
-
findAbar: how astronomers may perceive the bar in galaxies differently
Authors:
Elizabeth J. Iles,
Joss Bland-Hawthorn,
Courtney Crawford,
Scott Croom,
Hillary Davis,
May Gade Pedersen,
Anne Green,
Madusha Gunawardhana,
Miguel Icaza-Lizaola,
Helen Johnston,
Emily F. Kerrison,
Yifan Mai,
Benjamin T. Montet,
Kovi Rose,
Tomas Rutherford,
Manasvee Saraf,
Ellen L. Sirks,
Eckhart Spalding,
Sujeeporn Tuntipong,
Jesse van de Sande,
Pavadol Yamsiri
Abstract:
Bars are ubiquitous morphological features in the observed distribution of galaxies. There are similarly many methods for classifying these features and, without a strict theoretical definition or common standard practice, this is often left to circumstance. So, we were concerned whether astronomers even agree on the bar which they perceive in a given galaxy and whether this could impact perceived scientific results. As an elementary test, twenty-one astronomers with varied experience in studying resolved galaxies have each assessed 200 galaxy images, spanning the early phase of bar evolution in two different barred galaxy simulations. We find variations exist within the classification of all the standard bar parameters assessed: bar length, axis-ratio, pitch-angle, and even whether a bar is present at all. If this is indicative of the wider community, it has implications for interpreting morphological trends, such as bar-end effects. Furthermore, we find that it is surprisingly not expertise but gender, followed by career stage, which gives rise to the largest discrepancies in the reported bar parameters. Currently, automation does not seem to be a viable solution: the two automated bar-finding algorithms we tested fail to find bars in snapshots where most astronomers agree a bar must exist. Increasing dependence on machine learning or crowdsourcing with a training dataset can only serve to obfuscate any existing biases if these originate from the specific astronomer producing the training material. On the strength of this small sample, we encourage an interim best practice to reduce the impact of any possible classification bias and set goals for the community to resolve the issue in the future.
Submitted 12 November, 2025;
originally announced November 2025.
-
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
Authors:
Sayash Kapoor,
Benedikt Stroebl,
Peter Kirgis,
Nitya Nadgir,
Zachary S Siegel,
Boyi Wei,
Tianci Xue,
Ziru Chen,
Felix Chen,
Saiteja Utpala,
Franck Ndzomga,
Dheeraj Oruganty,
Sophie Luskin,
Kangheng Liu,
Botao Yu,
Amit Arora,
Dongyoon Hahm,
Harsh Trivedi,
Huan Sun,
Juyong Lee,
Tengjun Jin,
Yifan Mai,
Yifei Zhou,
Yuxuan Zhu,
Rishi Bommasani
, et al. (6 additional authors not shown)
Abstract:
AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.
Submitted 13 October, 2025;
originally announced October 2025.
-
Hector Galaxy Survey: Data Processing, Quality Control and Early Science
Authors:
S. Oh,
M. L. P. Gunawardhana,
S. M. Croom,
G. Quattropani,
S. Tuntipong,
J. J. Bryant,
P. Corcho-Caballero,
P. K. Das,
O. Çakır,
J. H. Lee,
A. Ristea,
S. Barsanti,
M. Pak,
S. M. Sweet,
T. J. Woodrow,
T. Rutherford,
Y. Mai,
M. S. Owers,
M. Colless,
L. S. J. Stuart,
H. R. M. Zovaro,
S. P. Vaughan,
J. van de Sande,
T. Farrell,
M. Beom
, et al. (30 additional authors not shown)
Abstract:
The Hector Galaxy Survey is a new optical integral field spectroscopy (IFS) survey currently using the AAT to observe up to 15,000 galaxies at low redshift ($z < 0.1$). The Hector instrument employs 21 optical fibre bundles feeding into two double-beam spectrographs to enable wide-field multi-object IFS observations of galaxies. To efficiently process the survey data, we adopt the data reduction pipeline developed for the SAMI Galaxy Survey, with significant updates to accommodate Hector's dual-spectrograph system. These enhancements address key differences in spectral resolution and other instrumental characteristics relative to SAMI, and are specifically optimised for Hector's unique configuration. We introduce a two-dimensional arc fitting approach that reduces the RMS velocity scatter by a factor of 1.2--3.4 compared to fitting arc lines independently for each fibre. The pipeline also incorporates detailed modelling of chromatic optical distortion in the wide-field corrector, to account for wavelength-dependent spatial shifts across the focal plane. We assess data quality through a series of validation tests, including wavelength solution accuracy, spectral resolution, throughput characterisation, astrometric precision, sky subtraction residuals, and flux calibration stability (4% systematic offset when compared to Legacy Survey fluxes). We demonstrate that Hector delivers high-fidelity, science-ready datasets, supporting robust measurements of galaxy kinematics, stellar populations, and emission-line properties, and provide examples. Additionally, we address systematic uncertainties identified during the data processing and propose future improvements to enhance the precision and reliability of upcoming data releases. This work establishes a robust data reduction framework for Hector, delivering high-quality data products that support a broad range of extragalactic studies.
Submitted 30 September, 2025;
originally announced September 2025.
-
Constructive Conflict-Driven Multi-Agent Reinforcement Learning for Strategic Diversity
Authors:
Yuxiang Mai,
Qiyue Yin,
Wancheng Ni,
Pei Xu,
Kaiqi Huang
Abstract:
In recent years, diversity has emerged as a useful mechanism to enhance the efficiency of multi-agent reinforcement learning (MARL). However, existing methods predominantly focus on designing policies based on individual agent characteristics, often neglecting the interplay and mutual influence among agents during policy formation. To address this gap, we propose Competitive Diversity through Constructive Conflict (CoDiCon), a novel approach that incorporates competitive incentives into cooperative scenarios to encourage policy exchange and foster strategic diversity among agents. Drawing inspiration from sociological research, which highlights the benefits of moderate competition and constructive conflict in group decision-making, we design an intrinsic reward mechanism using ranking features to introduce competitive motivations. A centralized intrinsic reward module generates and distributes varying reward values to agents, ensuring an effective balance between competition and cooperation. By optimizing the parameterized centralized reward module to maximize environmental rewards, we reformulate the constrained bilevel optimization problem to align with the original task objectives. We evaluate our algorithm against state-of-the-art methods in the SMAC and GRF environments. Experimental results demonstrate that CoDiCon achieves superior performance, with competitive intrinsic rewards effectively promoting diverse and adaptive strategies among cooperative agents.
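The ranking-based intrinsic reward can be sketched minimally as follows (a hand-crafted, zero-sum bonus for illustration only; CoDiCon's module is a parameterized network trained in the bilevel loop described above):

```python
import numpy as np

def ranking_intrinsic_rewards(scores, scale=0.1):
    """Toy centralized intrinsic-reward module: agents are ranked by a
    per-episode performance feature, and higher-ranked agents receive a
    larger bonus, injecting mild competition into a cooperative task."""
    order = np.argsort(np.argsort(-np.asarray(scores)))  # 0 = best agent
    n = len(scores)
    # Linearly decreasing bonus from +scale (best) to -scale (worst),
    # zero-sum so the competition does not inflate the team's total reward.
    return scale * (1.0 - 2.0 * order / (n - 1))

env_rewards = np.array([1.0, 1.0, 1.0, 1.0])   # shared cooperative reward
features = np.array([0.9, 0.2, 0.5, 0.7])      # e.g. per-agent contribution
intrinsic = ranking_intrinsic_rewards(features)
total = env_rewards + intrinsic
print(total)
```

The zero-sum construction keeps the competitive signal from dominating the environmental objective, mirroring the balance between competition and cooperation the method aims for.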
Submitted 25 September, 2025; v1 submitted 16 September, 2025;
originally announced September 2025.
-
AHELM: A Holistic Evaluation of Audio-Language Models
Authors:
Tony Lee,
Haoqin Tu,
Chi Heem Wong,
Zijun Wang,
Siwei Yang,
Yifan Mai,
Yuyin Zhou,
Cihang Xie,
Percy Liang
Abstract:
Evaluations of audio-language models (ALMs) -- multimodal models that take interleaved audio and text as input and output text -- are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering -- to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness ($p=0.01$) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 6th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.
Submitted 2 September, 2025; v1 submitted 29 August, 2025;
originally announced August 2025.
-
The Singapore Consensus on Global AI Safety Research Priorities
Authors:
Yoshua Bengio,
Tegan Maharaj,
Luke Ong,
Stuart Russell,
Dawn Song,
Max Tegmark,
Lan Xue,
Ya-Qin Zhang,
Stephen Casper,
Wan Sie Lee,
Sören Mindermann,
Vanessa Wilfred,
Vidhisha Balachandran,
Fazl Barez,
Michael Belinsky,
Imane Bello,
Malo Bourgon,
Mark Brakel,
Siméon Campos,
Duncan Cass-Beggs,
Jiahao Chen,
Rumman Chowdhury,
Kuan Chua Seah,
Jeff Clune,
Juntao Dai
, et al. (63 additional authors not shown)
Abstract:
Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash.
The "2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety" aimed to support research in this space by bringing together AI scientists across geographies to identify and synthesise research priorities in AI safety. This resulting report builds on the International AI Safety Report chaired by Yoshua Bengio and backed by 33 governments. By adopting a defence-in-depth model, this report organises AI safety research domains into three types: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control).
Submitted 30 June, 2025; v1 submitted 25 June, 2025;
originally announced June 2025.
-
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks
Authors:
Suhana Bedi,
Hejie Cui,
Miguel Fuentes,
Alyssa Unell,
Michael Wornow,
Juan M. Banda,
Nikesh Kotecha,
Timothy Keyes,
Yifan Mai,
Mert Oez,
Hao Qiu,
Shrey Jain,
Leonardo Schettini,
Mehr Kashyap,
Jason Alan Fries,
Akshay Swaminathan,
Philip Chung,
Fateme Nateghi,
Asad Aali,
Ashwin Nayak,
Shivam Vedak,
Sneha S. Jain,
Birju Patel,
Oluseyi Fayanju,
Shreya Shah
, et al. (56 additional authors not shown)
Abstract:
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable this.
Submitted 2 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
LLMs Judging LLMs: A Simplex Perspective
Authors:
Patrick Vossler,
Fan Xia,
Yifan Mai,
Adarsh Subbaswamy,
Jean Feng
Abstract:
Given the challenge of automatically evaluating free-form outputs from large language models (LLMs), an increasingly common solution is to use LLMs themselves as the judging mechanism, without any gold-standard scores. Implicitly, this practice accounts for only sampling variability (aleatoric uncertainty) and ignores uncertainty about judge quality (epistemic uncertainty). While this is justified if judges are perfectly accurate, it is unclear when such an approach is theoretically valid and practically robust. We study these questions for the task of ranking LLM candidates from a novel geometric perspective: for $M$-level scoring systems, both LLM judges and candidates can be represented as points on an $(M-1)$-dimensional probability simplex, where geometric concepts (e.g., triangle areas) correspond to key ranking concepts. This perspective yields intuitive theoretical conditions and visual proofs for when rankings are identifiable; for instance, we provide a formal basis for the ``folk wisdom'' that LLM judges are more effective for two-level scoring ($M=2$) than multi-level scoring ($M>2$). Leveraging the simplex, we design geometric Bayesian priors that encode epistemic uncertainty about judge quality and vary the priors to conduct sensitivity analyses. Experiments on LLM benchmarks show that rankings based solely on LLM judges are robust in many but not all datasets, underscoring both their widespread success and the need for caution. Our Bayesian method achieves substantially higher coverage rates than existing procedures, highlighting the importance of modeling epistemic uncertainty.
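The simplex representation can be made concrete with a small numerical sketch (the candidate distributions and confusion matrix below are hypothetical; the paper's identifiability conditions are more general): each candidate is a point on the $(M-1)$-simplex via its distribution over $M$ score levels, and a judge acts as a column-stochastic confusion matrix mapping true levels to reported levels.

```python
import numpy as np

# Three candidates on the 2-simplex (M = 3 score levels): each row is a
# distribution over quality levels 1..3 (illustrative values).
candidates = np.array([[0.7, 0.2, 0.1],   # mostly low quality
                       [0.2, 0.5, 0.3],
                       [0.1, 0.2, 0.7]])  # mostly high quality

# A diagonally dominant (i.e. fairly accurate) judge: J[i, j] is the
# probability of reporting level i when the true level is j.
J = np.array([[0.80, 0.15, 0.05],
              [0.15, 0.70, 0.15],
              [0.05, 0.15, 0.80]])

levels = np.arange(1, 4)
observed = candidates @ J.T               # reported-score distributions
true_rank = np.argsort(candidates @ levels)
judged_rank = np.argsort(observed @ levels)
print(true_rank, judged_rank)
```

With an accurate enough judge the expected reported score preserves the candidates' true ordering; as the judge's confusion matrix flattens toward uniform, the mapping contracts toward the simplex centre and the ranking becomes unidentifiable, which is the kind of geometric condition the paper formalizes.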
Submitted 5 April, 2026; v1 submitted 28 May, 2025;
originally announced May 2025.
-
Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
Authors:
Tencent Hunyuan Team,
Ao Liu,
Botong Zhou,
Can Xu,
Chayse Zhou,
ChenChen Zhang,
Chengcheng Xu,
Chenhao Wang,
Decheng Wu,
Dengpeng Wu,
Dian Jiao,
Dong Du,
Dong Wang,
Feng Zhang,
Fengzong Lian,
Guanghui Xu,
Guanwei Zhang,
Hai Wang,
Haipeng Luo,
Han Hu,
Huilin Xu,
Jiajia Wu,
Jianchen Zhu,
Jianfeng Yan,
Jiaqi Zhu
, et al. (230 additional authors not shown)
Abstract:
As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.
Submitted 4 July, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
Code2API: A Tool for Generating Reusable APIs from Stack Overflow Code Snippets
Authors:
Yubo Mai,
Zhipeng Gao,
Xing Hu,
Lingfeng Bao,
Jingyuan Chen,
Jianling Sun
Abstract:
Nowadays, developers often turn to Stack Overflow for solutions to daily problems; however, these code snippets are partial and cannot be tested and verified properly. One way to test these code snippets is to transform them into APIs (Application Program Interfaces) that developers can directly invoke and execute. However, it is often costly and error-prone for developers to manually perform this transformation (referred to as the APIzation task) due to the different actions to be taken (e.g., summarizing proper method names, inferring the input parameter list and return statements). To help developers quickly reuse code snippets from Stack Overflow, in this paper we propose Code2API, a Google Chrome extension that uses Large Language Models (LLMs) to automatically perform APIzation of code snippets on Stack Overflow. Code2API guides LLMs through well-designed prompts to generate reusable APIs, using Chain-of-Thought reasoning and few-shot in-context learning to help LLMs understand and solve the APIzation task in a developer-like manner. The evaluation results show that Code2API significantly outperforms the rule-based approach.
Submitted 19 April, 2025;
originally announced April 2025.
-
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
Authors:
Shaona Ghosh,
Heather Frase,
Adina Williams,
Sarah Luger,
Paul Röttger,
Fazl Barez,
Sean McGregor,
Kenneth Fricklas,
Mala Kumar,
Quentin Feuillade--Montixi,
Kurt Bollacker,
Felix Friedrich,
Ryan Tsang,
Bertie Vidgen,
Alicia Parrish,
Chris Knotz,
Eleonora Presani,
Jonathan Bennion,
Marisa Ferrara Boston,
Mike Kuniavsky,
Wiebke Hutiri,
James Ezick,
Malek Ben Salem,
Rajat Sahay,
Sujata Goswami
, et al. (77 additional authors not shown)
Abstract:
The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories, including violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (election, financial, health, legal). Our method incorporates a complete assessment standard, extensive prompt datasets, a novel evaluation framework, a grading and reporting system, and the technical as well as organizational infrastructure for long-term support and evolution. In particular, the benchmark employs an understandable five-tier grading scale (Poor to Excellent) and incorporates an innovative entropy-based system-response evaluation.
In addition to unveiling the benchmark, this report also identifies limitations of our method and of building safety benchmarks generally, including evaluator uncertainty and the constraints of single-turn interactions. This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multiturn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories. Our findings provide valuable insights for model developers, system integrators, and policymakers working to promote safer AI deployment.
Submitted 18 April, 2025; v1 submitted 19 February, 2025;
originally announced March 2025.
-
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Authors:
Shir Ashury-Tahan,
Yifan Mai,
Rajmohan C,
Ariel Gera,
Yotam Perlitz,
Asaf Yehudai,
Elron Bandel,
Leshem Choshen,
Eyal Shnarch,
Percy Liang,
Michal Shmueli-Scheuer
Abstract:
Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.
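Testing across table representation formats amounts to serializing one underlying table several ways and prompting the model with each; a minimal sketch (the helper names and the three formats chosen here are illustrative, not ToRR's exact configuration):

```python
import csv
import io
import json

# One underlying table, serialized into three common prompt formats.
table = {"columns": ["city", "pop_m"], "rows": [["Paris", 2.1], ["Lyon", 0.5]]}

def to_markdown(t):
    head = "| " + " | ".join(t["columns"]) + " |"
    sep = "|" + "---|" * len(t["columns"])
    body = ["| " + " | ".join(str(v) for v in r) + " |" for r in t["rows"]]
    return "\n".join([head, sep, *body])

def to_csv(t):
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(t["columns"])
    w.writerows(t["rows"])
    return buf.getvalue().strip()

def to_json(t):
    return json.dumps([dict(zip(t["columns"], r)) for r in t["rows"]])

for serialize in (to_markdown, to_csv, to_json):
    print(serialize(table))
    print("---")
```

A robustness evaluation in this spirit asks the same question against every serialization and reports the spread in accuracy, which is what exposes the brittle behavior the paper describes.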
Submitted 17 February, 2026; v1 submitted 26 February, 2025;
originally announced February 2025.
-
SEA-HELM: Southeast Asian Holistic Evaluation of Language Models
Authors:
Yosephine Susanto,
Adithya Venkatadri Hulagadri,
Jann Railey Montalan,
Jian Gang Ngui,
Xian Bin Yong,
Weiqi Leong,
Hamsawardhini Rengarajan,
Peerat Limkonchotiwat,
Yifan Mai,
William Chandra Tjhi
Abstract:
With the rapid emergence of novel capabilities in Large Language Models (LLMs), the need for rigorous, integrated multilingual and multicultural benchmarks has become more pronounced. Though existing LLM benchmarks are capable of evaluating specific capabilities of LLMs in English as well as in various mid- to low-resource languages, including those in the Southeast Asian (SEA) region, a comprehensive and culturally representative evaluation suite for the SEA languages has not been developed thus far. Here, we present SEA-HELM, a holistic linguistic and cultural LLM evaluation suite that emphasises SEA languages, comprising five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, and (5) Safety. SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese. We also introduce the SEA-HELM leaderboard, which allows users to understand models' multilingual and multicultural performance in a systematic and user-friendly manner. We make the SEA-HELM evaluation code publicly available.
Submitted 2 June, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
Authors:
Xianfu Cheng,
Wei Zhang,
Shiwei Zhang,
Jian Yang,
Xiangyuan Guan,
Xianjie Wu,
Xiang Li,
Ge Zhang,
Jiaheng Liu,
Yuying Mai,
Yutao Zeng,
Zhoufutu Wen,
Ke Jin,
Baorui Wang,
Weixiao Zhou,
Yunhong Lu,
Tongliang Li,
Wenhao Huang,
Zhoujun Li
Abstract:
The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the importance of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g., common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the factuality of MLLMs in answering short natural language questions. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high-quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 topics. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, facilitating evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.
Submitted 18 February, 2025;
originally announced February 2025.
-
Enhancement of Electric Drive in Silicon Quantum Dots with Electric Quadrupole Spin Resonance
Authors:
Philip Y. Mai,
Pedro H. Pereira,
Lucas Andrade Alonso,
Ross C. C. Leon,
Chih Hwan Yang,
Jason C. C. Hwang,
Daniel Dunmore,
Julien Camirand Lemyre,
Tuomo Tanttu,
Wister Huang,
Kok Wai Chan,
Kuan Yen Tan,
Jesús D. Cifuentes,
Fay E. Hudson,
Kohei M. Itoh,
Arne Laucht,
Michel Pioro-Ladrière,
Christopher C. Escott,
Andrew Dzurak,
Andre Saraiva,
Reinaldo de Melo e Souza,
MengKe Feng
Abstract:
Quantum computation with electron spin qubits requires coherent and efficient manipulation of these spins, typically accomplished through the application of alternating magnetic or electric fields for electron spin resonance (ESR). In particular, electrical driving allows us to apply localized fields on the electrons, which benefits scale-up architectures. However, we have found that Electric Dipole Spin Resonance (EDSR) is insufficient for modeling the Rabi behavior in recent experimental studies. Therefore, we propose that the electron spin is being driven by a new method of electric spin qubit control which generalizes the spin dynamics by taking into account a quadrupolar contribution of the quantum dot: electric quadrupole spin resonance (EQSR). In this work, we explore the electric quadrupole driving of a quantum dot in silicon, specifically examining the cases of 5 and 13 electron occupancies.
Submitted 9 October, 2025; v1 submitted 2 February, 2025;
originally announced February 2025.
-
The MAGPI Survey: radial trends in star formation across different cosmological simulations in comparison with observations at $z \sim$ 0.3
Authors:
Marcie Mun,
Emily Wisnioski,
Katherine E. Harborne,
Claudia D. P. Lagos,
Lucas M. Valenzuela,
Rhea-Silvia Remus,
J. Trevor Mendel,
Andrew J. Battisti,
Sara L. Ellison,
Caroline Foster,
Matias Bravo,
Sarah Brough,
Scott M. Croom,
Tianmu Gao,
Kathryn Grasha,
Anshu Gupta,
Yifan Mai,
Anilkumar Mailvaganam,
Eric G. M. Muller,
Gauri Sharma,
Sarah M. Sweet,
Edward N. Taylor,
Tayyaba Zafar
Abstract:
We investigate the internal and external mechanisms that regulate and quench star formation (SF) in galaxies at $z \sim 0.3$ using MAGPI observations and the EAGLE, Magneticum, and IllustrisTNG cosmological simulations. Using SimSpin to generate mock observations of simulated galaxies, we match detection/resolution limits in star formation rates and stellar mass, along with MAGPI observational details including the average point spread function and pixel scale. While we find a good agreement in the slope of the global star-forming main sequence (SFMS) between MAGPI observations and all three simulations, the slope of the resolved SFMS does not agree within 1 $-$ 2$σ$. Furthermore, in radial SF trends, good agreement between observations and simulations exists only for galaxies far below the SFMS, where we capture evidence for inside-out quenching. The simulations overall agree with each other between $\sim1.5-4 \ R_{\rm e}$ but show varying central suppression within $R \sim 1.5 \ R_{\rm e}$ for galaxies on and below the SFMS, attributable to different AGN feedback prescriptions. All three simulations show similar dependencies of SF radial trends with environment. Central galaxies are subject to both internal and external mechanisms, showing increased SF suppression in the centre with increasing halo mass, indicating AGN feedback. Satellite galaxies display increasing suppression in the outskirts as halo mass increases, indicative of environmental processes. These results demonstrate the power of spatially resolved studies of galaxies; while global properties align, radial profiles reveal discrepancies between observations and simulations and their underlying physics.
Submitted 26 November, 2024;
originally announced November 2024.
-
Image2Struct: Benchmarking Structure Extraction for Vision-Language Models
Authors:
Josselin Somerville Roberts,
Tony Lee,
Chi Heem Wong,
Michihiro Yasunaga,
Yifan Mai,
Percy Liang
Abstract:
We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at https://crfm.stanford.edu/helm/image2struct/v1.0.1/.
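The round-trip evaluation loop can be sketched with a toy renderer (the hash-seeded `render` below is a hypothetical stand-in for actually compiling LaTeX or HTML, and the similarity function is the simplest of the metrics mentioned above):

```python
import hashlib
import numpy as np

def render(structure, shape=(16, 16)):
    """Toy deterministic 'renderer': a hash-seeded image stands in for a
    real LaTeX/HTML rendering step (illustrative only)."""
    seed = int(hashlib.sha256(structure.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).random(shape)

def pixel_similarity(a, b):
    # Simple stand-in for a pixel-similarity metric: 1 - mean |difference|.
    return 1.0 - float(np.abs(a - b).mean())

# Round trip: input image -> VLM predicts structure -> re-render -> compare.
target = render(r"\frac{a}{b} + c")            # image shown to the VLM
exact = pixel_similarity(target, render(r"\frac{a}{b} + c"))
wrong = pixel_similarity(target, render(r"a + \frac{b}{c}"))
print(exact, round(wrong, 3))
```

Because the comparison happens in image space rather than on the structure text, any of the multiple valid structures that render identically receives a perfect score, which is the point of the round-trip design.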
Submitted 29 October, 2024;
originally announced October 2024.
-
Language model developers should report train-test overlap
Authors:
Andy K Zhang,
Kevin Klyman,
Yifan Mai,
Yoav Levine,
Yian Zhang,
Rishi Bommasani,
Percy Liang
Abstract:
Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap, which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 model developers, finding that just 9 developers report train-test overlap: 4 developers release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 developers publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional developers. Overall, we take the position that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap and thereby community-wide trust in model evaluations.
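To make the quantity concrete, the following sketch estimates overlap as the fraction of test documents sharing at least one word-level n-gram with the training corpus (a common contamination heuristic for illustration only, not the methodology of any particular developer):

```python
def ngrams(tokens, n):
    """Set of all contiguous n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def train_test_overlap(train_docs, test_docs, n=3):
    """Fraction of test documents that share at least one word-level
    n-gram with the training corpus."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc.split(), n)
    if not test_docs:
        return 0.0
    contaminated = sum(
        1 for doc in test_docs if ngrams(doc.split(), n) & train_grams
    )
    return contaminated / len(test_docs)

train = ["the cat sat on the mat"]
test = ["cat sat on a log", "dogs bark loudly here"]
assert train_test_overlap(train, test) == 0.5  # only the first doc overlaps
```

Note that only a party with access to the training corpus can run such a check, which is the paper's core point about why developers themselves must report it.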
Submitted 22 July, 2025; v1 submitted 10 October, 2024;
originally announced October 2024.
-
VHELM: A Holistic Evaluation of Vision Language Models
Authors:
Tony Lee,
Haoqin Tu,
Chi Heem Wong,
Wenhao Zheng,
Yiyang Zhou,
Yifan Mai,
Josselin Somerville Roberts,
Michihiro Yasunaga,
Huaxiu Yao,
Cihang Xie,
Percy Liang
Abstract:
Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. Furthermore, they differ in their evaluation procedures and the scope of the evaluation, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs to present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of the VLMs across these important factors. In addition, we standardize the inference parameters, methods of prompting, and evaluation metrics to enable fair comparisons across models. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast. Our initial run evaluates 22 VLMs on 21 existing datasets to provide a holistic snapshot of the models. We uncover new key findings, such as the fact that efficiency-focused models (e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than their full models (e.g., Claude 3 Opus or Gemini 1.5 Pro) on the bias benchmark but not when evaluated on the other aspects. For transparency, we release the raw model generations and complete results on our website (https://crfm.stanford.edu/helm/vhelm/v2.0.1). VHELM is intended to be a living benchmark, and we hope to continue adding new datasets and models over time.
Submitted 24 October, 2024; v1 submitted 9 October, 2024;
originally announced October 2024.
-
The MAGPI Survey: the evolution and drivers of gas turbulence in intermediate-redshift galaxies
Authors:
Yifan Mai,
Scott M. Croom,
Emily Wisnioski,
Sam P. Vaughan,
Mathew R. Varidel,
Andrew J. Battisti,
J. Trevor Mendel,
Marcie Mun,
Takafumi Tsukui,
Caroline Foster,
Katherine E. Harborne,
Claudia D. P. Lagos,
Di Wang,
Sabine Bellstedt,
Joss Bland-Hawthorn,
Matthew Colless,
Francesco D'Eugenio,
Kathryn Grasha,
Yingjie Peng,
Giulia Santucci,
Sarah M. Sweet,
Sabine Thater,
Lucas M. Valenzuela,
Bodo Ziegler
Abstract:
We measure the ionised gas velocity dispersions of star-forming galaxies in the MAGPI survey ($z\sim0.3$) and compare them with galaxies in the SAMI ($z\sim0.05$) and KROSS ($z\sim1$) surveys to investigate how the ionised gas velocity dispersion evolves. For the first time, we use a consistent method that forward models galaxy kinematics from $z=0$ to $z=1$. This method accounts for spatial substructure in emission line flux and beam smearing. We investigate the correlation between gas velocity dispersion and galaxy properties to understand the mechanisms that drive gas turbulence. We find that in both MAGPI and SAMI galaxies, the gas velocity dispersion more strongly correlates with the star-formation rate surface density ($Σ_{\rm SFR}$) than with a variety of other physical properties, and the average gas velocity dispersion is similar, at the same $Σ_{\rm SFR}$, for SAMI, MAGPI and KROSS galaxies. The results indicate that mechanisms related to $Σ_{\rm SFR}$, such as stellar feedback and/or gravitational instability, could be the dominant driver of gas turbulence from $z\sim1$ to $z\sim0$. The gas velocity dispersion of MAGPI galaxies is also correlated with the non-rotational motion of the gas, illustrating that in addition to star-formation feedback, gas transportation and accretion may also contribute to the gas velocity dispersion for galaxies at $z\sim 0.3$. KROSS galaxies only have a moderate correlation between gas velocity dispersion and $Σ_{\rm SFR}$ and a higher scatter of gas velocity dispersion with respect to $Σ_{\rm SFR}$, in agreement with the suggestion that other mechanisms, such as gas transportation and accretion, are relatively more important in galaxies at higher redshift.
Submitted 22 August, 2024;
originally announced August 2024.
-
Towards Better Answers: Automated Stack Overflow Post Updating
Authors:
Yubo Mai,
Zhipeng Gao,
Haoye Wang,
Tingting Bi,
Xing Hu,
Xin Xia,
Jianling Sun
Abstract:
Utilizing code snippets on Stack Overflow (SO) is a common practice among developers for problem-solving. Although SO code snippets serve as valuable resources, it is important to acknowledge their imperfections: reusing problematic code snippets can introduce suboptimal or buggy code into software projects. SO comments often point out weaknesses of a post and provide valuable insights for improving the quality of answers, yet these comments are usually missed and/or ignored, leaving the problematic code snippets untouched. In this work, we investigate the task of automatically updating SO posts based on their associated comments. We introduce a novel framework, named Soup (Stack Overflow Updator for Post), for this task. Soup addresses two key tasks: Valid Comment-Edit Prediction (VCP) and Automatic Post Updating (APU). Extensive experimental results show the promising performance of our model over a set of benchmarks. Moreover, we also performed an in-the-wild evaluation on Stack Overflow: we submitted 50 edits generated by our approach to Stack Overflow posts, and 21 of them were verified and accepted by SO maintainers, further proving the practical value of Soup.
Submitted 17 August, 2024;
originally announced August 2024.
-
AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies
Authors:
Yi Zeng,
Yu Yang,
Andy Zhou,
Jeffrey Ziwei Tan,
Yuheng Tu,
Yifan Mai,
Kevin Klyman,
Minzhou Pan,
Ruoxi Jia,
Dawn Song,
Percy Liang,
Bo Li
Abstract:
Foundation models (FMs) provide societal benefits but also amplify risks. Governments, companies, and researchers have proposed regulatory frameworks, acceptable use policies, and safety benchmarks in response. However, existing public benchmarks often define safety categories based on previous literature, intuitions, or common sense, leading to disjointed sets of categories for risks specified in recent regulations and policies, which makes it challenging to evaluate and compare FMs across these benchmarks. To bridge this gap, we introduce AIR-Bench 2024, the first AI safety benchmark aligned with emerging government regulations and company policies, following the regulation-based safety categories grounded in our AI risks study, AIR 2024. AIR 2024 decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. AIR-Bench 2024 contains 5,694 diverse prompts spanning these categories, with manual curation and human auditing to ensure quality. We evaluate leading language models on AIR-Bench 2024, uncovering insights into their alignment with specified safety concerns. By bridging the gap between public benchmarks and practical AI risks, AIR-Bench 2024 provides a foundation for assessing model safety across jurisdictions, fostering the development of safer and more responsible AI systems.
Submitted 5 August, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
AutoBencher: Towards Declarative Benchmark Construction
Authors:
Xiang Lisa Li,
Farzaan Kaiyom,
Evan Zheran Liu,
Yifan Mai,
Percy Liang,
Tatsunori Hashimoto
Abstract:
We present AutoBencher, a declarative framework for automatic benchmark construction, and use it to scalably discover novel insights and vulnerabilities of existing language models. Concretely, given a few desiderata of benchmarks (e.g., question difficulty, topic salience), we operationalize each desideratum and cast benchmark creation as an optimization problem. Specifically, we experiment with two settings with different optimization objectives: (i) for capability evaluation, we declare the goal of finding a salient, difficult dataset that induces novel performance patterns; (ii) for safety evaluation, we declare the goal of finding a dataset of unsafe prompts that existing LMs fail to decline. To tackle this optimization problem, we use a language model to iteratively propose and refine dataset descriptions, which are then used to generate topic-specific questions and answers. These descriptions are optimized to improve the declared desiderata. We use AutoBencher (powered by GPT-4) to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks. On the novelty front, AutoBencher also helps identify specific gaps not captured by existing benchmarks: e.g., Gemini-Pro has knowledge gaps on Permian Extinction and Fordism while GPT-4o fails to decline harmful requests about cryptocurrency scams.
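The propose-and-refine loop can be sketched as a simple hill climb over candidate dataset descriptions. This is a hypothetical simplification: `make_proposer` replays a fixed list of topics where the real system queries an LM, and the `difficulty` scores are invented for illustration:

```python
def make_proposer(candidates):
    """Replay a fixed list of candidate dataset descriptions; a real
    system would instead ask an LM to refine the current best."""
    it = iter(candidates)
    def propose(current):
        return next(it, current)  # ideas exhausted: keep the current best
    return propose

def optimize_benchmark(propose, score, rounds):
    """Cast benchmark creation as optimization: keep whichever proposed
    description best satisfies the declared desideratum (the score)."""
    best = propose(None)
    best_score = score(best)
    for _ in range(rounds):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Hypothetical desideratum: the error rate a topic induces (difficulty).
difficulty = {"basic arithmetic": 0.10, "Fordism": 0.70,
              "Permian Extinction": 0.90}
propose = make_proposer(list(difficulty))
best, best_score = optimize_benchmark(propose, difficulty.get, rounds=4)
assert best == "Permian Extinction"
```

The scoring function is where the declared desiderata enter; swapping it (e.g., from induced error rate to refusal-failure rate) switches the loop between capability and safety benchmark construction.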
Submitted 28 February, 2025; v1 submitted 11 July, 2024;
originally announced July 2024.
-
Electronic Correlations in Multielectron Silicon Quantum Dots
Authors:
Dylan H. Liang,
MengKe Feng,
Philip Y. Mai,
Jesus D. Cifuentes,
Andrew S. Dzurak,
Andre Saraiva
Abstract:
Silicon quantum computing has the potential to revolutionize technology, with the capability to solve real-life problems that are computationally complex or even intractable for modern computers [1], by offering sufficiently many high-quality qubits to perform complex error-corrected calculations. Silicon metal-oxide-semiconductor based quantum dots present a promising pathway for realizing practical quantum computers. To improve certain qubit properties, it is a common strategy to incorporate multiple electrons in the same dot in order to form qubits in higher confined orbital states. Theoretical modelling is an essential part of understanding the quantum behaviour of these electrons, providing a basis for validating the physical working of device models as well as insights into experimental data.
Hartree-Fock theory is an essential tool for the electronic structure modelling of multi-electron quantum dots, owing to its ability to simulate a large number of electrons with a manageable computational load. However, efficient calculation of the self-consistent field is difficult because dot formations in silicon are characterized by strong electron-electron interactions, conduction-band valleys, and a relatively high effective mass, which combine to create behaviour dominated by repulsion between electrons rather than a well-established shell structure. In this paper, we present a Hartree-Fock-based method that accounts for these complexities in the modelling of silicon quantum dots. With this method, we first establish the significance of including electron-electron interactions and the valley degree of freedom, and their implications. We then explore a simple case of anisotropic dots and observe the impact of anisotropy on dot formation.
Submitted 5 July, 2024;
originally announced July 2024.
-
Are Human Rules Necessary? Generating Reusable APIs with CoT Reasoning and In-Context Learning
Authors:
Yubo Mai,
Zhipeng Gao,
Xing Hu,
Lingfeng Bao,
Yu Liu,
Jianling Sun
Abstract:
Inspired by the great potential of Large Language Models (LLMs) for solving complex coding tasks, in this paper we propose a novel approach, named Code2API, to automatically perform APIzation for Stack Overflow code snippets. Code2API does not require additional model training or any manually crafted rules and can be easily deployed on personal computers without relying on other external tools. Specifically, Code2API guides the LLMs through well-designed prompts to generate well-formed APIs for given code snippets. To elicit knowledge and logical reasoning from LLMs, we use chain-of-thought (CoT) reasoning and few-shot in-context learning, which help the LLMs fully understand the APIzation task and solve it step by step in a manner similar to a developer. Our evaluations show that Code2API achieves remarkable accuracy in identifying method parameters (65%) and return statements (66%) equivalent to human-generated ones, surpassing the current state-of-the-art approach, APIzator, by 15.0% and 16.5% respectively. Moreover, our user study demonstrates that, compared with APIzator, Code2API exhibits superior performance in generating meaningful method names, even surpassing human-level performance, and that developers are more willing to use the APIs generated by our approach, highlighting the practical applicability of our tool. Finally, we successfully extend our framework to a Python dataset, achieving performance comparable to that on Java, which verifies the generalizability of our tool.
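A few-shot CoT prompt of the kind described might be assembled as follows (an illustrative template with hypothetical field names and wording; not Code2API's actual prompts):

```python
def build_apization_prompt(examples, snippet):
    """Assemble a few-shot chain-of-thought prompt asking an LLM to wrap
    a code snippet into a well-formed API method."""
    parts = [
        "Wrap each code snippet into a reusable method. Reason step by "
        "step about the parameters, the return value, and a descriptive "
        "method name before writing the final API."
    ]
    for ex in examples:  # worked examples demonstrating the reasoning
        parts.append(
            f"Snippet:\n{ex['snippet']}\n"
            f"Reasoning:\n{ex['reasoning']}\n"
            f"API:\n{ex['api']}"
        )
    # End on a "Reasoning:" cue so the model reasons before answering.
    parts.append(f"Snippet:\n{snippet}\nReasoning:")
    return "\n\n".join(parts)

example = {
    "snippet": "int x = a + b;",
    "reasoning": "Two int inputs, returns their sum; call it add.",
    "api": "public static int add(int a, int b) { return a + b; }",
}
prompt = build_apization_prompt([example], "s.toUpperCase()")
assert prompt.endswith("Reasoning:") and "s.toUpperCase()" in prompt
```

Ending the prompt on the reasoning cue is what makes this a chain-of-thought template: the model must produce its step-by-step analysis before emitting the API.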
Submitted 6 May, 2024;
originally announced May 2024.
-
Introducing v0.5 of the AI Safety Benchmark from MLCommons
Authors:
Bertie Vidgen,
Adarsh Agrawal,
Ahmed M. Ahmed,
Victor Akinwande,
Namir Al-Nuaimi,
Najla Alfaraj,
Elie Alhajjar,
Lora Aroyo,
Trupti Bavalatti,
Max Bartolo,
Borhane Blili-Hamelin,
Kurt Bollacker,
Rishi Bomassani,
Marisa Ferrara Boston,
Siméon Campos,
Kal Chakra,
Canyu Chen,
Cody Coleman,
Zacharie Delpierre Coudert,
Leon Derczynski,
Debojyoti Dutta,
Ian Eisenberg,
James Ezick,
Heather Frase,
Brian Fuller
, et al. (75 additional authors not shown)
Abstract:
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts (there are 43,090 test items in total, which we created with templates); (4) a grading system for AI systems against the benchmark; (5) an openly available platform and downloadable tool, called ModelBench, that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
Submitted 13 May, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor
Authors:
Xianfu Cheng,
Weixiao Zhou,
Xiang Li,
Jian Yang,
Hang Zhang,
Tao Sun,
Wei Zhang,
Yuying Mai,
Tongliang Li,
Xiaoming Chen,
Zhoujun Li
Abstract:
Scene Text Recognition (STR), an important and challenging upstream task for building structured information databases, involves recognizing text within images of natural scenes. Although current state-of-the-art (SOTA) models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures composed of visual encoders and sequence decoders. In this work, we propose a VIsion Permutable extractor for fast and efficient Scene Text Recognition (SVIPTR), which achieves an impressive balance between high performance and rapid inference speed in the domain of STR. Specifically, SVIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by the permutation and combination of local and global self-attention layers. This design results in a lightweight and efficient model whose inference is insensitive to input length. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of SVIPTR. Notably, the SVIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, SVIPTR-L (Large) attains SOTA accuracy among single-encoder-type models, while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, which greatly benefits real-world applications requiring fast and efficient STR. The code is publicly available at https://github.com/cxfyxl/VIPTR.
Submitted 19 August, 2024; v1 submitted 18 January, 2024;
originally announced January 2024.
-
Holistic Evaluation of Text-To-Image Models
Authors:
Tony Lee,
Michihiro Yasunaga,
Chenlin Meng,
Yifan Mai,
Joon Sung Park,
Agrim Gupta,
Yunzhi Zhang,
Deepak Narayanan,
Hannah Benita Teufel,
Marco Bellagente,
Minguk Kang,
Taesung Park,
Jure Leskovec,
Jun-Yan Zhu,
Li Fei-Fei,
Jiajun Wu,
Stefano Ermon,
Percy Liang
Abstract:
The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase.
Submitted 7 November, 2023;
originally announced November 2023.
-
The SAMI Galaxy Survey: impact of black hole activity on galaxy spin-filament alignments
Authors:
Stefania Barsanti,
Matthew Colless,
Francesco D'Eugenio,
Sree Oh,
Julia J. Bryant,
Sarah Casura,
Scott M. Croom,
Yifan Mai,
Andrei Ristea,
Jesse van de Sande,
Charlotte Welker,
Henry R. M. Zovaro
Abstract:
The activity of central supermassive black holes might affect the alignment of galaxy spin axes with respect to the closest cosmic filaments. We exploit the SAMI Galaxy Survey to study possible relations between black hole activity and the spin-filament alignments of stars and ionised gas separately. To explore the impact of instantaneous black hole activity, active galaxies are selected according to emission-line diagnostics. Central stellar velocity dispersion ($σ_c$) is used as a proxy for black hole mass and its integrated activity. We find evidence for the gas spin-filament alignments to be influenced by AGN, with Seyfert galaxies showing a stronger perpendicular alignment at fixed bulge mass with respect to galaxies where ionisation is a consequence of low-ionisation nuclear emission-line regions (LINERs) or old stellar populations (retired galaxies). On the other hand, the greater perpendicular tendency for the stellar spin-filament alignments of high-bulge-mass galaxies is dominated by retired galaxies. Stellar alignments show a stronger correlation with $σ_c$ compared to the gas alignments. We confirm that bulge mass ($M_{bulge}$) is the primary parameter of correlation for both stellar and gas spin-filament alignments (with no residual dependency left for $σ_c$), while $σ_c$ is the most important property for secular star formation quenching (with no residual dependency left for $M_{bulge}$). These findings indicate that $M_{bulge}$ and $σ_c$ are the most predictive parameters of two different galaxy evolution processes, suggesting that mergers trigger spin-filament alignment flips and integrated black hole activity drives star formation quenching.
Submitted 6 September, 2023;
originally announced September 2023.
-
Detecting a disk bending wave in a barred-spiral galaxy at redshift 4.4
Authors:
Takafumi Tsukui,
Emily Wisnioski,
Joss Bland-Hawthorn,
Yifan Mai,
Satoru Iguchi,
Junichi Baba,
Ken Freeman
Abstract:
The recent discovery of barred spiral galaxies in the early universe ($z>2$) poses questions of how these structures form and how they influence galaxy evolution in the early universe. In this study, we investigate the morphology and kinematics of the far infrared (FIR) continuum and [CII] emission in BRI1335-0417 at $z\approx 4.4$ from ALMA observations. The variations in position angle and ellipticity of the isophotes show the characteristic signature of a barred galaxy. The bar, $3.3^{+0.2}_{-0.2}$ kpc long in radius and bridging the previously identified two-armed spiral, is evident in both [CII] and FIR images, driving the galaxy's rapid evolution by channelling gas towards the nucleus. Fourier analysis of the [CII] velocity field reveals an unambiguous kinematic $m=2$ mode with a line-of-sight velocity amplitude of up to $\sim30-40$ km s$^{-1}$; a plausible explanation is the disk's vertical bending mode triggered by external perturbation, which presumably induced the high star formation rate and the bar/spiral structure. The bar identified in [CII] and FIR images of the gas-rich disk galaxy ($\gtrsim 70$\% of the total mass within radius $R\approx 2.2$ disk scale lengths) suggests a new perspective of early bar formation in high redshift gas-rich galaxies -- a gravitationally unstable gas-rich disk creating a star-forming gaseous bar, rather than a stellar bar emerging from a pre-existing stellar disk. This may explain the prevalent bar-like structures seen in FIR images of high-redshift submillimeter galaxies.
Submitted 7 December, 2023; v1 submitted 28 August, 2023;
originally announced August 2023.
-
Path integral simulation of exchange interactions in CMOS spin qubits
Authors:
Jesús D. Cifuentes,
Philip Y. Mai,
Frédéric Schlattner,
H. Ekmel Ercan,
MengKe Feng,
Christopher C. Escott,
Andrew S. Dzurak,
Andre Saraiva
Abstract:
The boom of semiconductor quantum computing platforms created a demand for computer-aided design and fabrication of quantum devices. Path integral Monte Carlo (PIMC) can have an important role in this effort because it intrinsically integrates strong quantum correlations that often appear in these multi-electron systems. In this paper we present a PIMC algorithm that estimates exchange interactions of three-dimensional electrically defined quantum dots. We apply this model to silicon metal-oxide-semiconductor (MOS) devices and we benchmark our method against well-tested full configuration interaction (FCI) simulations. As an application, we study the impact of a single charge trap on two exchanging dots, opening the possibility of using this code to test the tolerance to disorder of CMOS devices. This algorithm provides an accurate description of this system, setting up an initial step to integrate PIMC algorithms into development of semiconductor quantum computers.
Submitted 3 August, 2023; v1 submitted 7 July, 2023;
originally announced July 2023.
-
Single Diamond Structured Titania Scaffold
Authors:
Chao Wang,
Congcong Cui,
Quanzheng Deng,
Chong Zhang,
Shunsuke Asahina,
Yuanyuan Cao,
Yiyong Mai,
Shunai Che,
Lu Han
Abstract:
The single diamond (SD) network, discovered in beetle and weevil skeletons, is the 'holy grail' of photonic materials, with the widest complete bandgap known to date. However, the thermodynamic instability of SD has long made its self-assembly a formidable challenge. By imitating the simultaneous co-folding process of nonequilibrium skeleton formation in natural organisms, we devised an unprecedented bottom-up approach to fabricate SD networks via the synergistic self-assembly of diblock copolymers and inorganic precursors, successfully obtaining tetrahedrally connected polycrystalline anatase SD frameworks. A photonic band-structure calculation showed that the resulting SD structure has a wide and complete photonic bandgap. This work provides an ingenious design solution to a complex synthetic puzzle and offers new opportunities for biorelevant materials, next-generation optical devices, and beyond.
Submitted 26 July, 2023; v1 submitted 29 June, 2023;
originally announced June 2023.
-
Bounds to electron spin qubit variability for scalable CMOS architectures
Authors:
Jesús D. Cifuentes,
Tuomo Tanttu,
Will Gilbert,
Jonathan Y. Huang,
Ensar Vahapoglu,
Ross C. C. Leon,
Santiago Serrano,
Dennis Otter,
Daniel Dunmore,
Philip Y. Mai,
Frédéric Schlattner,
MengKe Feng,
Kohei Itoh,
Nikolay Abrosimov,
Hans-Joachim Pohl,
Michael Thewalt,
Arne Laucht,
Chih Hwan Yang,
Christopher C. Escott,
Wee Han Lim,
Fay E. Hudson,
Rajib Rahman,
Andrew S. Dzurak,
Andre Saraiva
Abstract:
Spins of electrons in CMOS quantum dots combine exquisite quantum properties and scalable fabrication. In the age of quantum technology, however, the metrics that crowned Si/SiO2 as the microelectronics standard need to be reassessed with respect to their impact upon qubit performance. We chart the spin qubit variability due to the unavoidable atomic-scale roughness of the Si/SiO$_2$ interface, compiling experiments in 12 devices, and developing theoretical tools to analyse these results. Atomistic tight binding and path integral Monte Carlo methods are adapted for describing fluctuations in devices with millions of atoms by directly analysing their wavefunctions and electron paths instead of their energy spectra. We correlate the effect of roughness with the variability in qubit position, deformation, valley splitting, valley phase, spin-orbit coupling and exchange coupling. These variabilities are found to be bounded and lie within the tolerances for scalable architectures for quantum computing as long as robust control methods are incorporated.
Submitted 5 July, 2024; v1 submitted 26 March, 2023;
originally announced March 2023.
-
Holistic Evaluation of Language Models
Authors:
Percy Liang,
Rishi Bommasani,
Tony Lee,
Dimitris Tsipras,
Dilara Soylu,
Michihiro Yasunaga,
Yian Zhang,
Deepak Narayanan,
Yuhuai Wu,
Ananya Kumar,
Benjamin Newman,
Binhang Yuan,
Bobby Yan,
Ce Zhang,
Christian Cosgrove,
Christopher D. Manning,
Christopher Ré,
Diana Acosta-Navas,
Drew A. Hudson,
Eric Zelikman,
Esin Durmus,
Faisal Ladhak,
Frieda Rong,
Hongyu Ren,
Huaxiu Yao
, et al. (25 additional authors not shown)
Abstract:
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall by the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
Submitted 1 October, 2023; v1 submitted 16 November, 2022;
originally announced November 2022.
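When models are scored on heterogeneous metrics across many scenarios, a common way to aggregate into a single comparable number is the mean win rate: the fraction of pairwise head-to-head comparisons a model wins, averaged over scenarios. The sketch below is an illustrative reimplementation with made-up model names and scores, not HELM's actual code.

```python
from itertools import combinations

def mean_win_rate(scores):
    """scores: {model: {scenario: value}}, higher is better.

    Returns each model's win rate against all other models, pooled over the
    scenarios on which both models were evaluated; ties count as half a win.
    """
    models = list(scores)
    scenarios = {s for m in models for s in scores[m]}
    wins = {m: 0.0 for m in models}
    counts = {m: 0 for m in models}
    for s in scenarios:
        for a, b in combinations(models, 2):
            if s not in scores[a] or s not in scores[b]:
                continue  # skip scenarios a model was not evaluated on
            if scores[a][s] == scores[b][s]:
                wins[a] += 0.5
                wins[b] += 0.5
            elif scores[a][s] > scores[b][s]:
                wins[a] += 1.0
            else:
                wins[b] += 1.0
            counts[a] += 1
            counts[b] += 1
    return {m: wins[m] / counts[m] if counts[m] else float("nan") for m in models}

# Hypothetical accuracy scores on three scenarios:
scores = {
    "model_a": {"mmlu": 0.70, "boolq": 0.80, "narrative_qa": 0.60},
    "model_b": {"mmlu": 0.65, "boolq": 0.85, "narrative_qa": 0.55},
}
print(mean_win_rate(scores))  # model_a wins 2 of 3 comparisons -> 2/3
```

Skipping unevaluated scenario-model pairs matters here: before dense benchmarking, sparse coverage makes such pairwise statistics noisy or undefined, which is part of HELM's motivation for standardized conditions.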
-
RIS Design for CRB Optimization in Source Localization with Electromagnetic Interference
Authors:
Yuhua Jiang,
Yuanwan Mai,
Feifei Gao
Abstract:
Reconfigurable Intelligent Surface (RIS) plays an important role in enhancing source localization accuracy. Based on the information inequality of Fisher information analyses, the Cramér-Rao Bound (CRB) of the localization error can be used to evaluate the localization accuracy for a given set of RIS coefficients. In this paper, we adopt the manifold optimization method to derive the optimal RIS coefficients that minimize the CRB of the localization error in the presence of electromagnetic interference (EMI), where the RIS coefficients are restricted to lie on the complex circle manifold. Simulation results are provided to validate the proposed studies under various circumstances.
Submitted 15 April, 2023; v1 submitted 1 October, 2022;
originally announced October 2022.
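Optimizing over the complex circle manifold (unit-modulus RIS coefficients) boils down to two operations: projecting the Euclidean gradient onto the tangent space, and retracting the update back to unit modulus by normalization. The CRB objective itself is problem-specific, so the sketch below uses a simple surrogate quadratic objective purely to illustrate the projection/retraction mechanics; the objective, step size, and dimensions are assumptions.

```python
import numpy as np

def riemannian_descent_circle(grad_f, theta0, lr=0.01, iters=500):
    """Gradient descent on the complex circle manifold {theta : |theta_i| = 1}.

    Each step projects the Euclidean gradient onto the tangent space and
    retracts the update back to unit modulus by elementwise normalization.
    """
    theta = theta0 / np.abs(theta0)
    for _ in range(iters):
        g = grad_f(theta)                              # Euclidean gradient
        rg = g - np.real(g * np.conj(theta)) * theta   # tangent-space projection
        theta = theta - lr * rg
        theta = theta / np.abs(theta)                  # retraction
    return theta

# Illustrative surrogate objective: f(theta) = |a^H theta - t|^2
rng = np.random.default_rng(0)
a = rng.normal(size=8) + 1j * rng.normal(size=8)
t = 2.0
grad = lambda th: a * (a.conj() @ th - t)  # Wirtinger gradient wrt conj(theta)
theta = riemannian_descent_circle(grad, np.ones(8, dtype=complex))
```

For a real CRB objective one would substitute its Wirtinger gradient for `grad`; the unit-modulus constraint is handled entirely by the projection and retraction, so no penalty terms are needed.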
-
The SAMI Galaxy Survey: The relationship between galaxy rotation and the motion of neighbours
Authors:
Yifan Mai,
Sam P. Vaughan,
Scott M. Croom,
Jesse van de Sande,
Stefania Barsanti,
Joss Bland-Hawthorn,
Sarah Brough,
Julia J. Bryant,
Matthew Colless,
Michael Goodwin,
Brent Groves,
Iraklis S. Konstantopoulos,
Jon S. Lawrence,
Nuria P. F. Lorente,
Samuel N. Richards
Abstract:
Using data from the SAMI Galaxy Survey, we investigate the correlation between the projected stellar kinematic spin vector of 1397 SAMI galaxies and the line-of-sight motion of their neighbouring galaxies. We calculate the luminosity-weighted mean velocity difference between SAMI galaxies and their neighbours in the direction perpendicular to the SAMI galaxies' angular momentum axes. The luminosity-weighted mean velocity offset between SAMI galaxies and their neighbours, which indicates coherence between the rotation of the SAMI galaxies and the motion of their neighbours, is 9.0 $\pm$ 5.4 km s$^{-1}$ (1.7 $σ$) for neighbours within 1 Mpc. In a large-scale analysis, we find that the average velocity offsets increase for neighbours out to 2 Mpc. However, the velocities are consistent with zero or negative for neighbours beyond 3 Mpc. The negative signals for neighbours at distances around 10 Mpc are also significant at the $\sim 2$ $σ$ level, which indicates that the positive signals within 2 Mpc might come from the variance of large-scale structure. We also calculate average velocities for different subsamples, including galaxies in different regions of the sky and galaxies with different stellar masses, galaxy types, $λ_{Re}$ and inclinations. Although the low-mass, high-mass, early-type and low-spin subsamples show a 2 - 3 $σ$ signal of coherence for neighbours within 2 Mpc, the results for the different inclination subsamples and the large-scale analysis suggest that the $\sim 2 σ$ signals might result from coincidental scatter or the variance of large-scale structure. Overall, the modest evidence of coherence signals for neighbouring galaxies within 2 Mpc needs to be confirmed by larger observational samples and simulation studies.
Submitted 8 July, 2022;
originally announced July 2022.
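The statistic at the heart of this analysis is a luminosity-weighted mean of neighbour line-of-sight velocity offsets, signed by which side of the galaxy's projected spin axis each neighbour lies on, so that coherent rotation shows up as a positive mean. The sketch below illustrates just that weighted average; the sign convention and the example numbers are assumptions for illustration, not the survey's pipeline.

```python
import numpy as np

def luminosity_weighted_offset(dv, lum, side):
    """Luminosity-weighted mean velocity offset between a galaxy and its
    neighbours.

    dv   : line-of-sight velocity differences (neighbour - galaxy), km/s
    lum  : neighbour luminosities, used as weights
    side : +1 or -1, which side of the galaxy's projected spin axis the
           neighbour lies on (sign convention assumed for illustration)
    """
    dv, lum, side = map(np.asarray, (dv, lum, side))
    return np.sum(lum * side * dv) / np.sum(lum)

# Hypothetical neighbours: coherence with the rotation gives a positive offset.
dv = [120.0, -80.0, 60.0]
lum = [1.0, 2.0, 1.0]
side = [+1, -1, +1]
print(luminosity_weighted_offset(dv, lum, side))  # (120 + 160 + 60) / 4 = 85.0
```

Averaging this quantity over many primary galaxies, as the paper does, is what turns a noisy per-galaxy number into the $\sim 2\,σ$ population-level signal discussed above.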
-
On-demand electrical control of spin qubits
Authors:
Will Gilbert,
Tuomo Tanttu,
Wee Han Lim,
MengKe Feng,
Jonathan Y. Huang,
Jesus D. Cifuentes,
Santiago Serrano,
Philip Y. Mai,
Ross C. C. Leon,
Christopher C. Escott,
Kohei M. Itoh,
Nikolay V. Abrosimov,
Hans-Joachim Pohl,
Michael L. W. Thewalt,
Fay E. Hudson,
Andrea Morello,
Arne Laucht,
Chih Hwan Yang,
Andre Saraiva,
Andrew S. Dzurak
Abstract:
Once called a "classically non-describable two-valuedness" by Pauli, the electron spin is a natural resource for long-lived quantum information since it is mostly impervious to electric fluctuations and can be replicated in large arrays using silicon quantum dots, which offer high-fidelity control. Paradoxically, one of the most convenient control strategies is the integration of nanoscale magnets to artificially enhance the coupling between spins and electric field, which in turn hampers the spin's noise immunity and adds architectural complexity. Here we demonstrate a technique that enables a \emph{switchable} interaction between spins and orbital motion of electrons in silicon quantum dots, without the presence of a micromagnet. The naturally weak effects of the relativistic spin-orbit interaction in silicon are enhanced by more than three orders of magnitude by controlling the energy quantisation of electrons in the nanostructure, enhancing the orbital motion. Fast electrical control is demonstrated in multiple devices and electronic configurations, highlighting the utility of the technique. Using the electrical drive we achieve a coherence time $T_{2,{\rm Hahn}}\approx50 μ$s, fast single-qubit gates with ${T_{π/2}=3}$ ns, and gate fidelities of 99.93% probed by randomised benchmarking. The higher gate speeds and better compatibility with CMOS manufacturing enabled by on-demand electric control improve the prospects for realising scalable silicon quantum processors.
Submitted 18 March, 2022; v1 submitted 17 January, 2022;
originally announced January 2022.
-
XGBoost energy consumption prediction based on multi-system data HVAC
Authors:
Yunlong Li,
Yiming Peng,
Dengzheng Zhang,
Yingan Mai,
Zhengrong Ruan
Abstract:
The energy consumption of the HVAC system accounts for a significant portion of the energy consumption of public building systems, and an efficient energy consumption prediction model can assist in carrying out effective energy-saving retrofits. Unlike traditional energy consumption prediction models, this paper extracts features from large data sets using XGBoost and trains them separately to obtain multiple models, then fuses these with LightGBM's independent prediction results using MAE, infers energy-consumption-related variables, and successfully applies this model to a self-developed Internet of Things platform.
Submitted 20 May, 2021;
originally announced May 2021.
-
Experimental Observation of Strong Exciton Effects in Graphene Nanoribbons
Authors:
Alexander Tries,
Silvio Osella,
Pengfei Zhang,
Fugui Xu,
Mathias Kläui,
Yiyong Mai,
David Beljonne,
Hai I. Wang
Abstract:
Graphene nanoribbons (GNRs) with atomically precise width and edge structures are a promising class of nanomaterials for optoelectronics, thanks to their semiconducting nature and high mobility of charge carriers. Understanding the fundamental static optical properties and ultrafast dynamics of charge carrier generation in GNRs is essential for optoelectronic applications. Combining THz spectroscopy and theoretical calculations, we report a strong exciton effect with binding energy up to 700 meV in liquid-phase-dispersed GNRs with a width of 1.7 nm and an optical bandgap of 1.6 eV, illustrating the intrinsically strong Coulomb interactions between photogenerated electrons and holes. By tracking the exciton dynamics, we reveal an ultrafast formation of excitons in GNRs with a long lifetime over 100 ps. Our results not only reveal fundamental aspects of excitons in GNRs (gigantic binding energy and ultrafast exciton formation etc.), but also highlight promising properties of GNRs for optoelectronic devices.
Submitted 14 April, 2020; v1 submitted 11 November, 2019;
originally announced November 2019.
-
Computational Complexity of Hedonic Games on Sparse Graphs
Authors:
Tesshu Hanaka,
Hironori Kiya,
Yasuhide Maei,
Hirotaka Ono
Abstract:
The additively separable hedonic game (ASHG) is a model of coalition formation games on graphs. In this paper, we intensively and extensively investigate the computational complexity of finding several desirable solutions, such as a Nash stable solution, a maximum utilitarian solution, and a maximum egalitarian solution in ASHGs on sparse graphs, including bounded-degree graphs, bounded-treewidth graphs, and near-planar graphs. For example, we show that finding a maximum egalitarian solution is weakly NP-hard even on graphs of treewidth 2, whereas it can be solved in polynomial time on trees. Moreover, we give a pseudo fixed-parameter algorithm when parameterized by treewidth.
Submitted 22 October, 2019; v1 submitted 30 August, 2019;
originally announced August 2019.
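In an ASHG each agent's utility is the sum of edge weights to its coalition-mates, and a partition is Nash stable if no agent strictly gains by moving to another existing coalition or going alone. A minimal sketch of these definitions, with an assumed small symmetric instance for illustration:

```python
def utility(agent, coalition, w):
    """Agent's utility in an ASHG: sum of edge weights to coalition-mates."""
    return sum(w.get((agent, other), w.get((other, agent), 0))
               for other in coalition if other != agent)

def is_nash_stable(partition, agents, w):
    """Nash stable: no agent strictly gains by joining another existing
    coalition or by deviating to the empty coalition (going alone)."""
    for agent in agents:
        current = next(c for c in partition if agent in c)
        u = utility(agent, current, w)
        options = [frozenset()] + [c for c in partition if c is not current]
        for c in options:
            if utility(agent, c | {agent}, w) > u:
                return False
    return True

# Small symmetric instance: agents 1 and 2 like each other, 2 dislikes 3.
w = {(1, 2): 3, (2, 3): -2, (1, 3): 1}
agents = [1, 2, 3]
print(is_nash_stable([{1, 2}, {3}], agents, w))  # True: nobody gains by moving
```

This brute-force check runs in polynomial time for a given partition; the paper's hardness results concern *finding* stable or optimal partitions, where the number of candidate partitions grows exponentially even on sparse graphs.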