-
The Entire Four-Graviton EFT from the Duality Between Color and Kinematics
Authors:
John Joseph M. Carrasco,
Sai Sasank Chava,
Alex Edison,
Eliseu Kloster,
Suna Zekioğlu
Abstract:
The Bern-Carrasco-Johansson (BCJ) double-copy construction reveals a fundamental structural connection between gauge and gravity theories. At its core, the BCJ double copy follows directly from a duality between the algebraic relations of color and those of kinematics. We generalize this principle beyond the conventional Lie algebra structure of tree-level Yang-Mills theory. By demanding color-kinematics duality for the complete basis of four-point color structures -- including those involving the symmetric $d^{abc}$ constants -- we define the universal double copy. We systematically classify the bases of all such parity-even generalized gauge-theory numerators and, independently, the space of all parity-even four-graviton higher-derivative operators. We demonstrate that our universal double-copy construction precisely spans the entire tower of parity-even four-graviton amplitudes in any dimension, except for the Lovelock $R^3$ contribution in $D > 6$, which we can express in terms of a particularly simple universal triple copy involving gauge theories coupled to scalars. Explicit machine-readable expressions for the complete basis of gauge-theory numerators and fundamental gravitational building blocks are provided in the ancillary files. This establishes that all possible four-point gravitational interactions can be factorized into products of gauge-theory building blocks governed by this universal notion of color-kinematics duality.
Submitted 14 January, 2026;
originally announced January 2026.
-
FinForge: Semi-Synthetic Financial Benchmark Generation
Authors:
Glenn Matlin,
Akhil Theerthala,
Anant Gupta,
Anirudh JM,
Rayan Castilla,
Yi Mei Ng,
Sudheer Chava
Abstract:
Evaluating Language Models (LMs) in specialized, high-stakes domains such as finance remains a significant challenge due to the scarcity of open, high-quality, and domain-specific datasets. Existing general-purpose benchmarks provide broad coverage but lack the depth and domain fidelity needed to assess LMs' capabilities for real-world financial reasoning, which requires both conceptual understanding and quantitative rigor. To address this gap, we introduce FinForge, a scalable, semi-synthetic pipeline for constructing finance-specific evaluation benchmarks through a hybrid of expert-guided data curation and controlled LM-based synthesis. FinForge combines manual and programmatic corpus construction from authoritative financial sources with structured question generation and validation using Gemini 2.5 Flash. To demonstrate the pipeline's efficacy, we produce FinForge-5k, a snapshot benchmark comprising over 5,000 human-validated question-answer pairs across 11 finance subdomains, derived from a curated corpus of 100,000 verified documents totaling 143M tokens. Evaluation of state-of-the-art open-source and closed-source models on FinForge-5k reveals significant differences in financial reasoning, with leading models achieving accuracy levels near 80%. These findings underscore the framework's utility for diagnosing current model limitations and guiding future improvements in financial domain competence. All code and data are available at https://github.com/gtfintechlab/FinForge.
Submitted 19 January, 2026; v1 submitted 10 January, 2026;
originally announced January 2026.
-
Financial Instruction Following Evaluation (FIFE)
Authors:
Glenn Matlin,
Siddharth,
Anirudh JM,
Aditya Shukla,
Yahya Hassan,
Sudheer Chava
Abstract:
Language Models (LMs) struggle with complex, interdependent instructions, particularly in high-stakes domains like finance where precision is critical. We introduce FIFE, a novel, high-difficulty benchmark designed to assess LM instruction-following capabilities for financial analysis tasks. FIFE comprises 88 human-authored prompts and employs a verification system with chainable, verifiable constraints for fine-grained reward signals. We evaluate 53 models (proprietary, open-weight, open-source) in a zero-shot setting. Our key findings reveal a clear performance hierarchy: the top open-weight model (76.1 strict / 79.5 loose) surpasses the leading proprietary system (65.9 strict / 70.5 loose), while the best open-source models lag significantly (45.5 strict / 48.9 loose). However, even top-performing models struggle with FIFE's complex requirements, failing to achieve perfect compliance. We release our dataset and code as an open-source resource to promote research in Reinforcement Learning for the financial domain.
Submitted 30 November, 2025;
originally announced December 2025.
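FIFE's strict versus loose scoring over chainable, verifiable constraints can be sketched as follows. This is a minimal illustration only: the constraint types, their composition, and the all-or-nothing strict rule are assumptions for exposition, not the benchmark's actual implementation.

```python
# Hypothetical sketch of chainable, verifiable constraints with
# strict (all-or-nothing) and loose (fractional) scoring.

def max_words(n):
    return lambda text: len(text.split()) <= n

def must_contain(term):
    return lambda text: term.lower() in text.lower()

def ends_with(suffix):
    return lambda text: text.rstrip().endswith(suffix)

def score(text, constraints):
    """Return (strict, loose): strict requires every constraint to pass,
    loose is the fraction of constraints satisfied."""
    results = [c(text) for c in constraints]
    return all(results), sum(results) / len(results)

constraints = [max_words(12), must_contain("EBITDA"), ends_with(".")]
strict, loose = score("EBITDA rose 4% year over year.", constraints)
# strict -> True, loose -> 1.0
```

Verifiable constraints of this shape yield the fine-grained reward signals the abstract mentions, since partial compliance is measurable.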
-
$D$-Dimensional Modular Assembly of Higher-Derivative Four-Point Contact Amplitudes Involving Fermions
Authors:
John Joseph M. Carrasco,
Sai Sasank Chava,
Alex Edison,
Aslan Seifi
Abstract:
We present a novel robust framework for systematically constructing $D$-dimensional four-point higher-derivative contact amplitudes. Our modular block ("LEGO"-like) approach builds amplitudes directly from manifestly gauge-invariant kinematic blocks, color-weight factors, and scalar Mandelstam polynomials. Symmetries (Bose/Fermi) are imposed algebraically, acting as filters on combinations of compatible pieces. This framework operates entirely in $D$ dimensions, naturally incorporating evanescent operators crucial for loop-level consistency. Scaling to arbitrary mass dimension is achieved in a highly controlled manner using permutation-invariant scalar polynomials, avoiding combinatorial explosion. A key feature is its manifest compatibility with the double-copy program, allowing the systematic generation of operator towers not only for gauge theories but also for gravity and other theories within the double-copy web.
Submitted 6 April, 2026; v1 submitted 7 November, 2025;
originally announced November 2025.
-
Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Authors:
Rongzhi Zhang,
Liqin Ye,
Yuzhao Heng,
Xiang Chen,
Tong Yu,
Lingkai Kong,
Sudheer Chava,
Chao Zhang
Abstract:
Precise attribute intensity control--generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities--is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method's ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available at https://github.com/Pre-Control/pre-control
Submitted 17 February, 2026; v1 submitted 13 October, 2025;
originally announced October 2025.
-
FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
Authors:
Siddhant Sukhani,
Yash Bhardwaj,
Riya Bhadani,
Veer Kejriwal,
Michael Galarnyk,
Sudheer Chava
Abstract:
We evaluate multimodal large language models (MLLMs) for topic-aligned captioning in financial short-form videos (SVs) by testing joint reasoning over transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition. Video alone performs strongly on four of five topics, underscoring its value for capturing visual context and affective cues such as emotions, gestures, and body language. Selective pairs such as TV or AV often surpass TAV, implying that too many modalities may introduce noise. These results establish the first baselines for financial short-form video captioning and illustrate the potential and challenges of grounding complex visual cues in this domain. All code and data can be found on our GitHub under the CC-BY-NC-SA 4.0 license.
Submitted 30 September, 2025;
originally announced September 2025.
-
VideoConviction: A Multimodal Benchmark for Human Conviction and Stock Market Recommendations
Authors:
Michael Galarnyk,
Veer Kejriwal,
Agam Shah,
Yash Bhardwaj,
Nicholas Meyer,
Anand Krishnan,
Sudheer Chava
Abstract:
Social media has amplified the reach of financial influencers known as "finfluencers," who share stock recommendations on platforms like YouTube. Understanding their influence requires analyzing multimodal signals like tone, delivery style, and facial expressions, which extend beyond text-based financial analysis. We introduce VideoConviction, a multimodal dataset with 6,000+ expert annotations, produced through 457 hours of human effort, to benchmark multimodal large language models (MLLMs) and text-based large language models (LLMs) in financial discourse. Our results show that while multimodal inputs improve stock ticker extraction (e.g., extracting Apple's ticker AAPL), both MLLMs and LLMs struggle to distinguish investment actions and conviction--the strength of belief conveyed through confident delivery and detailed reasoning--often misclassifying general commentary as definitive recommendations. While high-conviction recommendations perform better than low-conviction ones, they still underperform the popular S&P 500 index fund. An inverse strategy--betting against finfluencer recommendations--outperforms the S&P 500 by 6.8% in annual returns but carries greater risk (Sharpe ratio of 0.41 vs. 0.65). Our benchmark enables a diverse evaluation of multimodal tasks, comparing model performance on both full video and segmented video inputs. This enables deeper advancements in multimodal financial research. Our code, dataset, and evaluation leaderboard are available under the CC BY-NC 4.0 license.
Submitted 4 June, 2025;
originally announced July 2025.
-
Finance Language Model Evaluation (FLaME)
Authors:
Glenn Matlin,
Mika Okamoto,
Huzaifa Pardawala,
Yang Yang,
Sudheer Chava
Abstract:
Language Models (LMs) have demonstrated impressive capabilities with core Natural Language Processing (NLP) tasks. The effectiveness of LMs for highly specialized knowledge-intensive tasks in finance remains difficult to assess due to major gaps in the methodologies of existing evaluation frameworks, which have led to an erroneously low estimate of LMs' performance on common Finance NLP (FinNLP) tasks. To demonstrate the potential of LMs for these FinNLP tasks, we present the first holistic benchmarking suite for Financial Language Model Evaluation (FLaME). Ours is the first paper to comprehensively study LMs against 'reasoning-reinforced' LMs, with an empirical study of 23 foundation LMs over 20 core NLP tasks in finance. We open-source our framework software along with all data and results.
Submitted 18 June, 2025;
originally announced June 2025.
-
Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement
Authors:
Liqin Ye,
Agam Shah,
Chao Zhang,
Sudheer Chava
Abstract:
The traditional process of creating labeled datasets is labor-intensive and expensive. Recent breakthroughs in open-source large language models (LLMs) have opened up a new avenue in generating labeled datasets automatically for various natural language processing (NLP) tasks, providing an alternative to such an expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from noisy labels, the model's generalization is likely to be harmed, as it is prone to overfitting the label noise. While previous studies on learning from noisy labels mainly focus on synthetic and real-world noise, LLM-generated label noise has received less attention. In this paper, we propose SiDyP: Simplex Label Diffusion with Dynamic Prior to calibrate the classifier's prediction, thus enhancing its robustness towards LLM-generated noisy labels. SiDyP retrieves potential true label candidates by neighborhood label distribution in text embedding space and iteratively refines noisy candidates using a simplex diffusion model. Our framework can increase the performance of the BERT classifier fine-tuned on both zero-shot and few-shot LLM-generated noisy label datasets by an average of 7.21% and 7.30% respectively. We demonstrate the effectiveness of SiDyP by conducting extensive benchmarking for different LLMs over a variety of NLP tasks. Our code is available on GitHub.
Submitted 20 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally
Authors:
Agam Shah,
Siddhant Sukhani,
Huzaifa Pardawala,
Saketh Budideti,
Riya Bhadani,
Rudra Gopal,
Siddhartha Somani,
Rutwik Routu,
Michael Galarnyk,
Soungmin Lee,
Arnav Hiray,
Akshar Ravichandran,
Eric Kim,
Pranav Aluru,
Joshua Zhang,
Sebastian Jaskowski,
Veer Guda,
Meghaj Tarte,
Liqin Ye,
Spencer Gosden,
Rachel Yuh,
Sloka Chava,
Sahasra Chava,
Dylan Patrick Kelly,
Aiden Chiang
, et al. (2 additional authors not shown)
Abstract:
Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank's data, confirming the principle "the whole is greater than the sum of its parts." Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework's economic utility. Our artifacts are accessible through HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.
Submitted 1 November, 2025; v1 submitted 15 May, 2025;
originally announced May 2025.
-
KG-MuLQA: A Framework for KG-based Multi-Level QA Extraction and Long-Context LLM Evaluation
Authors:
Nikita Tatarinov,
Vidhyakshaya Kannan,
Haricharana Srinivasa,
Arnav Raj,
Harpreet Singh Anand,
Varun Singh,
Aditya Luthra,
Ravij Lade,
Agam Shah,
Sudheer Chava
Abstract:
We introduce KG-MuLQA (Knowledge-Graph-based Multi-Level Question-Answer Extraction): a framework that (1) extracts QA pairs at multiple complexity levels (2) along three key dimensions -- multi-hop retrieval, set operations, and answer plurality, (3) by leveraging knowledge-graph-based document representations. This approach enables fine-grained assessment of model performance across controlled difficulty levels. Using this framework, we construct a dataset of 20,139 QA pairs based on financial credit agreements and evaluate 16 proprietary and open-weight Large Language Models, observing that even the best-performing models struggle with set-based comparisons and multi-hop reasoning over long contexts. Our analysis reveals systematic failure modes tied to semantic misinterpretation and an inability to handle implicit relations.
Submitted 9 January, 2026; v1 submitted 18 May, 2025;
originally announced May 2025.
-
Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities
Authors:
Nikita Tatarinov,
Siddhant Sukhani,
Agam Shah,
Sudheer Chava
Abstract:
Recent advances in language modeling have led to a growing number of papers related to finance in top-tier Natural Language Processing (NLP) venues. To systematically examine this trend, we review 374 NLP research papers published between 2017 and 2024 across 38 conferences and workshops, with a focused analysis of 221 papers that directly address finance-related tasks. We evaluate these papers across 11 quantitative and qualitative dimensions, and our study identifies the following opportunities for NLP researchers: (i) expanding the scope of forecasting tasks; (ii) enriching evaluation with financial metrics; (iii) leveraging multilingual and crisis-period datasets; and (iv) balancing PLMs with efficient or interpretable alternatives. We identify actionable directions supported by dataset and tool recommendations, with implications for both academic and industry communities.
Submitted 14 October, 2025; v1 submitted 9 April, 2025;
originally announced April 2025.
-
Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge
Authors:
Agam Shah,
Liqin Ye,
Sebastian Jaskowski,
Wei Xu,
Sudheer Chava
Abstract:
Large Language Models (LLMs) are frequently utilized as sources of knowledge for question-answering. While it is known that LLMs may lack access to real-time data or newer data produced after the model's cutoff date, it is less clear how their knowledge spans across historical information. In this study, we assess the breadth of LLMs' knowledge using financial data of U.S. publicly traded companies by evaluating more than 197k questions and comparing model responses to factual data. We further explore the impact of company characteristics, such as size, retail investment, institutional attention, and readability of financial filings, on the accuracy of knowledge represented in LLMs. Our results reveal that LLMs are less informed about past financial performance, but they display a stronger awareness of larger companies and more recent information. Interestingly, at the same time, our analysis also reveals that LLMs are more likely to hallucinate for larger companies, especially for data from more recent years. The code, prompts, and model outputs are available on GitHub.
Submitted 28 July, 2025; v1 submitted 30 March, 2025;
originally announced April 2025.
-
How Inclusively do LMs Perceive Social and Moral Norms?
Authors:
Michael Galarnyk,
Agam Shah,
Dipanwita Guhathakurta,
Poojitha Nandigam,
Sudheer Chava
Abstract:
This paper discusses and contains offensive content. Language models (LMs) are used in decision-making systems and as interactive assistants. However, how well do the judgements these models make align with the diversity of human values, particularly regarding social and moral norms? In this work, we investigate how inclusively LMs perceive norms across demographic groups (e.g., gender, age, and income). We prompt 11 LMs on rules-of-thumb (RoTs) and compare their outputs with the existing responses of 100 human annotators. We introduce the Absolute Distance Alignment Metric (ADA-Met) to quantify alignment on ordinal questions. We find notable disparities in LM responses, with younger, higher-income groups showing closer alignment, raising concerns about the representation of marginalized perspectives. Our findings highlight the importance of further efforts to make LMs more inclusive of diverse human values. The code and prompts are available on GitHub under the CC BY-NC 4.0 license.
Submitted 16 April, 2025; v1 submitted 4 February, 2025;
originally announced February 2025.
-
SubjECTive-QA: Measuring Subjectivity in Earnings Call Transcripts' QA Through Six-Dimensional Feature Analysis
Authors:
Huzaifa Pardawala,
Siddhant Sukhani,
Agam Shah,
Veer Kejriwal,
Abhishek Pillai,
Rohan Bhasin,
Andrew DiBiasio,
Tarun Mandapati,
Dhruv Adha,
Sudheer Chava
Abstract:
Fact-checking is extensively studied in the context of misinformation and disinformation, addressing objective inaccuracies. However, a softer form of misinformation involves responses that are factually correct but lack certain features such as clarity and relevance. This challenge is prevalent in formal Question-Answer (QA) settings such as press conferences in finance, politics, sports, and other domains, where subjective answers can obscure transparency. Despite this, there is a lack of manually annotated datasets for subjective features across multiple dimensions. To address this gap, we introduce SubjECTive-QA, a human annotated dataset on Earnings Call Transcripts' (ECTs) QA sessions, as the answers given by company representatives are often open to subjective interpretations and scrutiny. The dataset includes 49,446 annotations for long-form QA pairs across six features: Assertive, Cautious, Optimistic, Specific, Clear, and Relevant. These features are carefully selected to encompass the key attributes that reflect the tone of the answers provided during QA sessions across different domains. We find that the best-performing Pre-trained Language Model (PLM), RoBERTa-base, has weighted F1 scores similar to Llama-3-70b-Chat on features with lower subjectivity, such as Relevant and Clear, with a mean difference of 2.17% in their weighted F1 scores. The models perform significantly better on features with higher subjectivity, such as Specific and Assertive, with a mean difference of 10.01% in their weighted F1 scores. Furthermore, testing SubjECTive-QA's generalizability using QAs from White House Press Briefings and Gaggles yields an average weighted F1 score of 65.97% using our best models for each feature, demonstrating broader applicability beyond the financial domain. SubjECTive-QA is publicly available under the CC BY 4.0 license.
Submitted 23 January, 2025; v1 submitted 27 October, 2024;
originally announced October 2024.
-
CoCoHD: Congress Committee Hearing Dataset
Authors:
Arnav Hiray,
Yunsong Liu,
Mingxiao Song,
Agam Shah,
Sudheer Chava
Abstract:
U.S. congressional hearings significantly influence the national economy and social fabric, impacting individual lives. Despite their importance, there is a lack of comprehensive datasets for analyzing these discourses. To address this, we propose the Congress Committee Hearing Dataset (CoCoHD), covering hearings from 1997 to 2024 across 86 committees, with 32,697 records. This dataset enables researchers to study policy language on critical issues like healthcare, LGBTQ+ rights, and climate justice. We demonstrate its potential with a case study on 1,000 energy-related sentences, analyzing the Energy and Commerce Committee's stance on fossil fuel consumption. By fine-tuning pre-trained language models, we create energy-relevant measures for each hearing. Our market analysis shows that natural language analysis using CoCoHD can predict and highlight trends in the energy sector.
Submitted 3 October, 2024;
originally announced October 2024.
-
BERTScoreVisualizer: A Web Tool for Understanding Simplified Text Evaluation with BERTScore
Authors:
Sebastian Jaskowski,
Sahasra Chava,
Agam Shah
Abstract:
The BERTScore metric is commonly used to evaluate automatic text simplification systems. However, current implementations of the metric fail to provide complete visibility into all information the metric can produce. Notably, the specific token matchings can be incredibly useful in generating clause-level insight into the quality of simplified text. We address this by introducing BERTScoreVisualizer, a web application that goes beyond reporting precision, recall, and F1 score and provides a visualization of the matching between tokens. We believe that our software can help improve the analysis of text simplification systems by specifically showing where generated, simplified text deviates from reference text. We host our code and demo on GitHub.
Submitted 10 September, 2024;
originally announced September 2024.
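The token matchings BERTScoreVisualizer surfaces come from BERTScore's greedy matching: each candidate token is paired with its most similar reference token (for precision) and vice versa (for recall). A minimal sketch from a precomputed similarity matrix; the matrix values are illustrative, and real BERTScore uses contextual BERT embeddings with optional IDF weighting:

```python
# Greedy BERTScore-style precision/recall/F1 from a similarity matrix.
# sim[i][j] = similarity between candidate token i and reference token j
# (illustrative values; real BERTScore uses contextual embeddings).

def bertscore_from_sim(sim):
    n_cand, n_ref = len(sim), len(sim[0])
    # precision: each candidate token matched to its best reference token
    precision = sum(max(row) for row in sim) / n_cand
    # recall: each reference token matched to its best candidate token
    recall = sum(max(sim[i][j] for i in range(n_cand))
                 for j in range(n_ref)) / n_ref
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.4],
       [0.1, 0.3, 0.7]]
p, r, f1 = bertscore_from_sim(sim)
```

Recording which (i, j) pair achieves each row/column maximum is exactly the matching information such a visualizer can render, showing where a simplified sentence drifts from its reference.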
-
ConfReady: A RAG based Assistant and Dataset for Conference Checklist Responses
Authors:
Michael Galarnyk,
Rutwik Routu,
Vidhyakshaya Kannan,
Kosha Bheda,
Prasun Banerjee,
Agam Shah,
Sudheer Chava
Abstract:
The ARR Responsible NLP Research checklist website states that the "checklist is designed to encourage best practices for responsible research, addressing issues of research ethics, societal impact and reproducibility." Answering the questions is an opportunity for authors to reflect on their work and make sure any shared scientific assets follow best practices. Ideally, considering a checklist before submission can favorably impact the writing of a research paper. However, previous research has shown that self-reported checklist responses don't always accurately represent papers. In this work, we introduce ConfReady, a retrieval-augmented generation (RAG) application that can be used to empower authors to reflect on their work and assist authors with conference checklists. To evaluate checklist assistants, we curate a dataset of 1,975 ACL checklist responses, analyze problems in human answers, and benchmark RAG and Large Language Model (LLM)-based systems on an evaluation subset. Our code is released under the AGPL-3.0 license on GitHub, with documentation covering the user interface and PyPI package.
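As a rough illustration of the retrieval step in a RAG pipeline of this kind, the sketch below ranks paper chunks against a checklist question and assembles a prompt. Bag-of-words cosine scoring stands in for the dense retriever a real system would use; the chunks and question are illustrative:

```python
from collections import Counter
import math

def bow(text):
    # Crude tokenizer: lowercase whitespace split (no punctuation handling).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=2):
    q = bow(question)
    return sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]

def build_prompt(question, chunks):
    context = "\n".join(retrieve(question, chunks))
    return f"Context:\n{context}\n\nChecklist question: {question}\nAnswer:"

chunks = [
    "We release our code and models under the CC BY 4.0 license.",
    "The dataset covers 1,975 ACL checklist responses.",
]
```

The retrieved chunks plus the question then go to the language model, which drafts a grounded checklist answer for the author to review.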
Submitted 19 September, 2025; v1 submitted 7 August, 2024;
originally announced August 2024.
-
Numerical Claim Detection in Finance: A New Financial Dataset, Weak-Supervision Model, and Market Analysis
Authors:
Agam Shah,
Arnav Hiray,
Pratvi Shah,
Arkaprabha Banerjee,
Anushka Singh,
Dheeraj Eidnani,
Sahasra Chava,
Bhaskar Chaudhury,
Sudheer Chava
Abstract:
In this paper, we investigate the influence of claims in analyst reports and earnings calls on financial market returns, considering them as significant quarterly events for publicly traded companies. To facilitate a comprehensive analysis, we construct a new financial dataset for the claim detection task in the financial domain. We benchmark various language models on this dataset and propose a novel weak-supervision model that incorporates the knowledge of subject matter experts (SMEs) in the aggregation function, outperforming existing approaches. We also demonstrate the practical utility of our proposed model by constructing a novel measure of optimism. Here, we observe the dependence of earnings surprise and return on our optimism measure. Our dataset, models, and code are publicly (under CC BY 4.0 license) available on GitHub.
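The weak-supervision idea — several heuristic labeling functions vote on each sentence, with subject-matter-expert trust weighting the aggregate — can be sketched as follows. The voting rule and weights here are illustrative assumptions, not the paper's actual aggregation function:

```python
def aggregate(votes, sme_weights):
    """votes: per-labeler outputs in {None (abstain), 0, 1};
    sme_weights: per-labeler trust in [0, 1], set by subject-matter experts."""
    score = sum(w * (1 if v == 1 else -1)
                for v, w in zip(votes, sme_weights) if v is not None)
    return 1 if score > 0 else 0

# Three heuristics vote on whether a sentence contains a numerical claim;
# the first two (trusted more by SMEs) say yes, the third says no.
label = aggregate([1, 1, 0], [0.9, 0.6, 0.3])  # score 0.9 + 0.6 - 0.3 > 0
```

The aggregated labels can then train a downstream classifier without hand-annotating every sentence.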
Submitted 4 October, 2024; v1 submitted 18 February, 2024;
originally announced February 2024.
-
A Search for Technosignatures Around 11,680 Stars with the Green Bank Telescope at 1.15-1.73 GHz
Authors:
Jean-Luc Margot,
Megan G. Li,
Pavlo Pinchuk,
Nathan Myhrvold,
Larry Lesyna,
Lea E. Alcantara,
Megan T. Andrakin,
Jeth Arunseangroj,
Damien S. Baclet,
Madison H. Belk,
Zerxes R. Bhadha,
Nicholas W. Brandis,
Robert E. Carey,
Harrison P. Cassar,
Sai S. Chava,
Calvin Chen,
James Chen,
Kellen T. Cheng,
Alessia Cimbri,
Benjamin Cloutier,
Jordan A. Combitsis,
Kelly L. Couvrette,
Brandon P. Coy,
Kyle W. Davis,
Antoine F. Delcayre
, et al. (56 additional authors not shown)
Abstract:
We conducted a search for narrowband radio signals over four observing sessions in 2020-2023 with the L-band receiver (1.15-1.73 GHz) of the 100 m diameter Green Bank Telescope. We pointed the telescope in the directions of 62 TESS Objects of Interest, capturing radio emissions from a total of ~11,680 stars and planetary systems in the ~9 arcminute beam of the telescope. All detections were either automatically rejected or visually inspected and confirmed to be of anthropogenic nature. In this work, we also quantified the end-to-end efficiency of radio SETI pipelines with a signal injection and recovery analysis. The UCLA SETI pipeline recovers 94.0% of the injected signals over the usable frequency range of the receiver and 98.7% of the injections when regions of dense RFI are excluded. In another pipeline that uses incoherent sums of 51 consecutive spectra, the recovery rate is ~15 times smaller at ~6%. The pipeline efficiency affects calculations of transmitter prevalence and SETI search volume. Accordingly, we developed an improved Drake Figure of Merit and a formalism to place upper limits on transmitter prevalence that take the pipeline efficiency and transmitter duty cycle into account. Based on our observations, we can state at the 95% confidence level that fewer than 6.6% of stars within 100 pc host a transmitter that is detectable in our search (EIRP > 1e13 W). For stars within 20,000 ly, the fraction of stars with detectable transmitters (EIRP > 5e16 W) is at most 3e-4. Finally, we showed that the UCLA SETI pipeline natively detects the signals detected with AI techniques by Ma et al. (2023).
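The role pipeline efficiency plays in the prevalence limits can be illustrated with the standard zero-detection binomial bound: with N stars observed, recovery efficiency eff, and no confirmed detections, requiring (1 - p·eff)^N ≤ 1 - CL bounds the transmitter prevalence p. This one-liner is a deliberate simplification — the paper's quoted limits also fold in distance cuts, EIRP thresholds, and transmitter duty cycle:

```python
def prevalence_upper_limit(n_stars, eff=0.94, cl=0.95):
    # Solve (1 - p * eff)**n_stars = 1 - cl for p; for small p this is
    # close to the Poisson limit -ln(1 - cl) / (n_stars * eff).
    return (1.0 - (1.0 - cl) ** (1.0 / n_stars)) / eff

# ~11,680 stars at the quoted 94.0% pipeline efficiency.
limit = prevalence_upper_limit(11680, eff=0.94)
```

The bound scales as 1/eff, which is why an inefficient pipeline (e.g. the ~6% recovery rate quoted for the incoherent-sum pipeline) weakens prevalence limits by the same factor.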
Submitted 15 October, 2023; v1 submitted 4 August, 2023;
originally announced August 2023.
-
Shifting Cryptocurrency Influence: A High-Resolution Network Analysis of Market Leaders
Authors:
Arnav Hiray,
Pratvi Shah,
Vishwa Shah,
Agam Shah,
Sudheer Chava,
Mukesh Tiwari
Abstract:
Over the last decade, the cryptocurrency market has experienced unprecedented growth, emerging as a prominent financial market. As this market rapidly evolves, it necessitates re-evaluating which cryptocurrencies command the market and steer the direction of blockchain technology. We implement a network-based cryptocurrency market analysis to investigate this changing landscape. We use novel hourly-resolution data and Kendall's Tau correlation to explore the interconnectedness of the cryptocurrency market. We observe critical differences in the hierarchy of cryptocurrencies determined by our method compared to rankings derived from daily data and Pearson's correlation. This divergence emphasizes the potential information loss stemming from daily data aggregation and highlights the limitations of Pearson's correlation. Our findings show that in the early stages of this growth, Bitcoin held a leading role. However, during the 2021 bull run, the landscape changed drastically. We see that while Ethereum has emerged as the overall leader, it was FTT and its associated exchange, FTX, that largely drove the surge at the beginning of the bull run. We also find that highly influential cryptocurrencies are increasingly gaining a commanding influence over the market as time progresses, despite the growing number of cryptocurrencies making up the market.
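The pairwise building block of such a network analysis is Kendall's tau over hourly return series. A sketch of the tau-a variant on toy data (ties count only toward the denominator; library implementations typically use tau-b, which corrects for ties):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs

# Toy hourly returns for two coins (illustrative numbers).
btc = [0.01, -0.02, 0.03, 0.00, 0.02]
eth = [0.02, -0.01, 0.02, -0.01, 0.03]
tau = kendall_tau(btc, eth)
```

Computing tau for every pair of coins yields the correlation matrix from which the market's influence network is built; being rank-based, tau is robust to the heavy-tailed outliers that distort Pearson's correlation.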
Submitted 30 January, 2024; v1 submitted 31 July, 2023;
originally announced July 2023.
-
Abnormal Trading Detection in the NFT Market
Authors:
Mingxiao Song,
Yunsong Liu,
Agam Shah,
Sudheer Chava
Abstract:
The Non-Fungible-Token (NFT) market has experienced explosive growth in recent years. According to DappRadar, the total transaction volume on OpenSea, the largest NFT marketplace, reached 34.7 billion dollars in February 2023. However, the NFT market is mostly unregulated and there are significant concerns about money laundering, fraud and wash trading. The lack of industry-wide regulations, and the fact that amateur traders and retail investors comprise a significant fraction of the NFT market, make this market particularly vulnerable to fraudulent activities. Therefore, it is essential to investigate and highlight the relevant risks involved in NFT trading. In this paper, we attempt to uncover common fraudulent behaviors such as wash trading that could mislead other traders. Using market data, we design quantitative features from the network, monetary, and temporal perspectives and feed them into the K-means unsupervised clustering algorithm to sort traders into groups. Lastly, we discuss the clustering results' significance and how regulations can reduce undesired behaviors. Our work can potentially help regulators narrow down their search space for bad actors in the market as well as provide insights for amateur traders to protect themselves from unforeseen frauds.
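The clustering step can be sketched with a bare-bones K-means over hand-made trader features. The two feature axes and the numbers below are illustrative stand-ins, not the paper's actual network, monetary, and temporal features:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Recompute centers as cluster means (keep old center if empty).
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# (self-trade ratio, avg holding hours): suspected wash traders churn
# assets among their own wallets quickly; collectors hold for weeks.
traders = [(0.90, 1), (0.80, 2), (0.85, 1.5), (0.05, 200), (0.10, 180)]
centers, clusters = kmeans(traders, k=2)
```

On this toy data the algorithm separates the three fast-churning, high-self-trade wallets from the two long-horizon collectors, mirroring how clustering can flag candidate wash traders for closer inspection.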
Submitted 2 August, 2023; v1 submitted 25 May, 2023;
originally announced June 2023.
-
Zero is Not Hero Yet: Benchmarking Zero-Shot Performance of LLMs for Financial Tasks
Authors:
Agam Shah,
Sudheer Chava
Abstract:
Recently, large language models (LLMs) like ChatGPT have shown impressive zero-shot performance on many natural language processing tasks. In this paper, we investigate the effectiveness of zero-shot LLMs in the financial domain. We compare the performance of ChatGPT along with some open-source generative LLMs in zero-shot mode with RoBERTa fine-tuned on annotated data. We address three inter-related research questions on data annotation, performance gaps, and the feasibility of employing generative models in the finance domain. Our findings demonstrate that ChatGPT performs well even without labeled data but fine-tuned models generally outperform it. Our research also highlights how annotating with generative models can be time-intensive. Our codebase is publicly available on GitHub under CC BY-NC 4.0 license.
Submitted 26 May, 2023;
originally announced May 2023.
-
Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis
Authors:
Agam Shah,
Suvan Paturi,
Sudheer Chava
Abstract:
Monetary policy pronouncements by Federal Open Market Committee (FOMC) are a major driver of financial market returns. We construct the largest tokenized and annotated dataset of FOMC speeches, meeting minutes, and press conference transcripts in order to understand how monetary policy influences financial markets. In this study, we develop a novel task of hawkish-dovish classification and benchmark various pre-trained language models on the proposed dataset. Using the best-performing model (RoBERTa-large), we construct a measure of monetary policy stance for the FOMC document release days. To evaluate the constructed measure, we study its impact on the treasury market, stock market, and macroeconomic indicators. Our dataset, models, and code are publicly available on Huggingface and GitHub under CC BY-NC 4.0 license.
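One plausible way to turn per-sentence hawkish/dovish classifications into a document-level stance measure is a normalized label difference; the paper's exact aggregation is not reproduced here, so treat this as an assumption:

```python
def stance_measure(labels):
    """labels: per-sentence classifier outputs,
    each 'hawkish', 'dovish', or 'neutral'.
    Returns a score in [-1, 1]: +1 fully hawkish, -1 fully dovish."""
    hawkish = labels.count("hawkish")
    dovish = labels.count("dovish")
    total = len(labels)
    return (hawkish - dovish) / total if total else 0.0

# A four-sentence FOMC excerpt leaning hawkish.
print(stance_measure(["hawkish", "hawkish", "dovish", "neutral"]))  # -> 0.25
```

A time series of such scores over FOMC release days is what gets regressed against treasury yields, stock returns, and macro indicators.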
Submitted 13 May, 2023;
originally announced May 2023.
-
The Universal NFT Vector Database: A Scaleable Vector Database for NFT Similarity Matching
Authors:
Samrat Sahoo,
Nitin Paul,
Agam Shah,
Andrew Hornback,
Sudheer Chava
Abstract:
Non-Fungible Tokens (NFTs) are a type of digital asset that represents a proof of ownership over a particular digital item such as art, music, or real estate. Due to the non-fungible nature of NFTs, duplicate tokens should not possess the same value. However, with the surge of new blockchains and a massive influx of NFTs being created, a wealth of NFT data is being generated without a method of tracking similarity. This enables people to create almost identical NFTs by changing one pixel or one byte of data. Despite the similarity among NFTs, each NFT is assigned a completely different token ID. To address the NFT duplication issue, we developed a modular, easily-extendable, hardware-agnostic, cloud-centered NFT processing system that represents NFTs as vectors. We established a database containing a vector representation of the NFTs in accordance with the Ethereum Request for Comment 721 (ERC-721) token standards to initiate the process of aggregating NFT data from various blockchains. Finally, we developed an NFT visualization dashboard application with a user-friendly graphical user interface (GUI) to provide non-technical users access to the aggregated NFT data. The Universal NFT Vector Database is an off-chain framework for NFT data aggregation based on similarity, which provides an organized way to query and analyze NFT data that was previously unavailable through on-chain solutions.
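The similarity-matching premise — near-duplicate NFTs land next to each other in vector space despite having entirely different token IDs — reduces to nearest-neighbor search under a similarity metric such as cosine. A toy sketch, where the short vectors stand in for learned image embeddings:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings: a one-pixel edit barely moves the vector,
# while a genuinely different artwork lands far away.
original = [0.12, 0.88, 0.45, 0.31]
one_pixel_edit = [0.12, 0.88, 0.46, 0.31]
unrelated = [0.91, 0.02, 0.10, 0.77]

assert cosine(original, one_pixel_edit) > 0.999
assert cosine(original, unrelated) < 0.9
```

A vector database indexes these embeddings so that a new mint can be checked against existing tokens in one nearest-neighbor query, which is how the duplication problem becomes tractable off-chain.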
Submitted 22 March, 2023;
originally announced March 2023.
-
FiNER-ORD: Financial Named Entity Recognition Open Research Dataset
Authors:
Agam Shah,
Abhinav Gullapalli,
Ruchit Vithani,
Michael Galarnyk,
Sudheer Chava
Abstract:
Over the last two decades, the development of the CoNLL-2003 named entity recognition (NER) dataset has helped enhance the capabilities of deep learning and natural language processing (NLP). The finance domain, characterized by its unique semantic and lexical variations for the same entities, presents specific challenges to the NER task; thus, a domain-specific customized dataset is crucial for advancing research in this field. In our work, we develop the first high-quality English Financial NER Open Research Dataset (FiNER-ORD). We benchmark multiple pre-trained language models (PLMs) and large language models (LLMs) on FiNER-ORD. We believe our proposed dataset will open future opportunities to use FiNER-ORD as a benchmark for financial domain-specific NER and NLP tasks. Our dataset, models, and code are publicly available on GitHub and Hugging Face under CC BY-NC 4.0 license.
Submitted 6 September, 2024; v1 submitted 22 February, 2023;
originally announced February 2023.
-
Benchmarking Machine Learning Models to Predict Corporate Bankruptcy
Authors:
Emmanuel Alanis,
Sudheer Chava,
Agam Shah
Abstract:
Using a comprehensive sample of 2,585 bankruptcies from 1990 to 2019, we benchmark the performance of various machine learning models in predicting financial distress of publicly traded U.S. firms. We find that gradient boosted trees outperform other models in one-year-ahead forecasts. Variable permutation tests show that excess stock returns, idiosyncratic risk, and relative size are the most important variables for predictions. Textual features derived from corporate filings do not improve performance materially. In a credit competition model that accounts for the asymmetric cost of default misclassification, the survival random forest is able to capture large dollar profits.
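The gradient-boosting idea can be sketched in pure Python with depth-1 stumps on a single feature (excess stock return, one of the variables the abstract flags as important). A real benchmark would use a library such as XGBoost or LightGBM on the full feature set; the data below is made up:

```python
def fit_stump(x, residuals):
    """Best single-threshold split minimizing squared error on residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lmean if xi <= t else rmean)) ** 2
                  for xi, r in zip(x, residuals))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def boost(x, y, rounds=10, lr=0.5):
    """Each round fits a stump to the current residuals."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * s(xi) for s in stumps)

# Toy sample: firms with low excess returns default (label 1).
x = [-0.6, -0.4, -0.5, 0.2, 0.3, 0.5]
y = [1, 1, 1, 0, 0, 0]
model = boost(x, y)
```

Each stump corrects the residual errors of the ensemble so far, which is the mechanism that lets boosted trees capture the nonlinear feature interactions behind their strong one-year-ahead forecasts.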
Submitted 22 December, 2022;
originally announced December 2022.
-
WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain
Authors:
Raj Sanjay Shah,
Kunal Chawla,
Dheeraj Eidnani,
Agam Shah,
Wendi Du,
Sudheer Chava,
Natraj Raman,
Charese Smiley,
Jiaao Chen,
Diyi Yang
Abstract:
Pre-trained language models have shown impressive performance on a variety of tasks and domains. Previous research on financial language models usually employs a generic training scheme to train standard model architectures, without completely leveraging the richness of the financial data. We propose a novel domain-specific Financial LANGuage model (FLANG) which uses financial keywords and phrases for better masking, together with span boundary objective and in-filling objective. Additionally, the evaluation benchmarks in the field have been limited. To this end, we contribute the Financial Language Understanding Evaluation (FLUE), an open-source comprehensive suite of benchmarks for the financial domain. These include new benchmarks across 5 NLP tasks in the financial domain as well as common benchmarks used in the previous research. Experiments on these benchmarks suggest that our model outperforms those in prior literature on a variety of NLP tasks. Our models, code and benchmark data are publicly available on Github and Huggingface.
Submitted 31 October, 2022;
originally announced November 2022.
-
Cryptocurrency Bubble Detection: A New Stock Market Dataset, Financial Task & Hyperbolic Models
Authors:
Ramit Sawhney,
Shivam Agarwal,
Vivek Mittal,
Paolo Rosso,
Vikram Nanda,
Sudheer Chava
Abstract:
The rapid spread of information over social media influences quantitative trading and investments. The growing popularity of speculative trading of highly volatile assets such as cryptocurrencies and meme stocks presents a fresh challenge in the financial realm. Investigating such "bubbles" - periods of sudden anomalous market behavior - is critical to better understanding investor behavior and market dynamics. However, high volatility, coupled with massive volumes of chaotic social media texts, especially for underexplored assets like cryptocoins, poses a challenge to existing methods. Taking the first step towards NLP for cryptocoins, we present and publicly release CryptoBubbles, a novel multi-span identification task for bubble detection, and a dataset of more than 400 cryptocoins from 9 exchanges over five years, spanning more than two million tweets. Further, we develop a set of sequence-to-sequence hyperbolic models suited to this multi-span identification task based on the power-law dynamics of cryptocurrencies and user behavior on social media. We further test the effectiveness of our models under zero-shot settings on a test set of Reddit posts pertaining to 29 "meme stocks", which see an increase in trade volume due to social media hype. Through quantitative, qualitative, and zero-shot analyses on Reddit and Twitter spanning cryptocoins and meme-stocks, we show the practical applicability of CryptoBubbles and hyperbolic models.
Submitted 11 May, 2022;
originally announced June 2022.
-
Telechain: Bridging Telecom Policy and Blockchain Practice
Authors:
Sudheesh Singanamalla,
Apurv Mehra,
Nishanth Chandran,
Himanshi Lohchab,
Seshanuradha Chava,
Asit Kadayan,
Sunil Bajpai,
Kurtis Heimerl,
Richard Anderson,
Satya Lokam
Abstract:
The use of blockchain in regulatory ecosystems is a promising approach to address challenges of compliance among mutually untrusted entities. In this work, we consider applications of blockchain technologies in telecom regulations. In particular, we address growing concerns around Unsolicited Commercial Communication (UCC aka. spam) sent through text messages (SMS) and phone calls in India. Despite several regulatory measures taken to curb the menace of spam, it continues to be a nuisance to subscribers while posing challenges to telecom operators and regulators alike.
In this paper, we present a consortium blockchain-based architecture to address the problem of UCC in India. Our solution improves subscriber experiences and the efficiency of regulatory processes while also positively impacting all stakeholders in the telecom ecosystem. Unlike previous approaches to the problem of UCC, which are all ex-post, our approach to adherence to the regulations is ex-ante. The proposal described in this paper is a primary contributor to the revision of regulations concerning UCC and spam by the Telecom Regulatory Authority of India (TRAI). The new regulations published in July 2018 were the first of their kind in the world and amended the 2010 Telecom Commercial Communication Customer Preference Regulation (TCCCPR) by mandating the use of a blockchain/distributed ledger in addressing the UCC problem. In this paper, we provide a holistic account of the project's evolution from (1) its design and strategy, to (2) regulatory and policy action, (3) country-wide implementation and deployment, and (4) evaluation and impact of the work.
Submitted 24 May, 2022;
originally announced May 2022.
-
A Security Protocol for Multi-User Authentication
Authors:
Srikanth Chava
Abstract:
In this note we propose an encryption communication protocol that also provides database security. For the encryption of the data communication, we use a transformation similar to the Cubic Public-key transformation. This method represents a many-to-one mapping, which increases the complexity of any brute-force attack. We also include some properties of the transformation that are fundamental to the authentication protocol.
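The many-to-one property of a cubic transformation is easy to verify directly: when p ≡ 1 (mod 3), cubing is exactly 3-to-1 on the nonzero residues mod p, so an attacker inverting the map faces a threefold ambiguity at every step. A quick check (the modulus below is a toy value, far smaller than anything cryptographic):

```python
def cube_preimages(p):
    """Map each cubic residue mod p to its list of preimages."""
    images = {}
    for x in range(1, p):
        images.setdefault(pow(x, 3, p), []).append(x)
    return images

p = 13  # 13 ≡ 1 (mod 3), so the cube map is 3-to-1 on nonzero residues
images = cube_preimages(p)
assert all(len(v) == 3 for v in images.values())
assert len(images) == (p - 1) // 3
```

This follows from the multiplicative group mod p being cyclic of order p - 1: when 3 divides p - 1, the cube map's kernel has exactly three elements, giving each image three preimages.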
Submitted 11 April, 2008;
originally announced April 2008.