
Showing 1–10 of 10 results for author: Gonugondla, S

Searching in archive cs.
  1. arXiv:2511.10507  [pdf, ps, other]

    cs.CL

    AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

    Authors: Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Xiaoliang Peng, Beibin Li, Shengjie Bi, Shishir G. Patil, Qi Qi, Shengyu Feng, Julian Katz-Samuels, Richard Yuanzhe Pang, Sujan Gonugondla, Hunter Lang, Yue Yu, Yundi Qian, Maryam Fazel-Zarandi, Licheng Yu, Amine Benhalloum, Hany Awadalla, Manaal Faruqui

    Abstract: Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpr…

    Submitted 26 November, 2025; v1 submitted 13 November, 2025; originally announced November 2025.

  2. arXiv:2411.03786  [pdf, other]

    cs.LG

    The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

    Authors: Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto

    Abstract: Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model. In this work, we explore the effectiveness of learning-free, negligible-cost draft strategies, namely $N$-grams obtained from the model weights and the context. While the predicted next token of the base model is rarely the top prediction of the…
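The learning-free drafting the abstract describes can be illustrated with a toy sketch. This is a hypothetical simplification (the function name and lookup scheme are illustrative, not the paper's implementation): draft tokens are proposed by matching the most recent (N−1)-gram against an earlier occurrence in the context, and would then be verified in parallel by the base model.

```python
def ngram_draft(context, n=3, k=4):
    """Propose up to k draft tokens by matching the last (n-1) tokens
    of the context against an earlier occurrence of the same n-gram
    prefix, returning its historical continuation as the draft."""
    if len(context) < n:
        return []
    key = tuple(context[-(n - 1):])
    # Scan backwards for a previous occurrence of the (n-1)-gram key,
    # excluding the trailing occurrence that forms the key itself.
    for i in range(len(context) - n, -1, -1):
        if tuple(context[i:i + n - 1]) == key:
            return context[i + n - 1:i + n - 1 + k]
    return []
```

A base model would then score the context plus draft in one forward pass and accept the longest matching prefix, falling back to normal decoding when no draft is found.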

    Submitted 6 November, 2024; originally announced November 2024.

    Journal ref: ENLSP-IV 2024 - 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, Dec 2024, Vancouver, Canada

  3. arXiv:2410.01103  [pdf, ps, other]

    cs.CL cs.AI

    Approximately Aligned Decoding

    Authors: Daniel Melcer, Sujan Gonugondla, Pramuditha Perera, Haifeng Qian, Wen-Hao Chiang, Yanjun Wang, Nihal Jain, Pranav Garg, Xiaofei Ma, Anoop Deoras

    Abstract: It is common to reject undesired outputs of Large Language Models (LLMs); however, current methods to do so require an excessive amount of computation to re-sample after a rejection, or distort the distribution of outputs by constraining the output to highly improbable tokens. We present a method, Approximately Aligned Decoding (AprAD), to balance the distortion of the output distribution with com…

    Submitted 7 October, 2025; v1 submitted 1 October, 2024; originally announced October 2024.

    Comments: NeurIPS 2025 version; 10 pages, 35 total

  4. arXiv:2404.15778  [pdf, other]

    cs.LG cs.CL

    BASS: Batched Attention-optimized Speculative Sampling

    Authors: Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras

    Abstract: Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges.…
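One source of the non-triviality the abstract mentions is that, within a batch, each sequence may accept a different number of draft tokens per step, producing ragged shapes. A minimal sketch of the per-sequence acceptance bookkeeping (illustrative only, not the BASS algorithm):

```python
def accept_lengths(draft, verified):
    """For each sequence in the batch, count the accepted draft tokens:
    the length of the longest prefix on which the draft tokens agree
    with the base model's verified tokens."""
    out = []
    for d, v in zip(draft, verified):
        n = 0
        for a, b in zip(d, v):
            if a != b:
                break
            n += 1
        out.append(n)
    return out
```

After one verification pass, each batch row advances by a different amount, which is why batched speculative decoding needs attention kernels and KV-cache handling that tolerate ragged sequence lengths.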

    Submitted 26 June, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

  5. arXiv:2403.08845  [pdf, other]

    cs.LG cs.AI

    Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs

    Authors: Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang

    Abstract: This study introduces bifurcated attention, a method designed to enhance language model inference in shared-context batch decoding scenarios. Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths. Bifurcated attention achieves this by strategically dividing the attention mechanism during increme…
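The core idea, splitting incremental attention into one product over the shared-prefix KV cache and one over each sample's own suffix KV, can be sketched in NumPy. This is an illustrative single-head sketch under assumed shapes, not the paper's kernel; it shows that the split is mathematically identical to attention over the concatenated cache while the prefix keys/values are stored and read only once for the whole batch:

```python
import numpy as np

def bifurcated_attention(q, k_pre, v_pre, k_suf, v_suf):
    """q: (B, d) decode-step queries.
    k_pre, v_pre: (P, d) shared-prefix KV, stored once for the batch.
    k_suf, v_suf: (B, S, d) per-sample suffix KV."""
    d = q.shape[-1]
    s_pre = q @ k_pre.T / np.sqrt(d)                          # (B, P)
    s_suf = np.einsum("bd,bsd->bs", q, k_suf) / np.sqrt(d)    # (B, S)
    s = np.concatenate([s_pre, s_suf], axis=-1)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    P = k_pre.shape[0]
    # Combine the two weighted sums of values.
    return w[:, :P] @ v_pre + np.einsum("bs,bsd->bd", w[:, P:], v_suf)
```

In a real serving stack the saving comes from memory IO: the prefix KV is loaded from HBM once per step instead of once per batch element.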

    Submitted 11 July, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  6. arXiv:2403.08688  [pdf, other]

    cs.CL cs.AI

    Token Alignment via Character Matching for Subword Completion

    Authors: Ben Athiwaratkun, Shiqi Wang, Mingyue Shang, Yuchen Tian, Zijian Wang, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Rob Kwiatowski, Ramesh Nallapati, Bing Xiang

    Abstract: Generative models, widely utilized in various applications, can often struggle with prompts corresponding to partial tokens. This struggle stems from tokenization, where partial tokens fall out of distribution during inference, leading to incorrect or nonsensical outputs. This paper examines a technique to alleviate the tokenization artifact on text completion in generative models, maintaining per…
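A minimal sketch of the character-matching idea (hypothetical tokenizer interface; not the paper's exact procedure): back off the final prompt token, keep its characters as a constraint, and restrict the next-token choice to vocabulary entries consistent with those characters, so the model samples a token that is in-distribution.

```python
def backoff_last_token(token_ids, decode):
    """Drop the final token; return (shorter_ids, leftover_text)."""
    return token_ids[:-1], decode(token_ids[-1:])

def allowed_tokens(vocab, leftover):
    """Token ids whose string is consistent with the leftover characters:
    either the token string extends the leftover, or the leftover
    extends the token string (vocab maps id -> surface string)."""
    return [t for t, s in vocab.items()
            if s.startswith(leftover) or leftover.startswith(s)]
```

Decoding would then proceed with logits masked to `allowed_tokens`, advancing the character constraint until it is consumed.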

    Submitted 13 March, 2024; originally announced March 2024.

  7. arXiv:2303.05378  [pdf, other]

    cs.LG cs.SE

    Greener yet Powerful: Taming Large Code Generation Models with Quantization

    Authors: Xiaokai Wei, Sujan Gonugondla, Wasi Ahmad, Shiqi Wang, Baishakhi Ray, Haifeng Qian, Xiaopeng Li, Varun Kumar, Zijian Wang, Yuchen Tian, Qing Sun, Ben Athiwaratkun, Mingyue Shang, Murali Krishna Ramanathan, Parminder Bhatia, Bing Xiang

    Abstract: ML-powered code generation aims to assist developers in writing code more productively, by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have substantially pushed the boundary of code generation and achieved impressive performance. Despite their great power, the huge number of model parameters poses a significant thr…
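As background on what weight quantization involves, a generic symmetric int8 sketch (not the specific quantization schemes evaluated in the paper): each weight tensor is mapped to 8-bit integers plus one floating-point scale, shrinking memory roughly 4x relative to float32.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by
    scale * q, with q an int8 tensor and scale a single float."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale
```

Per-channel scales and quantization-aware calibration are common refinements that reduce the accuracy loss this sketch would incur on large models.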

    Submitted 9 March, 2023; originally announced March 2023.

    Comments: 10 pages, 7 figures, 10 tables

  8. arXiv:2210.14868  [pdf, other]

    cs.LG cs.CL

    Multi-lingual Evaluation of Code Generation Models

    Authors: Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang

    Abstract: We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we are able to assess the perform…

    Submitted 28 March, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

    Comments: Code and data release: https://github.com/amazon-research/mxeval

  9. arXiv:2012.13645  [pdf, other]

    cs.AR eess.SP

    Fundamental Limits on Energy-Delay-Accuracy of In-memory Architectures in Inference Applications

    Authors: Sujan Kumar Gonugondla, Charbel Sakr, Hassan Dbouk, Naresh R. Shanbhag

    Abstract: This paper obtains fundamental limits on the computational precision of in-memory computing architectures (IMCs). An IMC noise model and associated SNR metrics are defined and their interrelationships analyzed to show that the accuracy of IMCs is fundamentally limited by the compute SNR ($\text{SNR}_{\text{a}}$) of its analog core, and that activation, weight and output precision needs to be assig…

    Submitted 25 December, 2020; originally announced December 2020.

    Comments: 14 pages, 13 figures

  10. arXiv:1610.07501  [pdf, other]

    cs.AR

    A 481pJ/decision 3.4M decision/s Multifunctional Deep In-memory Inference Processor using Standard 6T SRAM Array

    Authors: Mingu Kang, Sujan Gonugondla, Ameya Patil, Naresh Shanbhag

    Abstract: This paper describes a multi-functional deep in-memory processor for inference applications. Deep in-memory processing is achieved by embedding pitch-matched low-SNR analog processing into a standard 6T 16KB SRAM array in 65 nm CMOS. Four applications are demonstrated. The prototype achieves up to 5.6X (9.7X estimated for multi-bank scenario) energy savings with negligible (<1%) accuracy degradati…

    Submitted 24 October, 2016; originally announced October 2016.