Showing 1–6 of 6 results for author: Sadhukhan, R

Searching in archive cs.
  1. arXiv:2601.10639  [pdf, ps, other]

    cs.LG

    STEM: Scaling Transformers with Embedding Modules

    Authors: Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen

    Abstract: Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection…

    Submitted 15 January, 2026; originally announced January 2026.
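
    A minimal sketch of the mechanism the abstract describes: the FFN up-projection is replaced by a layer-local embedding table indexed by the token id, while the gate and down-projection stay dense. The class, argument names, and activation below are illustrative assumptions, not the paper's implementation.

        # Hypothetical PyTorch sketch of a gated FFN whose up-projection is a
        # static, token-indexed embedding lookup (one reading of the STEM abstract).
        import torch
        import torch.nn as nn

        class STEMBlockSketch(nn.Module):
            def __init__(self, vocab_size: int, d_model: int, d_ff: int):
                super().__init__()
                self.gate_proj = nn.Linear(d_model, d_ff, bias=False)  # kept dense
                self.down_proj = nn.Linear(d_ff, d_model, bias=False)  # kept dense
                # Layer-local embedding table replaces the up-projection:
                # each token id simply looks up its d_ff-dimensional vector.
                self.up_embed = nn.Embedding(vocab_size, d_ff)

            def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
                gate = torch.nn.functional.silu(self.gate_proj(hidden))  # (B, T, d_ff)
                up = self.up_embed(token_ids)                            # (B, T, d_ff), lookup, no matmul
                return self.down_proj(gate * up)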

  2. arXiv:2506.05333  [pdf, ps, other]

    cs.LG cs.CL

    Kinetics: Rethinking Test-Time Scaling Laws

    Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen

    Abstract: We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new…

    Submitted 19 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.
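
    A back-of-envelope illustration of the memory-access point the abstract raises, not the paper's cost model: during decoding every new token re-reads the KV cache, so bytes moved grow with generation length and multiply under Best-of-N, while weight FLOPs per token stay roughly constant. All constants and shapes below are assumptions for illustration.

        # Rough decode-cost estimate; model and cache constants are illustrative.
        def decode_cost(params: float, n_layers: int, n_kv_heads: int, head_dim: int,
                        gen_len: int, best_of_n: int = 1, bytes_per_elem: int = 2):
            flops_per_token = 2 * params                      # ~2 FLOPs per weight per token
            kv_bytes_per_pos = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
            # Token t re-reads the KV cache of all t previous positions.
            kv_read_bytes = kv_bytes_per_pos * gen_len * (gen_len + 1) // 2
            return {"flops": best_of_n * flops_per_token * gen_len,
                    "kv_read_bytes": best_of_n * kv_read_bytes}

        # Example: a 32B-parameter model, 16k-token chain of thought, Best-of-8.
        print(decode_cost(32e9, n_layers=64, n_kv_heads=8, head_dim=128,
                          gen_len=16384, best_of_n=8))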

  3. arXiv:2410.16179  [pdf, other]

    cs.CL cs.LG

    MagicPIG: LSH Sampling for Efficient LLM Generation

    Authors: Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen

    Abstract: Large language models (LLMs) with long context windows have gained significant attention. However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various dynamic sparse or TopK-based attention approximation methods have been proposed to leverage the common insight that attention is sparse. In this paper, we first show that TopK attention itself suffers from quality degradation…

    Submitted 18 December, 2024; v1 submitted 21 October, 2024; originally announced October 2024.
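
    A generic SimHash sketch of the idea named in the title, LSH-based sampling of keys for approximate attention. It only illustrates the hashing-and-sampling pattern; it is not MagicPIG's estimator, and the bit width and fallback rule are assumptions.

        # LSH-sampled attention for a single query vector; illustrative only.
        import numpy as np

        def simhash(x: np.ndarray, planes: np.ndarray) -> np.ndarray:
            """Sign-random-projection hash: one bit per hyperplane."""
            return x @ planes.T > 0

        def lsh_sampled_attention(q, K, V, n_bits=8, seed=0):
            rng = np.random.default_rng(seed)
            planes = rng.standard_normal((n_bits, q.shape[-1]))
            # Keys whose code matches the query's are likely to have high inner product.
            candidates = np.nonzero((simhash(K, planes) == simhash(q, planes)).all(axis=1))[0]
            if candidates.size == 0:
                candidates = np.arange(K.shape[0])            # fallback: use every key
            scores = K[candidates] @ q / np.sqrt(q.shape[-1])
            w = np.exp(scores - scores.max())
            return (w / w.sum()) @ V[candidates]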

  4. arXiv:2408.11049  [pdf, other]

    cs.CL

    MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

    Authors: Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, Beidi Chen

    Abstract: Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but it is challenging to serve long-context requests with low latency and high throughput. Speculative decoding (SD) is a widely used technique to reduce latency losslessly, but the conventional wisdom suggests that its efficacy is limited to sm…

    Submitted 1 April, 2025; v1 submitted 20 August, 2024; originally announced August 2024.
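
    A generic speculative-decoding step to illustrate the mechanism the abstract refers to: a small draft model proposes a few tokens and the large target model verifies them in one forward pass. This is a greedy-verification variant assuming a HuggingFace-style model(ids).logits interface; it is not MagicDec's drafting strategy, and the draft length is an assumption.

        # One draft-then-verify step of speculative decoding (greedy verification).
        import torch

        @torch.no_grad()
        def speculative_step(draft_model, target_model, input_ids, gamma: int = 4):
            seq = input_ids
            proposed = []
            for _ in range(gamma):                              # cheap autoregressive drafting
                nxt = draft_model(seq).logits[:, -1].argmax(-1, keepdim=True)
                proposed.append(nxt)
                seq = torch.cat([seq, nxt], dim=-1)
            tgt_logits = target_model(seq).logits               # one pass scores all drafts
            accepted = []
            for i, nxt in enumerate(proposed):
                pos = input_ids.shape[1] + i - 1                # target's prediction for slot i
                tgt_tok = tgt_logits[:, pos].argmax(-1, keepdim=True)
                accepted.append(tgt_tok)                        # equals the draft token on a match
                if not torch.equal(tgt_tok, nxt):               # first mismatch: correct and stop
                    break
            return torch.cat([input_ids] + accepted, dim=-1)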

  5. arXiv:2407.12352  [pdf, other]

    cs.CR cs.AI cs.AR

    SENTAUR: Security EnhaNced Trojan Assessment Using LLMs Against Undesirable Revisions

    Authors: Jitendra Bhandari, Rajat Sadhukhan, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri

    Abstract: A globally distributed IC supply chain brings risks due to untrusted third parties. These risks include hardware Trojans (HTs) inadvertently introduced through third-party Intellectual Property (3P-IP) or Electronic Design Automation (EDA) flows. An HT can introduce stealthy behavior, prevent an IC from working as intended, or leak sensitive data via side channels. To counter HTs, rapidly examining HT scenarios is a key require…

    Submitted 17 July, 2024; originally announced July 2024.

  6. arXiv:2405.06394  [pdf, other]

    cs.LG cs.AI cs.NE

    Memory Mosaics

    Authors: Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Léon Bottou

    Abstract: Memory Mosaics are networks of associative memories working in concert to achieve a prediction task of interest. Like transformers, memory mosaics possess compositional and in-context learning capabilities. Unlike transformers, memory mosaics achieve these capabilities in a comparatively transparent way ("predictive disentanglement"). We illustrate these capabilities on a toy example an…

    Submitted 27 February, 2025; v1 submitted 10 May, 2024; originally announced May 2024.
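
    A minimal sketch of one associative-memory unit of the kind the abstract mentions: key-value pairs stored and queried by kernel-weighted averaging. How such memories are combined, trained, and made to "work in concert" is the paper's contribution; everything below is an illustrative assumption.

        # One associative memory: store (key, value) pairs, retrieve by
        # softmax/kernel-weighted averaging over the stored values.
        import numpy as np

        class AssociativeMemorySketch:
            def __init__(self, beta: float = 1.0):
                self.keys, self.values, self.beta = [], [], beta

            def store(self, key: np.ndarray, value: np.ndarray) -> None:
                self.keys.append(key)
                self.values.append(value)

            def retrieve(self, query: np.ndarray) -> np.ndarray:
                K = np.stack(self.keys)               # (n, d_k)
                V = np.stack(self.values)             # (n, d_v)
                scores = self.beta * (K @ query)      # similarity to every stored key
                w = np.exp(scores - scores.max())
                return (w / w.sum()) @ V              # kernel-smoothed prediction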