-
RPU -- A Reasoning Processing Unit
Authors:
Matthew Adiletta,
Gu-Yeon Wei,
David Brooks
Abstract:
Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory bandwidth bound workloads. This challenge is amplified by emerging reasoning LLM applications, where long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for an optimized system architecture for scalable memory bandwidth.
To address these challenges, we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed for the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable chiplet architecture featuring a bandwidth-first power and area provisioning design; and (3) a decoupled microarchitecture that separates memory, compute, and communication pipelines to sustain high bandwidth utilization. Simulation results show that RPU achieves up to 45.3x lower latency and 18.6x higher throughput than an H100 system at ISO-TDP on Llama3-405B.
Submitted 23 February, 2026; v1 submitted 20 February, 2026;
originally announced February 2026.
-
DreamRAM: A Fine-Grained Configurable Design Space Modeling Tool for Custom 3D Die-Stacked DRAM
Authors:
Victor Cai,
Jennifer Zhou,
Haebin Do,
David Brooks,
Gu-Yeon Wei
Abstract:
3D die-stacked DRAM has emerged as a key technology for delivering high bandwidth and high density for applications such as high-performance computing, graphics, and machine learning. However, different applications place diverse and sometimes diverging demands on power, performance, and area that cannot be universally satisfied with fixed commodity DRAM designs. Die stacking creates the opportunity for a large DRAM design space through 3D integration and expanded total die area. To open and navigate this expansive design space of customized memory architectures that cater to application-specific needs, we introduce DreamRAM, a configurable bandwidth, capacity, energy, latency, and area modeling tool for custom 3D die-stacked DRAM designs. DreamRAM exposes fine-grained design customization parameters at the MAT, subarray, bank, and inter-bank levels, including extensions of partial page and subarray parallelism proposals found in the literature, to open a large previously-unexplored design space. DreamRAM analytically models wire pitch, width, length, capacitance, and scaling parameters to capture the performance tradeoffs of physical layout and routing design choices. Routing awareness enables DreamRAM to model a custom MAT-level routing scheme, Dataline-Over-MAT (DLOMAT), to facilitate better bandwidth tradeoffs. DreamRAM is calibrated and validated against published industry HBM3 and HBM2E designs. Within DreamRAM's rich design space, we identify designs that achieve each of 66% higher bandwidth, 100% higher capacity, and 45% lower power and energy per bit compared to the baseline design, each on an iso-bandwidth, iso-capacity, and iso-power basis.
Submitted 12 December, 2025;
originally announced December 2025.
-
Dark Energy Survey Year 3 results: Simulation-based $w$CDM inference from weak lensing and galaxy clustering maps with deep learning: Analysis design
Authors:
A. Thomsen,
J. Bucko,
T. Kacprzak,
V. Ajani,
J. Fluri,
A. Refregier,
D. Anbajagane,
F. J. Castander,
A. Ferté,
M. Gatti,
N. Jeffrey,
A. Alarcon,
A. Amon,
K. Bechtol,
M. R. Becker,
G. M. Bernstein,
A. Campos,
A. Carnero Rosell,
C. Chang,
R. Chen,
A. Choi,
M. Crocce,
C. Davis,
J. DeRose,
S. Dodelson
, et al. (77 additional authors not shown)
Abstract:
Data-driven approaches using deep learning are emerging as powerful techniques to extract non-Gaussian information from cosmological large-scale structure. This work presents the first simulation-based inference (SBI) pipeline that combines weak lensing and galaxy clustering maps in a realistic Dark Energy Survey Year 3 (DES Y3) configuration and serves as preparation for a forthcoming analysis of the survey data. We develop a scalable forward model based on the CosmoGridV1 suite of N-body simulations to generate over one million self-consistent mock realizations of DES Y3 at the map level. Leveraging this large dataset, we train deep graph convolutional neural networks on the full survey footprint in spherical geometry to learn low-dimensional features that approximately maximize mutual information with target parameters. These learned compressions enable neural density estimation of the implicit likelihood via normalizing flows in a ten-dimensional parameter space spanning cosmological $w$CDM, intrinsic alignment, and linear galaxy bias parameters, while marginalizing over baryonic, photometric redshift, and shear bias nuisances. To ensure robustness, we extensively validate our inference pipeline using synthetic observations derived from both systematic contaminations in our forward model and independent Buzzard galaxy catalogs. Our forecasts yield significant improvements in cosmological parameter constraints, achieving $2-3\times$ higher figures of merit in the $Ω_m - S_8$ plane relative to our implementation of baseline two-point statistics and effectively breaking parameter degeneracies through probe combination. These results demonstrate the potential of SBI analyses powered by deep learning for upcoming Stage-IV wide-field imaging surveys.
Submitted 18 February, 2026; v1 submitted 6 November, 2025;
originally announced November 2025.
-
PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training
Authors:
Alicia Golden,
Michael Kuchnik,
Samuel Hsia,
Zachary DeVito,
Gu-Yeon Wei,
David Brooks,
Carole-Jean Wu
Abstract:
Large model training beyond tens of thousands of GPUs is uncharted territory. At such scales, disruptions to the training process are not a matter of if, but a matter of when -- a stochastic process degrading training productivity. Dynamic runtime variation will become increasingly frequent as training scales up and GPUs are operated in increasingly power-limited and thermally-stressed environments. At the 64k GPU scale, we already observed 9% GPU time variability for frontier foundation model training. To understand potential causes of variability, we analyze GPU microbenchmarks at scale across a variety of platforms, showing up to 14% variation in GPU performance on GEMM workloads depending on training hardware and deployed environment.
Motivated by our analysis and the large design space around performance variability, we present PRISM -- a performance modeling framework that considers the stochastic nature of large-scale distributed training. The core of PRISM is a statistical method that provides a quantifiable measure for probabilistic guarantees on training time. Using PRISM, we explore the design and optimization space of distributed training, from parallelization methods to next-generation training systems. PRISM is validated with real-system measurement, showing training time prediction accuracy with 20.8% Kolmogorov-Smirnov distance. Using PRISM, we demonstrate that, depending on computation node placement, up to 1.26x performance improvement potential is available if we factor in sensitivities of parallelization strategies to variation. In addition, we use PRISM to identify kernels to optimize for reducing performance variability and to predict the probability of slow-down for large-scale jobs where variation is magnified. We find that optimizing communication kernels, such as AllGather and ReduceScatter, contributes most to minimizing variability in training step time.
Submitted 17 October, 2025;
originally announced October 2025.
-
Transformer-Based Indirect Structural Health Monitoring of Rail Infrastructure with Attention-Driven Detection and Localization of Transient Defects
Authors:
Sizhe Ma,
Katherine A. Flanigan,
Mario Bergés,
James D. Brooks
Abstract:
Indirect structural health monitoring (iSHM) for broken rail detection using onboard sensors presents a cost-effective paradigm for railway track assessment, yet reliably detecting small, transient anomalies (2-10 cm) remains a significant challenge due to complex vehicle dynamics, signal noise, and the scarcity of labeled data limiting supervised approaches. This study addresses these issues through unsupervised deep learning. We introduce an incremental synthetic data benchmark designed to systematically evaluate model robustness against progressively complex challenges like speed variations, multi-channel inputs, and realistic noise patterns encountered in iSHM. Using this benchmark, we evaluate several established unsupervised models alongside our proposed Attention-Focused Transformer. Our model employs a self-attention mechanism, trained via reconstruction but innovatively deriving anomaly scores primarily from deviations in learned attention weights, aiming for both effectiveness and computational efficiency. Benchmarking results reveal that while transformer-based models generally outperform others, all tested models exhibit significant vulnerability to high-frequency localized noise, identifying this as a critical bottleneck for practical deployment. Notably, our proposed model achieves accuracy comparable to the state-of-the-art solution while demonstrating better inference speed. This highlights the crucial need for enhanced noise robustness in future iSHM models and positions our more efficient attention-based approach as a promising foundation for developing practical onboard anomaly detection systems.
Submitted 8 October, 2025;
originally announced October 2025.
-
MIRAGE: Patient-Specific Mixed Reality Coaching for MRI via Depth-Only Markerless Registration and Immersive VR
Authors:
Daniel Brooks,
Emily Carter,
Hu Guo,
Rajesh Nair
Abstract:
Magnetic resonance imaging (MRI) is an indispensable diagnostic tool, yet the confined bore and acoustic noise can evoke considerable anxiety and claustrophobic reactions. High anxiety leads to motion artifacts, incomplete scans and reliance on pharmacological sedation. MIRAGE (Mixed Reality Anxiety Guidance Environment) harnesses the latest mixed reality (MR) hardware to prepare patients for MRI through immersive virtual reality (VR) and markerless augmented reality (AR) registration. In this paper, we extend our previous work by providing a comprehensive review of related research, detailing the system architecture, and exploring metrics for patient and clinician experience. We also present considerations for clinical deployment of MR systems within hospital workflows. Our results indicate that depth-based registration achieves sub-centimeter accuracy with minimal setup, while the immersive coaching environment reduces patient anxiety and yields favourable usability scores.
Submitted 2 October, 2025;
originally announced October 2025.
-
The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute
Authors:
Yunho Jin,
Gu-Yeon Wei,
David Brooks
Abstract:
Scaling large language models (LLMs) has driven significant advancements, yet it faces diminishing returns and escalating energy demands. This work explores how test-time compute (TTC) can serve as an energy-efficient complement to conventional scaling strategies by allocating additional computational resources at inference time rather than during training. Specifically, we investigate whether employing TTC can achieve superior accuracy-energy trade-offs compared to simply increasing model size. Our empirical analysis reveals that TTC surpasses traditional model scaling in accuracy/energy efficiency, with notable gains in tasks demanding complex reasoning rather than mere factual recall. Further, we identify a critical interaction between TTC performance and output sequence length, demonstrating that strategically adjusting compute resources at inference time according to query complexity can substantially enhance efficiency. Our findings advocate for TTC as a promising direction, enabling more sustainable, accurate, and adaptable deployment of future language models.
Submitted 9 November, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
Modeling PFAS in Semiconductor Manufacturing to Quantify Trade-offs in Energy Efficiency and Environmental Impact of Computing Systems
Authors:
Mariam Elgamal,
Abdulrahman Mahmoud,
Gu-Yeon Wei,
David Brooks,
Gage Hills
Abstract:
The electronics and semiconductor industry is a prominent consumer of per- and poly-fluoroalkyl substances (PFAS), also known as forever chemicals. PFAS are persistent in the environment and can bioaccumulate to ecological and human toxic levels. Computer designers have an opportunity to reduce the use of PFAS in semiconductors and electronics manufacturing, including integrated circuits (IC), batteries, displays, etc., which currently account for a staggering 10% of the total PFAS fluoropolymers usage in Europe alone. In this paper, we present a framework where we (1) quantify the environmental impact of PFAS in computing systems manufacturing with granular consideration of the metal layer stack and patterning complexities in IC manufacturing at the design phase, (2) identify contending trends between embodied carbon (carbon footprint due to hardware manufacturing) and PFAS; for example, manufacturing an IC at a 7 nm technology node using EUV lithography uses 18% fewer PFAS-containing layers than manufacturing the same IC at a 7 nm technology node using DUV immersion lithography, unlike embodied carbon trends, and (3) conduct case studies to illustrate how to optimize and trade off designs with lower PFAS, while meeting power-performance-area constraints. We show that optimizing designs to use fewer back-end-of-line (BEOL) metal stack layers can save 1.7$\times$ PFAS-containing layers in systolic arrays.
Submitted 10 May, 2025;
originally announced May 2025.
-
FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices
Authors:
Yuji Chai,
Mujin Kwen,
David Brooks,
Gu-Yeon Wei
Abstract:
Deploying LLMs on edge devices presents serious technical challenges. Memory elasticity is crucial for edge devices with unified memory, where memory is shared and fluctuates dynamically. Existing solutions suffer from either poor transition granularity or high storage costs. We propose FlexQuant, a novel elasticity framework that generates an ensemble of quantized models, providing an elastic hosting solution with 15x granularity improvement and 10x storage reduction compared to SoTA methods. FlexQuant works with most quantization methods and creates a family of trade-off options under various storage limits through our pruning method. It brings great performance and flexibility to the edge deployment of LLMs.
Submitted 13 January, 2025;
originally announced January 2025.
-
Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models
Authors:
Yun-Chen Lo,
Gu-Yeon Wei,
David Brooks
Abstract:
As cutting-edge large language models (LLMs) continue to transform various industries, their fast-growing model size and sequence length have led to memory traffic and capacity challenges. Recently, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm have proposed a Microscaling standard (Mx), which augments block floating-point with microexponents to achieve promising perplexity-to-footprint trade-offs. However, Microscaling suffers from significant perplexity degradation on modern LLMs with fewer than six bits. This paper profiles modern LLMs and identifies three main challenges of the low-bit Microscaling format, i.e., inaccurate tracking of outliers, vacant quantization levels, and wasted binary code. In response, Nanoscaling (NxFP) proposes three techniques, i.e., NanoMantissa, Adaptive Microexponent, and Code Recycling, to enable better accuracy and a smaller memory footprint than state-of-the-art MxFP. Experimental results on direct-cast inference across various modern LLMs demonstrate that our proposed methods outperform state-of-the-art MxFP by up to 0.64 in perplexity and by up to 30% in accuracy on MMLU benchmarks. Furthermore, NxFP reduces memory footprint by up to 16% while achieving perplexity comparable to MxFP.
Submitted 15 December, 2024;
originally announced December 2024.
-
Carbon Connect: An Ecosystem for Sustainable Computing
Authors:
Benjamin C. Lee,
David Brooks,
Arthur van Benthem,
Udit Gupta,
Gage Hills,
Vincent Liu,
Benjamin Pierce,
Christopher Stewart,
Emma Strubell,
Gu-Yeon Wei,
Adam Wierman,
Yuan Yao,
Minlan Yu
Abstract:
Computing is at a moment of profound opportunity. Emerging applications -- such as capable artificial intelligence, immersive virtual realities, and pervasive sensor systems -- drive unprecedented demand for computing. Despite recent advances toward net zero carbon emissions, the computing industry's gross energy usage continues to rise at an alarming rate, outpacing the growth of new energy installations and renewable energy deployments. A shift towards sustainability is needed to spark a transformation in how computer systems are manufactured, allocated, and consumed. Carbon Connect envisions coordinated research thrusts that produce design and management strategies for sustainable, next-generation computer systems. These strategies must flatten and then reverse growth trajectories for computing power and carbon for society's most rapidly growing applications such as artificial intelligence and virtual spaces. We will require accurate models for carbon accounting in computing technology. For embodied carbon, we must re-think conventional design strategies -- over-provisioned monolithic servers, frequent hardware refresh cycles, custom silicon -- and adopt life-cycle design strategies that more effectively reduce, reuse and recycle hardware at scale. For operational carbon, we must not only embrace renewable energy but also design systems to use that energy more efficiently. Finally, new hardware design and management strategies must be cognizant of economic policy and regulatory landscape, aligning private initiatives with societal goals. Many of these broader goals will require computer scientists to develop deep, enduring collaborations with researchers in economics, law, and industrial ecology to spark change in broader practice.
Submitted 21 August, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Is Flash Attention Stable?
Authors:
Alicia Golden,
Samuel Hsia,
Fei Sun,
Bilge Acun,
Basil Hosmer,
Yejin Lee,
Zachary DeVito,
Jeff Johnson,
Gu-Yeon Wei,
David Brooks,
Carole-Jean Wu
Abstract:
Training large-scale machine learning models poses distinct system challenges, given both the size and complexity of today's workloads. Recently, many organizations training state-of-the-art Generative AI models have reported cases of instability during training, often taking the form of loss spikes. Numeric deviation has emerged as a potential cause of this training instability, although quantifying this is especially challenging given the costly nature of training runs. In this work, we develop a principled approach to understanding the effects of numeric deviation, and construct proxies to put observations into context when downstream effects are difficult to quantify. As a case study, we apply this framework to analyze the widely-adopted Flash Attention optimization. We find that Flash Attention sees roughly an order of magnitude more numeric deviation as compared to Baseline Attention at BF16 when measured during an isolated forward pass. We then use a data-driven analysis based on the Wasserstein Distance to provide upper bounds on how this numeric deviation impacts model weights during training, finding that the numerical deviation present in Flash Attention is 2-5 times less significant than low-precision training.
Submitted 4 May, 2024;
originally announced May 2024.
-
Guac: Energy-Aware and SSA-Based Generation of Coarse-Grained Merged Accelerators from LLVM-IR
Authors:
Iulian Brumar,
Rodrigo Rocha,
Alex Bernat,
Devashree Tripathy,
David Brooks,
Gu-Yeon Wei
Abstract:
Designing accelerators for resource- and power-constrained applications is a daunting task. High-level Synthesis (HLS) addresses these constraints through resource sharing, an optimization at the HLS binding stage that maps multiple operations to the same functional unit.
However, resource sharing is often limited to reusing instructions within a basic block. Instead of searching globally for the best control and dataflow graphs (CDFGs) to combine, it is constrained by existing instruction mappings and schedules.
Coarse-grained function merging (CGFM) at the intermediate representation (IR) level can reuse control and dataflow patterns without dealing with the post-scheduling complexity of mapping operations onto functional units, wires, and registers. The merged functions produced by CGFM can be translated to RTL by HLS, yielding Coarse Grained Merged Accelerators (CGMAs). CGMAs are especially profitable across applications with similar data- and control-flow patterns. Prior work has used CGFM to generate CGMAs without regard for which CGFM algorithms best optimize area, power, and energy costs.
We propose Guac, an energy-aware and SSA-based (static single assignment) CGMA generation methodology. Guac implements a novel ensemble of cost models for efficient CGMA generation. We also show that CGFM algorithms using SSA form to merge control- and dataflow graphs outperform prior non-SSA CGFM designs. We demonstrate significant area, power, and energy savings with respect to the state of the art. In particular, Guac more than doubles energy savings with respect to the closest related work while using a strong resource-sharing baseline.
Submitted 20 February, 2024;
originally announced February 2024.
-
Personalizing Driver Safety Interfaces via Driver Cognitive Factors Inference
Authors:
Emily S Sumner,
Jonathan DeCastro,
Jean Costa,
Deepak E Gopinath,
Everlyne Kimani,
Shabnam Hakimi,
Allison Morgan,
Andrew Best,
Hieu Nguyen,
Daniel J Brooks,
Bassam ul Haq,
Andrew Patrikalakis,
Hiroshi Yasuda,
Kate Sieck,
Avinash Balachandran,
Tiffany Chen,
Guy Rosman
Abstract:
Recent advances in AI and intelligent vehicle technology hold promise to revolutionize mobility and transportation, in the form of advanced driving assistance (ADAS) interfaces. Although it is widely recognized that certain cognitive factors, such as impulsivity and inhibitory control, are related to risky driving behavior and play a significant role in on-road risk-taking, existing systems fail to leverage such factors. Varying levels of these cognitive factors could influence the effectiveness and acceptance of driver safety interfaces.
We demonstrate an approach for personalizing driver interaction via driver safety interfaces that are triggered based on a learned recurrent neural network. The network is trained from a population of human drivers to infer impulsivity and inhibitory control from recent driving behavior. Using a high-fidelity vehicle motion simulator, we demonstrate the ability to deduce these factors from driver behavior. We then use these inferred factors to make instantaneous determinations on whether or not to engage a driver safety interface. This interface aims to decrease a driver's speed during yellow lights and reduce their inclination to run through them.
Submitted 8 February, 2024;
originally announced February 2024.
-
Flash: A Hybrid Private Inference Protocol for Deep CNNs with High Accuracy and Low Latency on CPU
Authors:
Hyeri Roh,
Jinsu Yeo,
Yeongil Ko,
Gu-Yeon Wei,
David Brooks,
Woo-Seok Choi
Abstract:
This paper presents Flash, an optimized private inference (PI) hybrid protocol utilizing both homomorphic encryption (HE) and secure two-party computation (2PC), which can reduce the end-to-end PI latency for deep CNN models to less than 1 minute on a CPU. To this end, first, Flash proposes a low-latency convolution algorithm built upon a fast slot rotation operation and a novel data encoding scheme, which results in a 4-94x performance gain over the state-of-the-art. Second, to minimize the communication cost introduced by the standard nonlinear activation function ReLU, Flash replaces all ReLUs with the polynomial $x^2+x$ and trains deep CNN models with a new training strategy. The trained models improve the inference accuracy for CIFAR-10/100 and TinyImageNet by 16% on average (up to 40% for ResNet-32) compared to prior art. Last, Flash proposes an efficient 2PC-based $x^2+x$ evaluation protocol that does not require any offline communication and that reduces the total communication cost to process the activation layer by 84-196x over the state-of-the-art. As a result, the end-to-end PI latency of Flash implemented on CPU is 0.02 minute for CIFAR-100 and 0.57 minute for TinyImageNet classification, while the total data communication is 0.07GB for CIFAR-100 and 0.22GB for TinyImageNet. Flash improves the state-of-the-art PI by 16-45x in latency and 84-196x in communication cost. Moreover, even for ImageNet, Flash can deliver a latency of less than 1 minute on CPU with total communication of less than 1GB.
Submitted 17 January, 2025; v1 submitted 29 January, 2024;
originally announced January 2024.
-
Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
Authors:
Alicia Golden,
Samuel Hsia,
Fei Sun,
Bilge Acun,
Basil Hosmer,
Yejin Lee,
Zachary DeVito,
Jeff Johnson,
Gu-Yeon Wei,
David Brooks,
Carole-Jean Wu
Abstract:
As the development of large-scale Generative AI models evolves beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models. Current model architecture designs are bifurcated into 2 categories: Diffusion- and Transformer-based models. Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models, while Linear layers consume up to 49% of execution time for Transformer-based models. We additionally observe that Diffusion-based TTI models resemble the Prefill stage of LLM inference, and benefit from 1.1-2.5x greater speedup from Flash Attention than Transformer-based TTI models that resemble the Decode phase. Since optimizations designed for LLMs do not map directly onto TTI/TTV models, we must conduct a thorough characterization of these workloads to gain insights for new optimization opportunities. In doing so, we define sequence length in the context of TTI/TTV models and observe sequence length can vary up to 4x in Diffusion model inference. We additionally observe temporal aspects of TTV workloads pose unique system bottlenecks, with Temporal Attention accounting for over 60% of total Attention time. Overall, our in-depth system performance characterization is a critical first step towards designing efficient and deployable systems for emerging TTI/TTV workloads.
Submitted 5 May, 2024; v1 submitted 21 December, 2023;
originally announced December 2023.
-
Hardware Resilience Properties of Text-Guided Image Classifiers
Authors:
Syed Talal Wasim,
Kabila Haile Soboka,
Abdulrahman Mahmoud,
Salman Khan,
David Brooks,
Gu-Yeon Wei
Abstract:
This paper presents a novel method to enhance the reliability of image classification models during deployment in the face of transient hardware errors. By utilizing enriched text embeddings derived from GPT-3 with question prompts per class and CLIP pretrained text encoder, we investigate their impact as an initialization for the classification layer. Our approach achieves a remarkable $5.5\times$ average increase in hardware reliability (and up to $14\times$) across various architectures in the most critical layer, with minimal accuracy drop ($0.3\%$ on average) compared to baseline PyTorch models. Furthermore, our method seamlessly integrates with any image classification backbone, showcases results across various network architectures, decreases parameter and FLOPs overhead, and follows a consistent training recipe. This research offers a practical and efficient solution to bolster the robustness of image classification models against hardware failures, with potential implications for future studies in this domain. Our code and models are released at https://github.com/TalalWasim/TextGuidedResilience.
Submitted 5 December, 2023; v1 submitted 23 November, 2023;
originally announced November 2023.
-
Carbon Responder: Coordinating Demand Response for the Datacenter Fleet
Authors:
Jiali Xing,
Bilge Acun,
Aditya Sundarrajan,
David Brooks,
Manoj Chakkaravarthy,
Nikky Avila,
Carole-Jean Wu,
Benjamin C. Lee
Abstract:
The increasing integration of renewable energy sources results in fluctuations in carbon intensity throughout the day. To mitigate their carbon footprint, datacenters can implement demand response (DR) by adjusting their load based on grid signals. However, this presents challenges for private datacenters with diverse workloads and services. One of the key challenges is efficiently and fairly allocating power curtailment across different workloads. In response to these challenges, we propose the Carbon Responder framework.
The Carbon Responder framework aims to reduce the carbon footprint of heterogeneous workloads in datacenters by modulating their power usage. Unlike previous studies, Carbon Responder considers both online and batch workloads with different service level objectives and develops accurate performance models to achieve performance-aware power allocation. The framework supports three alternative policies: Efficient DR, Fair and Centralized DR, and Fair and Decentralized DR. We evaluate Carbon Responder policies using production workload traces from a private hyperscale datacenter. Our experimental results demonstrate that the efficient Carbon Responder policy reduces the carbon footprint by around 2x as much as baseline approaches adapted from existing methods. The fair Carbon Responder policies distribute the performance penalties and carbon reduction responsibility fairly among workloads.
Submitted 14 November, 2023;
originally announced November 2023.
-
MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
Authors:
Samuel Hsia,
Alicia Golden,
Bilge Acun,
Newsha Ardalani,
Zachary DeVito,
Gu-Yeon Wei,
David Brooks,
Carole-Jean Wu
Abstract:
Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. Our analysis, grounded in real-world large model training on datacenter-scale infrastructures, reveals that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max. This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities. Through the application of MAD-Max to a suite of real-world large-scale ML models on state-of-the-art GPU clusters, we showcase potential throughput enhancements of up to 2.24x for pre-training and up to 5.2x for inference scenarios, respectively.
Submitted 10 June, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.
-
Guess & Sketch: Language Model Guided Transpilation
Authors:
Celine Lee,
Abdulrahman Mahmoud,
Michal Kurek,
Simone Campanoni,
David Brooks,
Stephen Chong,
Gu-Yeon Wei,
Alexander M. Rush
Abstract:
Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e. automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.
Submitted 15 March, 2024; v1 submitted 25 September, 2023;
originally announced September 2023.
-
Critical Evaluation of Artificial Intelligence as Digital Twin of Pathologist for Prostate Cancer Pathology
Authors:
Okyaz Eminaga,
Mahmoud Abbas,
Christian Kunder,
Yuri Tolkach,
Ryan Han,
James D. Brooks,
Rosalie Nolley,
Axel Semjonow,
Martin Boegemann,
Robert West,
Jin Long,
Richard Fan,
Olaf Bettendorf
Abstract:
Prostate cancer pathology plays a crucial role in clinical management but is time-consuming. Artificial intelligence (AI) shows promise in detecting prostate cancer and grading patterns. We tested an AI-based digital twin of a pathologist, vPatho, on 2,603 histology images of prostate tissue stained with hematoxylin and eosin. We analyzed various factors influencing tumor-grade disagreement between vPatho and six human pathologists. Our results demonstrated that vPatho achieved comparable performance in prostate cancer detection and tumor volume estimation, as reported in the literature. Concordance levels between vPatho and human pathologists were examined. Notably, moderate to substantial agreement was observed in identifying complementary histological features such as ductal, cribriform, nerve, blood vessels, and lymph cell infiltrations. However, concordance in tumor grading showed a decline when applied to prostatectomy specimens (kappa = 0.44) compared to biopsy cores (kappa = 0.70). Adjusting the decision threshold for the secondary Gleason pattern from 5% to 10% improved the concordance level between pathologists and vPatho for tumor grading on prostatectomy specimens (kappa from 0.44 to 0.64). Potential causes of grade discordance included the vertical extent of tumors toward the prostate boundary and the proportions of slides with prostate cancer. Gleason pattern 4 was particularly associated with discordance. Notably, grade discordance with vPatho was not specific to any of the six pathologists involved in routine clinical grading. In conclusion, our study highlights the potential utility of AI in developing a digital twin of a pathologist. This approach can help uncover limitations in AI adoption and the current grading system for prostate cancer pathology.
Submitted 23 August, 2023;
originally announced August 2023.
-
Local primordial non-Gaussianity from the large-scale clustering of photometric DESI luminous red galaxies
Authors:
Mehdi Rezaie,
Ashley J. Ross,
Hee-Jong Seo,
Hui Kong,
Anna Porredon,
Lado Samushia,
Edmond Chaussidon,
Alex Krolewski,
Arnaud de Mattia,
Florian Beutler,
Jessica Nicole Aguilar,
Steven Ahlen,
Shadab Alam,
Santiago Avila,
Benedict Bahr-Kalus,
Jose Bermejo-Climent,
David Brooks,
Todd Claybaugh,
Shaun Cole,
Kyle Dawson,
Axel de la Macorra,
Peter Doel,
Andreu Font-Ribera,
Jaime E. Forero-Romero,
Satya Gontcho A Gontcho
, et al. (24 additional authors not shown)
Abstract:
We use angular clustering of luminous red galaxies from the Dark Energy Spectroscopic Instrument (DESI) imaging surveys to constrain the local primordial non-Gaussianity parameter $f_{\rm NL}$. Our sample comprises over 12 million targets, covering 14,000 square degrees of the sky, with redshifts in the range $0.2 < z < 1.35$. We identify Galactic extinction, survey depth, and astronomical seeing as the primary sources of systematic error, and employ linear regression and artificial neural networks to alleviate non-cosmological excess clustering on large scales. Our methods are tested against simulations with and without $f_{\rm NL}$ and systematics, showing superior performance of the neural network treatment. The neural network with a set of nine imaging property maps passes our systematic null test criteria, and is chosen as the fiducial treatment. Assuming the universality relation, we find $f_{\rm NL} = 34^{+24(+50)}_{-44(-73)}$ at 68% (95%) confidence. We apply a series of robustness tests (e.g., cuts on imaging, declination, or scales used) that show consistency in the obtained constraints. We study how the regression method biases the measured angular power-spectrum and degrades the $f_{\rm NL}$ constraining power. The use of the nine maps more than doubles the uncertainty compared to using only the three primary maps in the regression. Our results thus motivate the development of more efficient methods that avoid over-correction, protect large-scale clustering information, and preserve constraining power. Additionally, our results encourage further studies of $f_{\rm NL}$ with DESI spectroscopic samples, where the inclusion of 3D clustering modes should help separate imaging systematics and lessen the degradation in the $f_{\rm NL}$ uncertainty.
Submitted 25 June, 2024; v1 submitted 4 July, 2023;
originally announced July 2023.
-
INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation
Authors:
Yuji Chai,
John Gkountouras,
Glenn G. Ko,
David Brooks,
Gu-Yeon Wei
Abstract:
We introduce a method that dramatically reduces fine-tuning VRAM requirements and rectifies quantization errors in quantized Large Language Models. First, we develop an extremely memory-efficient fine-tuning (EMEF) method for quantized models using Low-Rank Adaptation (LoRA), and drawing upon it, we construct an error-correcting algorithm designed to minimize errors induced by the quantization process. Our method reduces the memory requirements by up to 5.6 times, which enables fine-tuning a 7 billion parameter Large Language Model (LLM) on consumer laptops. At the same time, we propose a Low-Rank Error Correction (LREC) method that exploits the added LoRA layers to ameliorate the gap between the quantized model and its floating-point counterpart. Our error correction framework leads to a fully functional INT2 quantized LLM with the capacity to generate coherent English text. To the best of our knowledge, this is the first INT2 Large Language Model that has been able to reach such a performance. The overhead of our method is merely a 1.05 times increase in model size, which translates to an effective precision of INT2.1. Also, our method readily generalizes to other quantization standards, such as INT3, INT4, and INT8, restoring their lost performance, which marks a significant milestone in the field of model quantization. The strategies delineated in this paper hold promising implications for the future development and optimization of quantized models, marking a pivotal shift in the landscape of low-resource machine learning computations.
Submitted 13 June, 2023;
originally announced June 2023.
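The "INT2.1" figure in the abstract above follows from simple arithmetic: a 2-bit model whose size grows by a factor of 1.05 stores, on average, 2 x 1.05 = 2.1 bits per original weight. The short sketch below spells this out; the helper name `effective_bits` is hypothetical and the calculation assumes the overhead is spread uniformly over all quantized weights.

```python
def effective_bits(base_bits: float, size_overhead: float) -> float:
    """Average bits per weight after a multiplicative size overhead.

    base_bits     -- bit width of the quantized weights (2 for INT2)
    size_overhead -- model-size multiplier from the extra layers (1.05 here)
    """
    return base_bits * size_overhead


# Values taken from the abstract: INT2 weights, 1.05x model size.
int2_effective = effective_bits(2, 1.05)
print(f"effective precision: INT{int2_effective:.1f}")
```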
-
S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput
Authors:
Yunho Jin,
Chun-Feng Wu,
David Brooks,
Gu-Yeon Wei
Abstract:
Generating texts with a large language model (LLM) consumes massive amounts of memory. Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even larger than the model itself. This problem is exacerbated in one of the current LLM serving frameworks, which reserves memory for the maximum sequence length in the KV cache to guarantee generating a complete sequence, as it does not know the output sequence length. This restricts us to a smaller batch size, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence can mitigate this problem. To this end, we propose S$^{3}$, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions. Our proposed method achieves 6.49$\times$ throughput over those systems that assume the worst case for the output sequence length.
Submitted 9 June, 2023;
originally announced June 2023.
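To make the batch-size argument in the S$^{3}$ abstract above concrete, the toy sketch below compares how many sequences fit in a fixed KV-cache budget when every sequence reserves the maximum length versus when each reserves only a predicted length. All numbers, the function names, and the shortest-first packing order are assumptions made for illustration; the abstract does not specify the actual scheduler, and this sketch ignores misprediction handling entirely.

```python
from typing import List


def batch_size_worst_case(budget_tokens: int, max_seq_len: int) -> int:
    """Sequences that fit when each one reserves max_seq_len cache slots."""
    return budget_tokens // max_seq_len


def batch_size_predicted(budget_tokens: int, predicted_lens: List[int]) -> int:
    """Sequences that fit when each reserves only its predicted length.

    Packs shortest-first (an assumption of this sketch) until the budget
    would be exceeded.
    """
    used = 0
    count = 0
    for length in sorted(predicted_lens):
        if used + length > budget_tokens:
            break
        used += length
        count += 1
    return count


# Hypothetical values: an 8192-token KV-cache budget, 2048-token maximum.
budget = 8192
predicted = [256, 512, 300, 1024, 2048, 128, 700, 900]

worst = batch_size_worst_case(budget, 2048)
packed = batch_size_predicted(budget, predicted)
print(f"worst-case reservation: {worst} sequences")
print(f"predicted-length reservation: {packed} sequences")
```

With these made-up numbers the predicted-length reservation fits twice as many sequences; the real gain depends on the length distribution and prediction accuracy.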
-
CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device Learning
Authors:
Sai Qian Zhang,
Thierry Tambe,
Nestor Cuevas,
Gu-Yeon Wei,
David Brooks
Abstract:
On-device learning allows AI models to adapt to user data, thereby enhancing service quality on edge platforms. However, training AI on resource-limited devices poses significant challenges due to the demanding computing workload and the substantial memory consumption and data access required by deep neural networks (DNNs). To address these issues, we propose utilizing embedded dynamic random-access memory (eDRAM) as the primary storage medium for transient training data. In comparison to static random-access memory (SRAM), eDRAM provides higher storage density and lower leakage power, resulting in reduced access cost and power leakage. Nevertheless, to maintain the integrity of the stored data, periodic power-hungry refresh operations could potentially degrade system performance.
To minimize the occurrence of expensive eDRAM refresh operations, it is beneficial to shorten the lifetime of stored data during the training process. To achieve this, we adopt the principles of algorithm and hardware co-design, introducing a family of reversible DNN architectures that effectively decrease data lifetime and storage costs throughout training. Additionally, we present a highly efficient on-device training engine named CAMEL, which leverages eDRAM as the primary on-chip memory. This engine enables efficient on-device training with significantly reduced memory usage and off-chip DRAM traffic while maintaining superior training accuracy. We evaluate our CAMEL system on multiple DNNs with different datasets, demonstrating a $2.5\times$ speedup of the training process and $2.8\times$ training energy savings over the other baseline hardware platforms.
Submitted 22 December, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
Design Space Exploration and Optimization for Carbon-Efficient Extended Reality Systems
Authors:
Mariam Elgamal,
Doug Carmean,
Elnaz Ansari,
Okay Zed,
Ramesh Peri,
Srilatha Manne,
Udit Gupta,
Gu-Yeon Wei,
David Brooks,
Gage Hills,
Carole-Jean Wu
Abstract:
As computing hardware becomes more specialized, designing environmentally sustainable computing systems requires accounting for both hardware and software parameters. Our goal is to design low carbon computing systems while maintaining a competitive level of performance and operational efficiency. Despite previous carbon modeling efforts for computing systems, there is a distinct lack of holistic design strategies to simultaneously optimize for carbon, performance, power and energy. In this work, we take a data-driven approach to characterize the carbon impact (quantified in units of CO2e) of various artificial intelligence (AI) and extended reality (XR) production-level hardware and application use-cases. We propose a holistic design exploration framework to optimize and design for carbon-efficient computing systems and hardware. Our framework identifies significant opportunities for carbon efficiency improvements in application-specific and general purpose hardware design and optimization. Using our framework, we demonstrate 10$\times$ carbon efficiency improvement for specialized AI and XR accelerators (quantified by a key metric, tCDP: the product of total CO2e and total application execution time), up to 21% total life cycle carbon savings for existing general-purpose hardware and applications due to hardware over-provisioning, and up to 7.86$\times$ carbon efficiency improvement using advanced 3D integration techniques for resource-constrained XR systems.
Submitted 2 May, 2023;
originally announced May 2023.
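The tCDP metric defined in the abstract above (the product of total CO2e and total application execution time) can be sketched as follows. The two design points and their numbers are invented purely for illustration, and the sketch assumes, by analogy with energy-delay product, that a lower tCDP is preferable; the abstract itself does not state the units or the direction explicitly.

```python
def tcdp(total_co2e_kg: float, exec_time_s: float) -> float:
    """Total carbon-delay product: total CO2e times total execution time."""
    return total_co2e_kg * exec_time_s


# Hypothetical design points (not from the paper).
design_a = tcdp(total_co2e_kg=10.0, exec_time_s=2.0)  # faster, more carbon
design_b = tcdp(total_co2e_kg=4.0, exec_time_s=3.0)   # slower, less carbon

# Ratio of the two products, read here as a relative carbon-efficiency gain.
improvement = design_a / design_b
print(f"tCDP A = {design_a}, tCDP B = {design_b}, A/B = {improvement:.2f}x")
```

The example shows why a product metric is useful: design B is slower, yet its lower total carbon still gives it the smaller tCDP.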
-
GreenScale: Carbon-Aware Systems for Edge Computing
Authors:
Young Geun Kim,
Udit Gupta,
Andrew McCrabb,
Yonglak Son,
Valeria Bertacco,
David Brooks,
Carole-Jean Wu
Abstract:
To improve the environmental implications of the growing demand for computing, future applications need to improve the carbon-efficiency of computing infrastructures. State-of-the-art approaches, however, do not consider the intermittent nature of renewable energy. The time and location-based carbon intensity of energy fueling computing has been ignored when determining how computation is carried out. This poses a new challenge -- deciding when and where to run applications across consumer devices at the edge and servers in the cloud. Such scheduling decisions become more complicated with the stochastic runtime variance and the amortization of the rising embodied emissions. This work proposes GreenScale, a framework to understand the design and optimization space of carbon-aware scheduling for green applications across the edge-cloud infrastructure. Based on the quantified carbon output of the infrastructure components, we demonstrate that optimizing for carbon, compared to performance and energy efficiency, yields unique scheduling solutions. Our evaluation with three representative categories of applications (i.e., AI, Game, and AR/VR) demonstrates that the carbon emissions of the applications can be reduced by up to 29.1% with GreenScale. The analysis in this work further provides a detailed road map for edge-cloud application developers to build green applications.
Submitted 1 April, 2023;
originally announced April 2023.
-
MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation
Authors:
Samuel Hsia,
Udit Gupta,
Bilge Acun,
Newsha Ardalani,
Pan Zhong,
Gu-Yeon Wei,
David Brooks,
Carole-Jean Wu
Abstract:
Deep learning recommendation systems serve personalized content under diverse tail-latency targets and input-query loads. In order to do so, state-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences over large bodies of contents. The reliance on a fixed embedding representation of embedding tables not only imposes significant memory capacity and bandwidth requirements but also limits the scope of compatible system solutions. This paper challenges the assumption of fixed embedding representations by showing how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic- and system performance. Based on our characterization of various embedding representations, we propose a hybrid embedding representation that achieves higher quality embeddings at the cost of increased memory and compute requirements. To address the system performance challenges of the hybrid representation, we propose MP-Rec -- a co-design technique that exploits heterogeneity and dynamic selection of embedding representations and underlying hardware platforms.
On real system hardware, we demonstrate how matching custom accelerators, i.e., GPUs, TPUs, and IPUs, with compatible embedding representations can lead to 16.65x performance speedup. Additionally, in query-serving scenarios, MP-Rec achieves 2.49x and 3.76x higher correct prediction throughput and 0.19% and 0.22% better model quality on a CPU-GPU system for the Kaggle and Terabyte datasets, respectively.
Submitted 21 February, 2023;
originally announced February 2023.
-
AlignGraph: A Group of Generative Models for Graphs
Authors:
Kimia Shayestehfard,
Dana Brooks,
Stratis Ioannidis
Abstract:
It is challenging for generative models to learn a distribution over graphs because of the lack of permutation invariance: nodes may be ordered arbitrarily across graphs, and standard graph alignment is combinatorial and notoriously expensive. We propose AlignGraph, a group of generative models that combine fast and efficient graph alignment methods with a family of deep generative models that are invariant to node permutations. Our experiments demonstrate that our framework successfully learns graph distributions, outperforming competitors by 25%-560% in relevant performance scores.
Submitted 26 January, 2023;
originally announced January 2023.
-
PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices
Authors:
Yuji Chai,
Devashree Tripathy,
Chuteng Zhou,
Dibakar Gope,
Igor Fedorov,
Ramon Matas,
David Brooks,
Gu-Yeon Wei,
Paul Whatmough
Abstract:
The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN based models. This ability is critical for the (manual or automatic) design, optimization, and deployment of practical DNNs for a specific hardware deployment platform. Unfortunately, these metrics are slow to evaluate using simulators (where available) and typically require measurement on the target hardware. This work describes PerfSAGE, a novel graph neural network (GNN) that predicts inference latency, energy, and memory footprint on an arbitrary DNN TFlite graph (TFL, 2017). In contrast, previously published performance predictors can only predict latency and are restricted to pre-defined construction rules or search spaces. This paper also describes the EdgeDLPerf dataset of 134,912 DNNs randomly sampled from four task search spaces and annotated with inference performance metrics from three edge hardware platforms. Using this dataset, we train PerfSAGE and provide experimental results that demonstrate state-of-the-art prediction accuracy with a Mean Absolute Percentage Error of <5% across all targets and model search spaces. These results: (1) Outperform previous state-of-art GNN-based predictors (Dudziak et al., 2020), (2) Accurately predict performance on accelerators (a shortfall of non-GNN-based predictors (Zhang et al., 2021)), and (3) Demonstrate predictions on arbitrary input graphs without modifications to the feature extractor.
Submitted 26 January, 2023;
originally announced January 2023.
-
GPU-based Private Information Retrieval for On-Device Machine Learning Inference
Authors:
Maximilian Lam,
Jeff Johnson,
Wenjie Xiong,
Kiwan Maeng,
Udit Gupta,
Yang Li,
Liangzhen Lai,
Ilias Leontiadis,
Minsoo Rhu,
Hsien-Hsin S. Lee,
Vijay Janapa Reddi,
Gu-Yeon Wei,
David Brooks,
G. Edward Suh
Abstract:
On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing them to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables each on the order of 1-10 GBs of data, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than $20 \times$ over an optimized CPU PIR implementation, and our PIR-ML co-design provides an over $5 \times$ additional throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to $100,000$ queries per second -- a $>100 \times$ throughput improvement over a CPU-based baseline -- while maintaining model accuracy.
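To make the idea of private information retrieval concrete, here is a minimal sketch of the textbook two-server XOR scheme. It assumes two non-colluding servers that each hold the full table. It is not the GPU-accelerated protocol this paper proposes, and all function names are hypothetical.

```python
import secrets


def make_queries(n_rows, index):
    """Client: build two bit vectors that differ only at the wanted row."""
    q0 = [secrets.randbelow(2) for _ in range(n_rows)]
    q1 = list(q0)
    q1[index] ^= 1  # each vector alone is uniformly random
    return q0, q1


def answer(table, query):
    """Server: XOR together every row whose query bit is set."""
    acc = bytes(len(table[0]))  # all-zero row
    for row, bit in zip(table, query):
        if bit:
            acc = bytes(a ^ b for a, b in zip(acc, row))
    return acc


def reconstruct(r0, r1):
    """Client: XOR the two answers; all rows except the wanted one cancel."""
    return bytes(a ^ b for a, b in zip(r0, r1))


# Toy "embedding table": 8 rows of 4 bytes each.
table = [bytes([i, i + 1, i + 2, i + 3]) for i in range(0, 32, 4)]
q0, q1 = make_queries(len(table), index=5)
row = reconstruct(answer(table, q0), answer(table, q1))
```

Each server scans the whole table for every query, which hints at why off-the-shelf PIR is computationally heavy for latency-sensitive inference.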
Submitted 25 September, 2023; v1 submitted 25 January, 2023;
originally announced January 2023.
-
Architectural Implications of Embedding Dimension during GCN on CPU and GPU
Authors:
Matthew Adiletta,
David Brooks,
Gu-Yeon Wei
Abstract:
Graph Neural Networks (GNNs) are a class of neural networks designed to extract information from the graphical structure of data. Graph Convolutional Networks (GCNs) are a widely used type of GNN for transductive graph learning problems which apply convolution to learn information from graphs. GCN is a challenging algorithm from an architecture perspective due to inherent sparsity, low data reuse, and massive memory capacity requirements. Traditional neural algorithms exploit the high compute capacity of GPUs to achieve high performance for both inference and training. The architectural decision to use a GPU for GCN inference is a question explored in this work. GCN on both CPU and GPU was characterized in order to better understand the implications of graph size, embedding dimension, and sampling on performance.
Submitted 1 December, 2022;
originally announced December 2022.
-
Correlated Feature Aggregation by Region Helps Distinguish Aggressive from Indolent Clear Cell Renal Cell Carcinoma Subtypes on CT
Authors:
Karin Stacke,
Indrani Bhattacharya,
Justin R. Tse,
James D. Brooks,
Geoffrey A. Sonn,
Mirabela Rusu
Abstract:
Renal cell carcinoma (RCC) is a common cancer that varies in clinical behavior. Indolent RCC is often low-grade without necrosis and can be monitored without treatment. Aggressive RCC is often high-grade and can cause metastasis and death if not promptly detected and treated. While most kidney cancers are detected on CT scans, grading is based on histology from invasive biopsy or surgery. Determining aggressiveness on CT images is clinically important as it facilitates risk stratification and treatment planning. This study aims to use machine learning methods to identify radiology features that correlate with features on pathology to facilitate assessment of cancer aggressiveness on CT images instead of histology. This paper presents a novel automated method, Correlated Feature Aggregation By Region (CorrFABR), for classifying aggressiveness of clear cell RCC by leveraging correlations between radiology and corresponding unaligned pathology images. CorrFABR consists of three main steps: (1) Feature Aggregation where region-level features are extracted from radiology and pathology images, (2) Fusion where radiology features correlated with pathology features are learned on a region level, and (3) Prediction where the learned correlated features are used to distinguish aggressive from indolent clear cell RCC using CT alone as input. Thus, during training, CorrFABR learns from both radiology and pathology images, but during inference, CorrFABR will distinguish aggressive from indolent clear cell RCC using CT alone, in the absence of pathology images. CorrFABR improved classification performance over radiology features alone, with an increase in binary classification F1-score from 0.68 (0.04) to 0.73 (0.03). This demonstrates the potential of incorporating pathology disease characteristics for improved classification of aggressiveness of clear cell RCC on CT images.
Submitted 29 September, 2022;
originally announced September 2022.
-
SpeedLimit: Neural Architecture Search for Quantized Transformer Models
Authors:
Yuji Chai,
Luke Bailey,
Yunho Jin,
Matthew Karle,
Glenn G. Ko,
David Brooks,
Gu-Yeon Wei,
H. T. Kung
Abstract:
While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.
Submitted 13 October, 2023; v1 submitted 24 September, 2022;
originally announced September 2022.
-
Impala: Low-Latency, Communication-Efficient Private Deep Learning Inference
Authors:
Woo-Seok Choi,
Brandon Reagen,
Gu-Yeon Wei,
David Brooks
Abstract:
This paper proposes Impala, a new cryptographic protocol for private inference in the client-cloud setting. Impala builds upon recent solutions that combine the complementary strengths of homomorphic encryption (HE) and secure multi-party computation (MPC). A series of protocol optimizations are developed to reduce both communication and performance bottlenecks. First, we remove MPC's overwhelmingly high communication cost from the client by introducing a proxy server and developing a low-overhead key switching technique. Key switching reduces the client's bandwidth by multiple orders of magnitude; however, the communication between the proxy and cloud is still excessive. Second, we develop an optimized garbled circuit that leverages truncated secret shares for faster evaluation and less proxy-cloud communication. Finally, we propose sparse HE convolution to reduce the computational bottleneck of using HE. Compared to the state-of-the-art, these optimizations provide a bandwidth savings of over 3X and a speedup of 4X for private deep learning inference.
Submitted 12 May, 2022;
originally announced May 2022.
-
OMU: A Probabilistic 3D Occupancy Mapping Accelerator for Real-time OctoMap at the Edge
Authors:
Tianyu Jia,
En-Yu Yang,
Yu-Shun Hsiao,
Jonathan Cruz,
David Brooks,
Gu-Yeon Wei,
Vijay Janapa Reddi
Abstract:
Autonomous machines (e.g., vehicles, mobile robots, drones) require sophisticated 3D mapping to perceive the dynamic environment. However, maintaining a real-time 3D map is expensive both in terms of compute and memory requirements, especially for resource-constrained edge machines. Probabilistic OctoMap is a reliable and memory-efficient 3D dense map model to represent the full environment, with dynamic voxel node pruning and expansion capacity. This paper presents the first efficient accelerator solution, i.e. OMU, to enable real-time probabilistic 3D mapping at the edge. To improve the performance, the input map voxels are updated via parallel PE units for data parallelism. Within each PE, the voxels are stored using a specially developed data structure in parallel memory banks. In addition, a pruning address manager is designed within each PE unit to reuse the pruned memory addresses. The proposed 3D mapping accelerator is implemented and evaluated using a commercial 12 nm technology. Compared to the ARM Cortex-A57 CPU in the Nvidia Jetson TX2 platform, the proposed accelerator achieves up to 62$\times$ performance and 708$\times$ energy efficiency improvement. Furthermore, the accelerator provides 63 FPS throughput, more than 2$\times$ higher than a real-time requirement, enabling real-time perception for 3D mapping.
Submitted 6 May, 2022;
originally announced May 2022.
-
BioSimulators: a central registry of simulation engines and services for recommending specific tools
Authors:
Bilal Shaikh,
Lucian P. Smith,
Dan Vasilescu,
Gnaneswara Marupilla,
Michael Wilson,
Eran Agmon,
Henry Agnew,
Steven S. Andrews,
Azraf Anwar,
Moritz E. Beber,
Frank T. Bergmann,
David Brooks,
Lutz Brusch,
Laurence Calzone,
Kiri Choi,
Joshua Cooper,
John Detloff,
Brian Drawert,
Michel Dumontier,
G. Bard Ermentrout,
James R. Faeder,
Andrew P. Freiburger,
Fabian Fröhlich,
Akira Funahashi,
Alan Garny
, et al. (46 additional authors not shown)
Abstract:
Computational models have great potential to accelerate bioscience, bioengineering, and medicine. However, it remains challenging to reproduce and reuse simulations, in part, because the numerous formats and methods for simulating various subsystems and scales remain siloed by different software tools. For example, each tool must be executed through a distinct interface. To help investigators find and use simulation tools, we developed BioSimulators (https://biosimulators.org), a central registry of the capabilities of simulation tools and consistent Python, command-line, and containerized interfaces to each version of each tool. The foundation of BioSimulators is standards, such as CellML, SBML, SED-ML, and the COMBINE archive format, and validation tools for simulation projects and simulation tools that ensure these standards are used consistently. To help modelers find tools for particular projects, we have also used the registry to develop recommendation services. We anticipate that BioSimulators will help modelers exchange, reproduce, and combine simulations.
Submitted 13 March, 2022;
originally announced March 2022.
-
Tabula: Efficiently Computing Nonlinear Activation Functions for Secure Neural Network Inference
Authors:
Maximilian Lam,
Michael Mitzenmacher,
Vijay Janapa Reddi,
Gu-Yeon Wei,
David Brooks
Abstract:
Multiparty computation approaches to secure neural network inference commonly rely on garbled circuits for securely executing nonlinear activation functions. However, garbled circuits require excessive communication between server and client, impose significant storage overheads, and incur large runtime penalties. To reduce these costs, we propose an alternative to garbled circuits: Tabula, an algorithm based on secure lookup tables. Our approach precomputes lookup tables during an offline phase that contains the result of all possible nonlinear function calls. Because these tables incur exponential storage costs in the number of operands and the precision of the input values, we use quantization to reduce these storage costs to make this approach practical. This enables an online phase where securely computing the result of a nonlinear function requires just a single round of communication, with communication cost equal to twice the number of bits of the input to the nonlinear function. In practice our approach costs 2 bytes of communication per nonlinear function call in the online phase. Compared to garbled circuits with 8-bit quantized inputs, when computing individual nonlinear functions during the online phase, experiments show Tabula with 8-bit activations uses between $280$-$560 \times$ less communication, is over $100\times$ faster, and uses a comparable (within a factor of 2) amount of storage; compared against other state-of-the-art protocols Tabula achieves greater than $40\times$ communication reduction. This leads to significant performance gains over garbled circuits with quantized inputs during the online phase of secure inference of neural networks: Tabula reduces end-to-end inference communication by up to $9 \times$ and achieves an end-to-end inference speedup of up to $50 \times$, while imposing comparable storage and offline preprocessing costs.
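The lookup-table idea described in this abstract can be sketched as follows. This is a simplified two-party illustration with additive secret shares over 8-bit values, not the authors' exact protocol. The names (`offline`, `online`, `relu_q`) and the choice of ReLU as the nonlinear function are assumptions made for the example.

```python
import secrets

BITS = 8
N = 1 << BITS  # table size, also the modulus for additive shares


def relu_q(v):
    """ReLU on an 8-bit two's-complement value encoded as 0..255."""
    signed = v - N if v >= N // 2 else v
    return max(signed, 0) % N


def share(v):
    """Split v into two additive shares modulo N."""
    a = secrets.randbelow(N)
    return a, (v - a) % N


def offline(f):
    """Precompute a table of f shifted by a random secret offset s."""
    s = secrets.randbelow(N)
    s0, s1 = share(s)
    t0, t1 = [], []
    for i in range(N):
        a, b = share(f((i - s) % N))  # entry i holds f(i - s)
        t0.append(a)
        t1.append(b)
    return (s0, t0), (s1, t1)


def online(x0, x1, party0, party1):
    """One round: each party reveals its share of x + s (BITS bits each)."""
    (s0, t0), (s1, t1) = party0, party1
    m0 = (x0 + s0) % N  # sent by party 0
    m1 = (x1 + s1) % N  # sent by party 1
    idx = (m0 + m1) % N  # equals x + s; uniform because s is uniform
    return t0[idx], t1[idx]  # shares of f(x)
```

In this sketch each party sends one 8-bit message, so the online exchange totals 2 bytes per call, matching the "twice the number of bits of the input" cost stated above. A table is consumed by one call, since reusing the offset `s` would leak information about the inputs.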
Submitted 16 June, 2024; v1 submitted 5 March, 2022;
originally announced March 2022.
-
Carbon Explorer: A Holistic Approach for Designing Carbon Aware Datacenters
Authors:
Bilge Acun,
Benjamin Lee,
Fiodar Kazhamiaka,
Kiwan Maeng,
Manoj Chakkaravarthy,
Udit Gupta,
David Brooks,
Carole-Jean Wu
Abstract:
Technology companies have been leading the way to a renewable energy transformation, by investing in renewable energy sources to reduce the carbon footprint of their datacenters. In addition to helping build new solar and wind farms, companies make power purchase agreements or purchase carbon offsets, rather than relying on renewable energy every hour of the day, every day of the week (24/7). Relying on renewable energy 24/7 is challenging due to the intermittent nature of wind and solar energy. Inherent variations in solar and wind energy production cause excess or lack of supply at different times. To cope with the fluctuations of renewable energy generation, multiple solutions must be applied. These include: capacity sizing with a mix of solar and wind power, energy storage options, and carbon aware workload scheduling. However, depending on the region and datacenter workload characteristics, the carbon-optimal solution varies. Existing work in this space does not give a holistic view of the trade-offs of each solution and often ignores the embodied carbon cost of the solutions. In this work, we provide a framework, Carbon Explorer, to analyze the multi-dimensional solution space by taking into account the operational and embodied footprint of the solutions to help make datacenters operate on renewable energy 24/7. The solutions we analyze include capacity sizing with a mix of solar and wind power, battery storage, and carbon aware workload scheduling, which entails shifting the workloads from times when there is a lack of renewable supply to times with abundant supply.
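The workload-shifting idea at the end of this abstract can be sketched with a toy greedy scheduler. This is an illustrative assumption, not the Carbon Explorer framework: it ignores deadlines, embodied carbon, and batteries, and lets load move to any hour of the day. All names and numbers are hypothetical.

```python
def renewable_coverage(supply, demand):
    """Fraction of demand met by renewable supply within the same hour."""
    met = sum(min(s, d) for s, d in zip(supply, demand))
    return met / sum(demand)


def shift_flexible_load(supply, demand, flexible_fraction):
    """Greedily move deferrable load from deficit hours into surplus hours."""
    shifted = list(demand)
    surplus = [max(s - d, 0.0) for s, d in zip(supply, demand)]
    for h, (s, d) in enumerate(zip(supply, demand)):
        # Only the deficit, capped by the deferrable share, may move.
        movable = min(max(d - s, 0.0), flexible_fraction * d)
        for t in range(len(shifted)):
            if movable <= 0:
                break
            take = min(movable, surplus[t])
            shifted[h] -= take
            shifted[t] += take
            surplus[t] -= take
            movable -= take
    return shifted


# Hypothetical six-period day: supply peaks mid-day, demand is flat.
supply = [0.0, 0.0, 8.0, 8.0, 2.0, 0.0]
demand = [4.0] * 6
before = renewable_coverage(supply, demand)
after = renewable_coverage(supply, shift_flexible_load(supply, demand, 0.5))
```

With half of each hour's load deferrable, coverage in this toy case rises from 10/24 to 18/24 while total energy consumed stays the same.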
Submitted 21 February, 2023; v1 submitted 24 January, 2022;
originally announced January 2022.
-
Trireme: Exploring Hierarchical Multi-Level Parallelism for Domain Specific Hardware Acceleration
Authors:
Georgios Zacharopoulos,
Adel Ejjeh,
Ying Jing,
En-Yu Yang,
Tianyu Jia,
Iulian Brumar,
Jeremy Intan,
Muhammad Huzaifa,
Sarita Adve,
Vikram Adve,
Gu-Yeon Wei,
David Brooks
Abstract:
The design of heterogeneous systems that include domain specific accelerators is a challenging and time-consuming process. While taking into account area constraints, designers must decide which parts of an application to accelerate in hardware and which to leave in software. Moreover, applications in domains such as Extended Reality (XR) offer opportunities for various forms of parallel execution, including loop level, task level and pipeline parallelism. To assist the design process and expose every possible level of parallelism, we present Trireme, a fully automated tool-chain that explores multiple levels of parallelism and produces domain specific accelerator designs and configurations that maximize performance, given an area budget. Experiments on demanding benchmarks from the XR domain revealed a speedup of up to 20x, as well as a speedup of up to 37x for smaller applications, compared to software-only implementations.
Submitted 21 January, 2022;
originally announced January 2022.
-
Bridging the gap between prostate radiology and pathology through machine learning
Authors:
Indrani Bhattacharya,
David S. Lim,
Han Lin Aung,
Xingchen Liu,
Arun Seetharaman,
Christian A. Kunder,
Wei Shao,
Simon J. C. Soerensen,
Richard E. Fan,
Pejman Ghanouni,
Katherine J. To'o,
James D. Brooks,
Geoffrey A. Sonn,
Mirabela Rusu
Abstract:
Prostate cancer is the second deadliest cancer for American men. While Magnetic Resonance Imaging (MRI) is increasingly used to guide targeted biopsies for prostate cancer diagnosis, its utility remains limited due to high rates of false positives and false negatives as well as low inter-reader agreements. Machine learning methods to detect and localize cancer on prostate MRI can help standardize radiologist interpretations. However, existing machine learning methods vary not only in model architecture, but also in the ground truth labeling strategies used for model training. In this study, we compare different labeling strategies, namely, pathology-confirmed radiologist labels, pathologist labels on whole-mount histopathology images, and lesion-level and pixel-level digital pathologist labels (previously validated deep learning algorithm on histopathology images to predict pixel-level Gleason patterns) on whole-mount histopathology images. We analyse the effects these labels have on the performance of the trained machine learning models. Our experiments show that (1) radiologist labels and models trained with them can miss cancers, or underestimate cancer extent, (2) digital pathologist labels and models trained with them have high concordance with pathologist labels, and (3) models trained with digital pathologist labels achieve the best performance in prostate cancer detection in two different cohorts with different disease distributions, irrespective of the model architecture used. Digital pathologist labels can reduce challenges associated with human annotations, including labor, time, inter- and intra-reader variability, and can help bridge the gap between prostate radiology and pathology by enabling the training of reliable machine learning models to detect and localize prostate cancer on MRI.
Submitted 3 December, 2021;
originally announced December 2021.
-
Early DSE and Automatic Generation of Coarse Grained Merged Accelerators
Authors:
Iulian Brumar,
Georgios Zacharopoulos,
Yuan Yao,
Saketh Rama,
Gu-Yeon Wei,
David Brooks
Abstract:
Post-Moore's law area-constrained systems rely on accelerators to deliver performance enhancements. Coarse grained accelerators can offer substantial domain acceleration, but manual, ad-hoc identification of code to accelerate is prohibitively expensive. Because cycle-accurate simulators and high-level synthesis flows are so time-consuming, manual creation of high-utilization accelerators that exploit control and data flow patterns at optimal granularities is rarely successful. To address these challenges, we present AccelMerger, the first automated methodology to create coarse grained, control- and data-flow-rich, merged accelerators. AccelMerger uses sequence alignment matching to recognize similar function call-graphs and loops, and neural networks to quickly evaluate their post-HLS characteristics. It accurately identifies which functions to accelerate, and it merges accelerators to respect an area budget and to accommodate system communication characteristics like latency and bandwidth. Merging two accelerators can save as much as 99% of the area of one. The space saved is used by a globally optimal integer linear program to allocate more accelerators for increased performance. We demonstrate AccelMerger's effectiveness using HLS flows without any manual effort to fine-tune the resulting designs. On FPGA-based systems, AccelMerger yields application performance improvements of up to 16.7x over software implementations, and 1.91x on average with respect to state-of-the-art early-stage design space exploration tools.
Submitted 17 November, 2021;
originally announced November 2021.
-
Unsupervised Approaches for Out-Of-Distribution Dermoscopic Lesion Detection
Authors:
Max Torop,
Sandesh Ghimire,
Wenqian Liu,
Dana H. Brooks,
Octavia Camps,
Milind Rajadhyaksha,
Jennifer Dy,
Kivanc Kose
Abstract:
There are limited works showing the efficacy of unsupervised Out-of-Distribution (OOD) methods on complex medical data. Here, we present preliminary findings of our unsupervised OOD detection algorithm, SimCLR-LOF, as well as a recent state of the art approach (SSD), applied on medical images. SimCLR-LOF learns semantically meaningful features using SimCLR and uses LOF for scoring if a test sample is OOD. We evaluated on the multi-source International Skin Imaging Collaboration (ISIC) 2019 dataset, and show results that are competitive with SSD as well as with recent supervised approaches applied on the same data.
Submitted 8 November, 2021;
originally announced November 2021.
-
Sustainable AI: Environmental Implications, Challenges and Opportunities
Authors:
Carole-Jean Wu,
Ramya Raghavendra,
Udit Gupta,
Bilge Acun,
Newsha Ardalani,
Kiwan Maeng,
Gloria Chang,
Fiona Aga Behram,
James Huang,
Charles Bai,
Michael Gschwind,
Anurag Gupta,
Myle Ott,
Anastasia Melnikov,
Salvatore Candido,
David Brooks,
Geeta Chauhan,
Benjamin Lee,
Hsien-Hsin S. Lee,
Bugra Akyildiz,
Maximilian Balandat,
Joe Spisak,
Ravi Jain,
Mike Rabbat,
Kim Hazelwood
Abstract:
This paper explores the environmental impact of the super-linear growth trends for AI from a holistic perspective, spanning Data, Algorithms, and System Hardware. We characterize the carbon footprint of AI computing by examining the model development cycle across industry-scale machine learning use cases and, at the same time, considering the life cycle of system hardware. Taking a step further, we capture the operational and manufacturing carbon footprint of AI computing and present an end-to-end analysis for what and how hardware-software design and at-scale optimization can help reduce the overall carbon footprint of AI. Based on the industry experience and lessons learned, we share the key challenges and chart out important development directions across the many dimensions of AI. We hope the key messages and insights presented in this paper can inspire the community to advance the field of AI in an environmentally-responsible manner.
Submitted 9 January, 2022; v1 submitted 30 October, 2021;
originally announced November 2021.
-
Variation is the Norm: Brain State Dynamics Evoked By Emotional Video Clips
Authors:
Ashutosh Singh,
Christiana Westlin,
Hedwig Eisenbarth,
Elizabeth A. Reynolds Losin,
Jessica R. Andrews-Hanna,
Tor D. Wager,
Ajay B. Satpute,
Lisa Feldman Barrett,
Dana H. Brooks,
Deniz Erdogmus
Abstract:
For the last several decades, emotion research has attempted to identify a "biomarker" or consistent pattern of brain activity to characterize a single category of emotion (e.g., fear) that will remain consistent across all instances of that category, regardless of individual and context. In this study, we investigated variation rather than consistency during emotional experiences while people watched video clips chosen to evoke instances of specific emotion categories. Specifically, we developed a sequential probabilistic approach to model the temporal dynamics in a participant's brain activity during video viewing. We characterized brain states during these clips as distinct state occupancy periods between state transitions in blood oxygen level dependent (BOLD) signal patterns. We found substantial variation in the state occupancy probability distributions across individuals watching the same video, supporting the hypothesis that when it comes to the brain correlates of emotional experience, variation may indeed be the norm.
Submitted 24 October, 2021;
originally announced October 2021.
-
NVMExplorer: A Framework for Cross-Stack Comparisons of Embedded Non-Volatile Memories
Authors:
Lillian Pentecost,
Alexander Hankin,
Marco Donato,
Mark Hempstead,
Gu-Yeon Wei,
David Brooks
Abstract:
Repeated off-chip memory accesses to DRAM drive up operating power for data-intensive applications, and SRAM technology scaling and leakage power limit the efficiency of embedded memories. Future on-chip storage will need higher density and energy efficiency, and the actively expanding field of emerging, embeddable non-volatile memory (eNVM) technologies is providing many potential candidates to satisfy this need. Each technology proposal presents distinct trade-offs in terms of density, read, write, and reliability characteristics, and we present a comprehensive framework for navigating and quantifying these design trade-offs alongside realistic system constraints and application-level impacts. This work evaluates eNVM-based storage for a range of application and system contexts including machine learning on the edge, graph analytics, and general purpose cache hierarchy, in addition to describing a freely available (http://nvmexplorer.seas.harvard.edu/) set of tools for application experts, system designers, and device experts to better understand, compare, and quantify the next generation of embedded memory solutions.
Submitted 11 January, 2022; v1 submitted 2 September, 2021;
originally announced September 2021.
-
Application-driven Design Exploration for Dense Ferroelectric Embedded Non-volatile Memories
Authors:
Mohammad Mehdi Sharifi,
Lillian Pentecost,
Ramin Rajaei,
Arman Kazemi,
Qiuwen Lou,
Gu-Yeon Wei,
David Brooks,
Kai Ni,
X. Sharon Hu,
Michael Niemier,
Marco Donato
Abstract:
The memory wall bottleneck is a key challenge across many data-intensive applications. Multi-level FeFET-based embedded non-volatile memories are a promising solution for denser and more energy-efficient on-chip memory. However, reliable multi-level cell storage requires careful optimizations to minimize the design overhead costs. In this work, we investigate the interplay between FeFET device characteristics, programming schemes, and memory array architecture, and explore different design choices to optimize performance, energy, area, and accuracy metrics for critical data-intensive workloads. From our cross-stack design exploration, we find that we can store DNN weights and social network graphs at a density of over 8MB/mm^2 and sub-2ns read access latency without loss in application accuracy.
Submitted 17 June, 2021;
originally announced June 2021.
-
Gradient Disaggregation: Breaking Privacy in Federated Learning by Reconstructing the User Participant Matrix
Authors:
Maximilian Lam,
Gu-Yeon Wei,
David Brooks,
Vijay Janapa Reddi,
Michael Mitzenmacher
Abstract:
We show that aggregated model updates in federated learning may be insecure. An untrusted central server may disaggregate user updates from sums of updates across participants given repeated observations, enabling the server to recover privileged information about individual users' private training data via traditional gradient inference attacks. Our method revolves around reconstructing participant information (e.g., which rounds of training users participated in) from aggregated model updates by leveraging summary information from device analytics commonly used to monitor, debug, and manage federated learning systems. Our attack is parallelizable and we successfully disaggregate user updates on settings with up to thousands of participants. We quantitatively and qualitatively demonstrate significant improvements in the capability of various inference attacks on the disaggregated updates. Our attack enables the attribution of learned properties to individual users, violating anonymity, and shows that a determined central server may undermine the secure aggregation protocol to break individual users' data privacy in federated learning.
Submitted 10 June, 2021;
originally announced June 2021.
-
MAVFI: An End-to-End Fault Analysis Framework with Anomaly Detection and Recovery for Micro Aerial Vehicles
Authors:
Yu-Shun Hsiao,
Zishen Wan,
Tianyu Jia,
Radhika Ghosal,
Abdulrahman Mahmoud,
Arijit Raychowdhury,
David Brooks,
Gu-Yeon Wei,
Vijay Janapa Reddi
Abstract:
Safety and resilience are critical for autonomous unmanned aerial vehicles (UAVs). We introduce MAVFI, the micro aerial vehicles (MAVs) resilience analysis methodology to assess the effect of silent data corruption (SDC) on UAVs' mission metrics, such as flight time and success rate, for accurately measuring system resilience. To enhance the safety and resilience of robot systems bound by size, weight, and power (SWaP), we offer two low-overhead anomaly-based SDC detection and recovery algorithms based on Gaussian statistical models and autoencoder neural networks. Our anomaly error protection techniques are validated in numerous simulated environments. We demonstrate that the autoencoder-based technique can recover up to all failure cases in our studied scenarios with a computational overhead of no more than 0.0062%. Our application-aware resilience analysis framework, MAVFI, can be utilized to comprehensively test the resilience of other Robot Operating System (ROS)-based applications and is publicly available at https://github.com/harvard-edge/MAVBench/tree/mavfi.
Submitted 30 January, 2023; v1 submitted 26 May, 2021;
originally announced May 2021.
-
RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance
Authors:
Udit Gupta,
Samuel Hsia,
Jeff Zhang,
Mark Wilkening,
Javin Pombra,
Hsien-Hsin S. Lee,
Gu-Yeon Wei,
Carole-Jean Wu,
David Brooks
Abstract:
Deep learning recommendation systems must provide high quality, personalized content under strict tail-latency targets and high system loads. This paper presents RecPipe, a system to jointly optimize recommendation quality and inference performance. Central to RecPipe is decomposing recommendation models into multi-stage pipelines to maintain quality while reducing compute complexity and exposing distinct parallelism opportunities. RecPipe implements an inference scheduler to map multi-stage recommendation engines onto commodity, heterogeneous platforms (e.g., CPUs, GPUs). While the hardware-aware scheduling improves ranking efficiency, the commodity platforms suffer from many limitations requiring specialized hardware. Thus, we design RecPipeAccel (RPAccel), a custom accelerator that jointly optimizes quality, tail-latency, and system throughput. RPAccel is designed specifically to exploit the distinct design space opened via RecPipe. In particular, RPAccel processes queries in sub-batches to pipeline recommendation stages, implements dual static and dynamic embedding caches, a set of top-k filtering units, and a reconfigurable systolic array. Compared to prior-art and at iso-quality, we demonstrate that RPAccel improves latency and throughput by 3x and 6x.
Submitted 22 May, 2021; v1 submitted 18 May, 2021;
originally announced May 2021.