-
Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnects
Authors:
Julian Bellavita,
Lorenzo Pichetti,
Thomas Pasquali,
Flavio Vella,
Giulia Guidi
Abstract:
The multiplication of two sparse matrices, known as SpGEMM, is a key kernel in scientific computing and large-scale data analytics, underpinning graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. The unstructured sparsity makes achieving high performance challenging because it limits both memory efficiency and scalability. In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth, resulting in unnecessary data movement and synchronization that leave the fast intra-node interconnect underexploited. To address these challenges, we introduce Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that uses communication-avoiding techniques and asynchronous communication to exploit the hierarchical and heterogeneous architecture of modern supercomputing interconnects. Central to Trident is the novel trident partitioning scheme, which enables hierarchy-aware decomposition and reduces inter-node communication by leveraging the higher bandwidth between GPUs within a node compared to across nodes. Here, we evaluate Trident on unstructured matrices, achieving up to $2.38\times$ speedup over a 2D SpGEMM with a corresponding geometric mean speedup of $1.54\times$. Trident reduces inter-node communication volume by up to $2\times$ on NERSC's Perlmutter supercomputer. Furthermore, we demonstrate the effectiveness of Trident in speeding up Markov Clustering, achieving up to $2\times$ speedup compared to competing strategies.
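The trident partitioning scheme itself is specific to the paper, but the payoff of hierarchy-aware placement can be illustrated generically. The sketch below (all function names are ours, not Trident's API) counts how many pairwise exchanges in the row phase of a 2D process grid stay inside a node under two rank-to-node placements:

```python
def comm_split(pairs, node_of):
    # Split pairwise exchanges into (intra-node, inter-node) counts
    # given a rank -> node placement.
    intra = sum(1 for a, b in pairs if node_of[a] == node_of[b])
    return intra, len(pairs) - intra

def row_pairs(rows, cols):
    # Pairwise exchanges of a row-wise phase in a 2D process grid:
    # every pair of ranks in the same grid row communicates.
    pairs = []
    for r in range(rows):
        row = [r * cols + c for c in range(cols)]
        pairs += [(a, b) for i, a in enumerate(row) for b in row[i + 1:]]
    return pairs

pairs = row_pairs(4, 4)                    # 4x4 grid, 4 GPUs per node
aligned = {p: p // 4 for p in range(16)}   # one grid row per node
scattered = {p: p % 4 for p in range(16)}  # rows striped across nodes

print(comm_split(pairs, aligned))    # (24, 0): all row traffic intra-node
print(comm_split(pairs, scattered))  # (0, 24): all row traffic inter-node
```

Aligning one grid dimension with the node boundary keeps that phase on the fast intra-node interconnect; a hierarchy-aware decomposition generalizes this idea.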
Submitted 24 March, 2026; v1 submitted 22 March, 2026;
originally announced March 2026.
-
Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization
Authors:
Francesco Pio Monaco,
Elia Cunegatti,
Flavio Vella,
Giovanni Iacca
Abstract:
Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called \emph{calibration data}) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both within and across tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce \texttt{\textbf{ZipCal}}, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive for large models and datasets, while \texttt{\textbf{ZipCal}} is on average $\sim$240$\times$ faster due to its tractable linear complexity\footnote{We make the code and the experiments available at https://github.com/FrancescoMonaco/ZipCal.}.
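As a rough, hypothetical illustration of Zipf-based curation (the scoring rule below is our guess at the flavor of the approach, not ZipCal's actual criterion): score each candidate sample by how closely its rank-frequency curve follows the ideal Zipf exponent of -1, and keep the closest ones.

```python
from collections import Counter
import math

def zipf_slope(tokens):
    # Slope of the log-log rank-frequency curve (the Zipf exponent),
    # fitted by ordinary least squares.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r + 1) for r in range(len(freqs))]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs) or 1.0
    return num / den

def select_calibration(samples, k):
    # Keep the k samples whose token statistics sit closest to the
    # ideal Zipf exponent of -1 (a proxy for lexical diversity).
    return sorted(samples, key=lambda s: abs(zipf_slope(s.split()) + 1.0))[:k]

print(select_calibration(["a a b", "x x x x"], 1))  # ['a a b']
```

Whatever the exact criterion, the cost stays linear in the token count, which is where the claimed speedup over perplexity-based selection comes from.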
Submitted 7 April, 2026; v1 submitted 17 March, 2026;
originally announced March 2026.
-
NET4EXA: Pioneering the Future of Interconnects for Supercomputing and AI
Authors:
Michele Martinelli,
Roberto Ammendola,
Andrea Biagioni,
Carlotta Chiarini,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Pier Stanislao Paolucci,
Elena Pastorelli,
Pierpaolo Perticaroli,
Luca Pontisso,
Cristian Rossi,
Francesco Simula,
Piero Vicini,
David Colin,
Grégoire Pichon,
Alexandre Louvet,
John Gliksberg,
Claire Chen,
Matteo Turisini,
Andrea Monterubbiano,
Jean-Philippe Nominé,
Denis Dutoit,
Hugo Taboada,
Lilia Zaourar
, et al. (19 additional authors not shown)
Abstract:
NET4EXA aims to develop a next-generation high-performance interconnect for HPC and AI systems, addressing the increasing demands of large-scale infrastructures, such as those required for training Large Language Models. Building upon the proven BXI (Bull eXascale Interconnect) European technology used in TOP15 supercomputers, NET4EXA will deliver the new BXI release, BXIv3, a complete hardware and software interconnect solution, including switch and network interface components. The project will integrate a fully functional pilot system at TRL 8, ready for deployment into upcoming exascale and post-exascale systems from 2025 onward. Leveraging prior research from European initiatives like RED-SEA, the previous achievements of consortium partners, and over 20 years of expertise from BULL, NET4EXA also lays the groundwork for the future generation of BXI, BXIv4, providing analysis and preliminary design. The project will use a hybrid development and co-design approach, combining commercial switch technology with custom IP and FPGA-based NICs. The performance of the NET4EXA BXIv3 interconnect will be evaluated using a broad portfolio of benchmarks, scientific scalable applications, and AI workloads.
Submitted 27 January, 2026;
originally announced January 2026.
-
Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs
Authors:
Julian Bellavita,
Matthew Rubino,
Nakul Iyer,
Andrew Chang,
Aditya Devarakonda,
Flavio Vella,
Giulia Guidi
Abstract:
Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work has accelerated Kernel K-means by formulating it using sparse linear algebra primitives and implementing it on a single GPU. However, that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory.
In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems. Our approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives uniquely tailored for Kernel K-means, enabling highly scalable implementations that efficiently cluster million-scale datasets. Central to our work is the design of partitioning schemes that enable communication-efficient composition of the linear algebra primitives that appear in Kernel K-means.
Our 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical. On 256 GPUs, it achieves a geometric mean weak scaling efficiency of $79.7\%$ and a geometric mean strong scaling speedup of $4.2\times$. Compared to our 1D algorithm, the 1.5D approach achieves up to a $3.6\times$ speedup on 256 GPUs and reduces clustering time from over an hour to under two seconds relative to a single-GPU sliding window implementation. Our results show that distributed algorithms designed with application-specific linear algebraic formulations can achieve substantial performance improvement.
Submitted 27 January, 2026; v1 submitted 23 January, 2026;
originally announced January 2026.
-
Architecture, Simulation and Software Stack to Support Post-CMOS Accelerators: The ARCHYTAS Project
Authors:
Giovanni Agosta,
Stefano Cherubin,
Derek Christ,
Francesco Conti,
Asbjørn Djupdal,
Matthias Jung,
Georgios Keramidas,
Roberto Passerone,
Paolo Rech,
Elisa Ricci,
Philippe Velha,
Flavio Vella,
Kasim Sinan Yildirim,
Nils Wilbert
Abstract:
ARCHYTAS aims to design and evaluate non-conventional hardware accelerators, in particular, optoelectronic, volatile and non-volatile processing-in-memory, and neuromorphic, to tackle the power, efficiency, and scalability bottlenecks of AI with an emphasis on defense use cases (e.g., autonomous vehicles, surveillance drones, maritime and space platforms). In this paper, we present the system architecture and software stack that ARCHYTAS will develop to integrate and support those accelerators, as well as the simulation software needed for early prototyping of the full system and its components.
Submitted 18 October, 2025;
originally announced October 2025.
-
Dual Natural Gradient Descent for Scalable Training of Physics-Informed Neural Networks
Authors:
Anas Jnini,
Flavio Vella
Abstract:
Natural-gradient methods markedly accelerate the training of Physics-Informed Neural Networks (PINNs), yet their Gauss--Newton update must be solved in the parameter space, incurring a prohibitive $O(n^3)$ time complexity, where $n$ is the number of network trainable weights. We show that exactly the same step can instead be formulated in a generally smaller residual space of size $m = \sum_\gamma N_\gamma d_\gamma$, where each residual class $\gamma$ (e.g. PDE interior, boundary, initial data) contributes $N_\gamma$ collocation points of output dimension $d_\gamma$.
Building on this insight, we introduce \textit{Dual Natural Gradient Descent} (D-NGD). D-NGD computes the Gauss--Newton step in residual space, augments it with a geodesic-acceleration correction at negligible extra cost, and provides both a dense direct solver for modest $m$ and a Nystrom-preconditioned conjugate-gradient solver for larger $m$.
Experimentally, D-NGD scales second-order PINN optimization to networks with up to 12.8 million parameters, delivers final $L^2$ errors one to three orders of magnitude lower than first-order methods (Adam, SGD) and quasi-Newton methods, and -- crucially -- enables natural-gradient training of PINNs at this scale on a single GPU.
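The core identity behind the dual step can be checked on a dense toy problem (a sketch of the idea only; the paper's solvers add geodesic acceleration and Nystrom-preconditioned CG): for a damped Gauss--Newton system, $(J^\top J + cI)^{-1} J^\top r = J^\top (J J^\top + cI)^{-1} r$, so the $n \times n$ parameter-space solve can be replaced by an $m \times m$ residual-space solve.

```python
import numpy as np

def gn_step_primal(J, r, damp=1e-3):
    # Parameter-space Gauss-Newton: solve the n x n system
    # (J^T J + damp*I) d = J^T r -- O(n^3) in the network size.
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + damp * np.eye(n), J.T @ r)

def gn_step_dual(J, r, damp=1e-3):
    # Residual-space (dual) step: solve the m x m system
    # (J J^T + damp*I) lam = r, then lift back with d = J^T lam.
    m = J.shape[0]
    lam = np.linalg.solve(J @ J.T + damp * np.eye(m), r)
    return J.T @ lam

# The two steps coincide: (J^T J + cI)^{-1} J^T = J^T (J J^T + cI)^{-1}.
rng = np.random.default_rng(0)
J, r = rng.standard_normal((5, 8)), rng.standard_normal(5)
print(np.allclose(gn_step_primal(J, r), gn_step_dual(J, r)))  # True
```

When the residual count $m$ is much smaller than the parameter count $n$, the dual solve is the cheaper of the two, which is the regime the abstract describes.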
Submitted 8 October, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
Physics-constrained DeepONet for Surrogate CFD models: a curved backward-facing step case
Authors:
Anas Jnini,
Harshinee Goordoyal,
Sujal Dave,
Flavio Vella,
Katharine H. Fraser,
Artem Korobenko
Abstract:
The Physics-Constrained DeepONet (PC-DeepONet), an architecture that incorporates fundamental physics knowledge into the data-driven DeepONet model, is presented in this study. This methodology is exemplified through surrogate modeling of fluid dynamics over a curved backward-facing step, a benchmark problem in computational fluid dynamics. The model was trained on computational fluid dynamics data generated for a range of parameterized geometries. The PC-DeepONet was able to learn the mapping from the parameters describing the geometry to the velocity and pressure fields. While the DeepONet is solely data-driven, the PC-DeepONet imposes the divergence constraint from the continuity equation onto the network. The PC-DeepONet demonstrates higher accuracy than the data-driven baseline, especially when trained on sparse data. Both models converge with a small dataset of 50 samples in only 50 iterations, highlighting the efficiency of neural operators in learning the dynamics governed by partial differential equations.
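One common way to impose the continuity constraint as a soft penalty (a generic sketch; the paper's exact discretization and loss formulation may differ) is to add the mean squared divergence of the predicted velocity field to the training loss:

```python
import numpy as np

def divergence_penalty(u, v, h):
    # Mean squared discrete divergence du/dx + dv/dy of a 2D velocity
    # field sampled on a uniform grid with spacing h (forward differences).
    du_dx = (u[1:, :-1] - u[:-1, :-1]) / h
    dv_dy = (v[:-1, 1:] - v[:-1, :-1]) / h
    return float(np.mean((du_dx + dv_dy) ** 2))

# An incompressible field, e.g. u = x, v = -y, incurs zero penalty.
h = 1.0
X, Y = np.meshgrid(np.arange(5.0), np.arange(5.0), indexing="ij")
print(divergence_penalty(X, -Y, h))  # 0.0
```

A divergence-free prediction is left untouched by the penalty, while any violation of mass conservation is pushed down during training.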
Submitted 14 March, 2025;
originally announced March 2025.
-
Riemann Tensor Neural Networks: Learning Conservative Systems with Physics-Constrained Networks
Authors:
Anas Jnini,
Lorenzo Breschi,
Flavio Vella
Abstract:
Divergence-free symmetric tensors (DFSTs) are fundamental in continuum mechanics, encoding conservation laws such as mass and momentum conservation. We introduce Riemann Tensor Neural Networks (RTNNs), a novel neural architecture that inherently satisfies the DFST condition to machine precision, providing a strong inductive bias for enforcing these conservation laws. We prove that RTNNs can approximate any sufficiently smooth DFST with arbitrary precision and demonstrate their effectiveness as surrogates for conservative PDEs, achieving improved accuracy across benchmarks. This work is the first to use DFSTs as an inductive bias in neural PDE surrogates and to explicitly enforce the conservation of both mass and momentum within a physics-constrained neural architecture.
Submitted 16 June, 2025; v1 submitted 2 March, 2025;
originally announced March 2025.
-
Popcorn: Accelerating Kernel K-means on GPUs through Sparse Linear Algebra
Authors:
Julian Bellavita,
Thomas Pasquali,
Laura Del Rio Martin,
Flavio Vella,
Giulia Guidi
Abstract:
K-means is a popular clustering algorithm with significant applications in numerous scientific and engineering areas. One drawback of K-means is its inability to identify non-linearly separable clusters, which may lead to inaccurate solutions in certain cases. Kernel K-means is a variant of classical K-means that can find non-linearly separable clusters. However, it scales quadratically with respect to the size of the dataset, taking several minutes to cluster even medium-sized datasets on traditional CPU-based machines. In this paper, we present a formulation of Kernel K-means using sparse-dense matrix multiplication (SpMM) and sparse matrix-vector multiplication (SpMV), and we show that our formulation enables the rapid implementation of a fast GPU-based version of Kernel K-means with little programming effort. Our implementation, named Popcorn, is the first open-source GPU-based implementation of Kernel K-means. Popcorn achieves a speedup of up to 123.8x over a CPU implementation of Kernel K-means and a speedup of up to 2.6x over a GPU implementation of Kernel K-means that does not use sparse matrix computations. Our results support the effectiveness of sparse matrices as tools for efficient parallel programming.
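The linear-algebraic formulation can be sketched in a few lines (dense NumPy here for clarity; Popcorn stores the membership matrix sparsely so that `K @ W` becomes an SpMM): the squared kernel distance from point $i$ to centroid $c$ is $K_{ii} - 2(KW)_{ic} + (W^\top K W)_{cc}$, where column $c$ of $W$ is $1/|C_c|$ on the members of cluster $c$.

```python
import numpy as np

def kernel_kmeans_step(K, labels, k):
    # One assignment update of Kernel K-means as pure matrix algebra.
    # W is the n x k normalized membership matrix (sparse in Popcorn,
    # dense here; no empty-cluster handling in this sketch).
    n = K.shape[0]
    W = np.zeros((n, k))
    for c in range(k):
        members = np.where(labels == c)[0]
        W[members, c] = 1.0 / len(members)
    KW = K @ W                 # the SpMM in the GPU implementation
    quad = np.diag(W.T @ KW)   # per-cluster quadratic terms
    # dist^2(i, c) = K_ii - 2 (KW)_{ic} + (W^T K W)_{cc}
    d2 = np.diag(K)[:, None] - 2 * KW + quad[None, :]
    return d2.argmin(axis=1)

# Linear kernel on two 1-D blobs: one step recovers the right clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
K = X @ X.T
print(kernel_kmeans_step(K, np.array([0, 1, 0, 1]), 2))  # [0 0 1 1]
```

Because the update is just a matrix product plus a diagonal extraction, it maps directly onto vendor SpMM/SpMV kernels, which is the observation the paper builds on.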
Submitted 9 January, 2025;
originally announced January 2025.
-
The GECo algorithm for Graph Neural Networks Explanation
Authors:
Salvatore Calderaro,
Domenico Amato,
Giosuè Lo Bosco,
Riccardo Rizzo,
Filippo Vella
Abstract:
Graph Neural Networks (GNNs) are powerful models that can manage complex data sources and their interconnection links. One of GNNs' main drawbacks is their lack of interpretability, which limits their application in sensitive fields. In this paper, we introduce a new methodology involving graph communities to address the interpretability of graph classification problems. The proposed method, called GECo, exploits the idea that if a community is a densely connected subset of graph nodes, this property should play a role in graph classification. This is reasonable, especially if we consider message passing, the basic mechanism of GNNs. GECo analyzes the contribution to the classification result of the communities in the graph, building a mask that highlights graph-relevant structures. GECo is tested for Graph Convolutional Networks on six artificial and four real-world graph datasets and is compared to the main explainability methods such as PGMExplainer, PGExplainer, GNNExplainer, and SubgraphX using four different metrics. GECo outperforms the other methods on the artificial graph datasets and on most real-world datasets.
Submitted 18 November, 2024;
originally announced November 2024.
-
The Landscape of GPU-Centric Communication
Authors:
Didem Unat,
Ilyas Turimbetov,
Mohammed Kefah Taha Issa,
Doğan Sağbili,
Flavio Vella,
Daniele De Sensi,
Ismayil Ismayilov
Abstract:
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches in multi-GPU communication and computation.
This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library supports. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. Then, it explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers insights on how to exploit multi-GPU systems at their best.
Submitted 22 February, 2026; v1 submitted 15 September, 2024;
originally announced September 2024.
-
Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
Authors:
Daniele De Sensi,
Lorenzo Pichetti,
Flavio Vella,
Tiziano De Matteis,
Zebin Ren,
Luigi Fusco,
Matteo Turisini,
Daniele Cesarini,
Kurt Lust,
Animesh Trivedi,
Duncan Roweth,
Filippo Spiga,
Salvatore Di Girolamo,
Torsten Hoefler
Abstract:
Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing their limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many optimization opportunities, ranging from the network layer to the software stack.
Submitted 15 November, 2024; v1 submitted 26 August, 2024;
originally announced August 2024.
-
High Performance Unstructured SpMM Computation Using Tensor Cores
Authors:
Patrik Okanovic,
Grzegorz Kwasniewski,
Paolo Sylos Labini,
Maciej Besta,
Flavio Vella,
Torsten Hoefler
Abstract:
High-performance sparse matrix-matrix (SpMM) multiplication is paramount for science and industry, as the ever-increasing sizes of data prohibit using dense data structures. Yet, existing hardware, such as Tensor Cores (TC), is ill-suited for SpMM, as it imposes strict constraints on data structures that cannot be met by unstructured sparsity found in many applications. To address this, we introduce (S)parse (Ma)trix Matrix (T)ensor Core-accelerated (SMaT): a novel SpMM library that utilizes TCs for unstructured sparse matrices. Our block-sparse library leverages the low-level CUDA MMA (matrix-matrix-accumulate) API, maximizing the performance offered by modern GPUs. Algorithmic optimizations such as sparse matrix permutation further improve performance by minimizing the number of non-zero blocks. The evaluation on NVIDIA A100 shows that SMaT outperforms SotA libraries (DASP, cuSPARSE, and Magicube) by up to 125x (on average 2.6x). SMaT can be used to accelerate many workloads in scientific computing, large-model training, inference, and others.
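Why row permutation helps can be seen with a toy block counter (our own sketch, not SMaT's algorithm; SMaT chooses the permutation automatically while this example picks one by hand): every $b \times b$ tile with at least one nonzero costs a full dense MMA on Tensor Cores, so gathering nonzero rows together shrinks the bill.

```python
import numpy as np

def nnz_blocks(A, b):
    # Count b x b tiles containing at least one nonzero; each such tile
    # is processed as a dense MMA fragment on Tensor Cores.
    count = 0
    for i in range(0, A.shape[0], b):
        for j in range(0, A.shape[1], b):
            if np.any(A[i:i + b, j:j + b]):
                count += 1
    return count

A = np.zeros((4, 4))
A[0, 0] = A[2, 0] = 1.0          # nonzeros in rows 0 and 2
print(nnz_blocks(A, 2))          # 2 tiles touched
perm = [0, 2, 1, 3]              # hand-picked row permutation
print(nnz_blocks(A[perm], 2))    # 1 tile: same nonzeros, half the MMAs
```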
Submitted 21 August, 2024;
originally announced August 2024.
-
Dependable Classical-Quantum Computer Systems Engineering
Authors:
Edoardo Giusto,
Santiago Nuñez-Corrales,
Phuong Cao,
Alessandro Cilardo,
Ravishankar K. Iyer,
Weiwen Jiang,
Paolo Rech,
Flavio Vella,
Bartolomeo Montrucchio,
Samudra Dasgupta,
Travis S. Humble
Abstract:
Quantum Computing (QC) offers the potential to enhance traditional High-Performance Computing (HPC) workloads by leveraging the unique properties of quantum computers, leading to the emergence of a new paradigm: HPC-QC. While this integration presents new opportunities, it also brings novel challenges, particularly in ensuring the dependability of such hybrid systems. This paper aims to identify integration challenges, anticipate failures, and foster a diverse co-design for HPC-QC systems by bringing together QC, cloud computing, HPC, and network security. The focus of this emerging inter-disciplinary effort is to develop engineering principles that ensure the dependability of hybrid systems, aiming for a more prescriptive co-design cycle. Our framework will help to prevent design pitfalls and accelerate the maturation of the QC technology ecosystem. Key aspects include building resilient HPC-QC systems, analyzing the applicability of conventional techniques to the quantum domain, and exploring the complexity of scaling in such hybrid systems. This underscores the need for performance-reliability metrics specific to this new computational paradigm.
Submitted 19 August, 2024;
originally announced August 2024.
-
cuVegas: Accelerate Multidimensional Monte Carlo Integration through a Parallelized CUDA-based Implementation of the VEGAS Enhanced Algorithm
Authors:
Emiliano Tolotti,
Anas Jnini,
Flavio Vella,
Roberto Passerone
Abstract:
This paper introduces cuVegas, a CUDA-based implementation of the Vegas Enhanced Algorithm (VEGAS+), optimized for multi-dimensional integration in GPU environments. The VEGAS+ algorithm is an advanced form of Monte Carlo integration, recognized for its adaptability and effectiveness in handling complex, high-dimensional integrands. It employs a combination of variance reduction techniques, namely adaptive importance sampling and a variant of adaptive stratified sampling, that make it particularly adept at managing integrands with multiple peaks or those aligned with the diagonals of the integration volume. Being a Monte Carlo integration method, the task is well suited for parallelization and for GPU execution. Our implementation, cuVegas, aims to harness the inherent parallelism of GPUs, addressing the challenge of workload distribution that often hampers efficiency in standard implementations. We present a comprehensive analysis comparing cuVegas with existing CPU and GPU implementations, demonstrating significant performance improvements: two to three orders of magnitude over CPU implementations, and a factor of two over the best existing GPU implementation. We also demonstrate the speedup for integrands for which VEGAS+ was designed, with multiple peaks or other significant structures aligned with diagonals of the integration volume.
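The adaptive importance sampling at the heart of VEGAS can be sketched in one dimension (a minimal sketch of the idea only; the real VEGAS+ adds the stratified-sampling component and damped grid updates, and cuVegas parallelizes all of it on the GPU):

```python
import random

def vegas_1d(f, n_iters=5, n_bins=32, n_samples=2000, seed=0):
    # Minimal 1-D flavor of VEGAS-style adaptive importance sampling
    # on [0, 1]: bins adapt so each carries roughly equal |f| mass,
    # and the integral is estimated as the mean of f(x)/p(x).
    rng = random.Random(seed)
    edges = [i / n_bins for i in range(n_bins + 1)]
    estimate = 0.0
    for _ in range(n_iters):
        weights = [0.0] * n_bins
        total = 0.0
        for _ in range(n_samples):
            b = rng.randrange(n_bins)             # pick a bin uniformly
            lo, hi = edges[b], edges[b + 1]
            x = lo + rng.random() * (hi - lo)     # uniform inside the bin
            p = 1.0 / (n_bins * (hi - lo))        # sampling density at x
            fx = f(x)
            total += fx / p                       # importance-sampling sum
            weights[b] += abs(fx) * (hi - lo)     # |f| mass seen per bin
        estimate = total / n_samples
        if sum(weights) == 0.0:
            continue
        # Redraw bin edges so each new bin carries equal |f| mass,
        # concentrating future samples where the integrand is large.
        target = sum(weights) / n_bins
        new_edges, acc, b = [0.0], 0.0, 0
        for cut in range(1, n_bins):
            goal = cut * target
            while acc + weights[b] < goal:        # consume whole bins
                acc += weights[b]
                b += 1
            frac = (goal - acc) / weights[b] if weights[b] > 0 else 0.5
            new_edges.append(edges[b] + frac * (edges[b + 1] - edges[b]))
        new_edges.append(1.0)
        edges = new_edges
    return estimate

print(vegas_1d(lambda x: 3 * x * x))  # close to the exact integral, 1.0
```

Every sample in the inner loop is independent, which is why the method parallelizes so naturally across GPU threads once the per-bin accumulation is handled carefully.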
Submitted 17 August, 2024;
originally announced August 2024.
-
Leveraging Large Language Models for enhanced personalised user experience in Smart Homes
Authors:
Jordan Rey-Jouanchicot,
André Bottaro,
Eric Campo,
Jean-Léon Bouraoui,
Nadine Vigouroux,
Frédéric Vella
Abstract:
Smart home automation systems aim to improve the comfort and convenience of users in their living environment. However, adapting automation to user needs remains a challenge. Indeed, many systems still rely on hand-crafted routines for each smart object. This paper presents an original smart home architecture leveraging Large Language Models (LLMs) and user preferences to push the boundaries of personalisation and intuitiveness in the home environment. This article explores a human-centred approach that uses the general knowledge provided by LLMs to learn and facilitate interactions with the environment. The advantages of the proposed model are demonstrated on a set of scenarios, as well as a comparative analysis with various LLM implementations. Some metrics are assessed to determine the system's ability to maintain comfort, safety, and user preferences. The paper details the approach to real-world implementation and evaluation. The proposed approach of using preferences yields up to a 52.3% increase in average grade, with average processing time reduced by 35.6% on the Starling 7B Alpha LLM. In addition, performance is 26.4% better than the results of larger models without preferences, with processing time almost 20 times faster.
Submitted 28 June, 2024;
originally announced July 2024.
-
On the Efficacy of Surface Codes in Compensating for Radiation Events in Superconducting Devices
Authors:
Marzio Vallero,
Gioele Casagranda,
Flavio Vella,
Paolo Rech
Abstract:
Reliability is fundamental for developing large-scale quantum computers. Since the benefit of technological advancements to the qubit's stability is saturating, algorithmic solutions, such as quantum error correction (QEC) codes, are needed to bridge the gap to reliable computation. Unfortunately, the deployment of the first quantum computers has identified faults induced by natural radiation as an additional threat to qubit reliability. The high sensitivity of qubits to radiation hinders the large-scale adoption of quantum computers, since the persistence and area-of-effect of the fault can potentially undermine the efficacy of even the most advanced QEC codes.
In this paper, we investigate the resilience of various implementations of state-of-the-art QEC codes to radiation-induced faults. We report data from over 400 million fault injections and correlate hardware faults with the logical error observed after decoding the code output, extrapolating physical-to-logical error rates. We compare the code's radiation-induced logical error rate over the code distance, the number and role in the QEC of physical qubits, the underlying quantum computer topology, and particle energy spread in the chip. We show that, by simply selecting and tuning properly the surface code, thus without introducing any overhead, the probability of correcting a radiation-induced fault is increased by up to 10\%. Finally, we provide indications and guidelines for the design of future QEC codes to further increase their effectiveness against radiation-induced events.
Submitted 15 July, 2024;
originally announced July 2024.
-
State of practice: evaluating GPU performance of state vector and tensor network methods
Authors:
Marzio Vallero,
Flavio Vella,
Paolo Rech
Abstract:
The frontier of quantum computing (QC) simulation on classical hardware is quickly reaching the hard scalability limits of computational feasibility. Nonetheless, there is still a need to simulate large quantum systems classically, as Noisy Intermediate Scale Quantum (NISQ) devices are not yet considered fault tolerant or performant enough in terms of operations per second. Each of the two main exact simulation techniques, state vector and tensor network simulators, has specific limitations. The exponential memory requirement of state vector simulation, when compared to the qubit register sizes of currently available quantum computers, quickly saturates the capacity of the top HPC machines currently available. Tensor network contraction approaches, which encode quantum circuits into tensor networks and then contract them over an output bit string to obtain its probability amplitude, are still hampered by the inherent complexity of finding an optimal contraction path, which maps to a max-cut problem on a dense mesh, a notoriously NP-hard problem.
This article investigates the limits of current state-of-the-art simulation techniques on a test bench made of eight widely used quantum subroutines, each in 31 different configurations, with special emphasis on performance. We then correlate the performance measures of the simulators with the metrics that characterise the benchmark circuits, identifying the main reasons behind the observed performance trends. From our observations, given the structure of a quantum circuit and the number of qubits, we highlight how to select the best simulation strategy, obtaining a speedup of up to an order of magnitude.
Submitted 3 February, 2025; v1 submitted 11 January, 2024;
originally announced January 2024.
-
HandiMathKey-Device
Authors:
Frédéric Vella,
Nathalie Dubus,
Eloise Grolleau,
Marjorie Deleau,
Cécile Malet,
Christine Gallard,
Véronique Ades,
Nadine Vigouroux
Abstract:
Typing mathematics with standard text editor functions is sometimes difficult for students with motor impairments and other associated impairments (visual, cognitive). Based on the HandiMathKey software keyboard, a user-centred design method involving the ecosystem of disabled students was applied to design the HMK-D physical keyboard for mathematical input. We opted for the Stream Deck device because of its multimedia features and its appeal to young students. Preliminary tests with 8 students (5 in secondary school and 3 in high school) show that HMK-D is highly accepted, accessible and fun for mathematical input by students with impairments. A longitudinal study of the usability and acceptability of HMK-D is planned for the 2023-2024 school year.
Submitted 23 November, 2023;
originally announced November 2023.
-
A first step towards an ecosystem meta-model for human-centered design in case of disabled users
Authors:
Christophe Kolski,
Nadine Vigouroux,
Yohan Guerrier,
Frédéric Vella,
Marine Guffroy
Abstract:
The involvement of the ecosystem or social environment of the disabled user is considered very useful, and even essential, for the human-centered design of assistive technologies. In the era of model-based approaches, the modeling of this ecosystem must therefore be considered. A first version of an ecosystem metamodel is proposed. It is illustrated through a first case study, which concerns a project aiming at a communication aid for people with cerebral palsy. A conclusion and research perspectives end this paper.
Submitted 23 November, 2023;
originally announced November 2023.
-
Design Recommendations Based on Speech Analysis for Disability-Friendly Interfaces for the Control of a Home Automation Environment
Authors:
Nadine Vigouroux,
Frédéric Vella,
Gaëlle Lepage,
Éric Campo
Abstract:
The objective of this paper is to describe a study of the speech interaction mode for home automation control of equipment by impaired people in inclusive housing. The study is related to the HIP HOPE project, concerning a building of 19 inclusive housing units. 7 participants with different types of disabilities were invited to carry out use cases using voice and touch control. Only the results obtained for the voice interaction mode, through the Amazon voice assistant, are reported here. The results show, according to the type of handicap, the success rates of speech recognition for the commands issued to the equipment, and highlight the errors related to formulation, the noisy environment, speech intelligibility, speech segmentation and poor synchronization of the audio channel opening.
Submitted 22 November, 2023;
originally announced November 2023.
-
Scaling Expected Force: Efficient Identification of Key Nodes in Network-based Epidemic Models
Authors:
Paolo Sylos Labini,
Andrej Jurco,
Matteo Ceccarello,
Stefano Guarino,
Enrico Mastrostefano,
Flavio Vella
Abstract:
Centrality measures are fundamental tools of network analysis as they highlight the key actors within the network. This study focuses on a newly proposed centrality measure, Expected Force (EF), and its use in identifying spreaders in network-based epidemic models. We found that EF effectively predicts the spreading power of nodes and identifies key nodes and immunization targets. However, its high computational cost presents a challenge for its use in large networks. To overcome this limitation, we propose two parallel scalable algorithms for computing EF scores: the first algorithm is based on the original formulation, while the second one focuses on a cluster-centric approach to improve efficiency and scalability. Our implementations significantly reduce computation time, allowing for the detection of key nodes at large scales. Performance analysis on synthetic and real-world networks demonstrates that the GPU implementation of our algorithm can efficiently scale to networks with up to 44 million edges by exploiting modern parallel architectures, achieving speed-ups of up to 300x, and 50x on average, compared to the simple parallel solution.
Submitted 1 June, 2023;
originally announced June 2023.
-
Multi-GPU aggregation-based AMG preconditioner for iterative linear solvers
Authors:
Massimo Bernaschi,
Alessandro Celestini,
Pasqua D'Ambra,
Flavio Vella
Abstract:
We present and release in open source format a sparse linear solver which efficiently exploits heterogeneous parallel computers. The solver can be easily integrated into scientific applications that need to solve large and sparse linear systems on modern parallel computers made of hybrid nodes hosting NVIDIA Graphics Processing Unit (GPU) accelerators.
The work extends our previous efforts in the exploitation of a single GPU accelerator and proposes an implementation, based on the hybrid MPI-CUDA software environment, of a Krylov-type linear solver relying on an efficient Algebraic MultiGrid (AMG) preconditioner already available in the BootCMatchG library. Our design for the hybrid implementation has been driven by the best practices for minimizing data communication overhead when multiple GPUs are employed, while preserving the efficiency of the single-GPU kernels. Strong and weak scalability results for the new version of the library on well-known benchmark test cases are discussed. Comparisons with the Nvidia AmgX solution show an improvement of up to 2.0x in the solve phase.
Submitted 4 March, 2023;
originally announced March 2023.
-
Towards a learning-based performance modeling for accelerating Deep Neural Networks
Authors:
Damiano Perri,
Paolo Sylos Labini,
Osvaldo Gervasi,
Sergio Tasso,
Flavio Vella
Abstract:
Emerging applications such as Deep Learning are often data-driven, thus traditional approaches based on auto-tuners are not performance effective across the wide range of inputs used in practice. In the present paper, we start an investigation of predictive models based on machine learning techniques in order to optimize Convolutional Neural Networks (CNNs). As a use-case, we focus on the ARM Compute Library, which provides three different implementations of the convolution operator at different numeric precisions. Starting from a collection of benchmarks, we build and validate models learned by a Decision Tree and a naive Bayesian classifier. Preliminary experiments on a Midgard-based ARM Mali GPU show that our predictive model outperforms all the convolution operators manually selected by the library.
Submitted 9 December, 2022;
originally announced December 2022.
-
User Centred Method to Design a Platform to Design Augmentative and Alternative Communication Assistive Technologies
Authors:
Frédéric Vella,
Flavien Clastres-Babou,
Nadine Vigouroux,
Philippe Truillet,
Charline Calmels,
Caroline Mercadier,
Karine Gigaud,
Margot Issanchou,
Kristina Gourinovitch,
Anne Garaix
Abstract:
We describe a co-design approach for the online WebSoKeyTo application used to design AAC aids. This co-design was carried out between a team of therapists and a team of human-computer interaction researchers. Our approach begins with the use and evaluation of the existing SoKeyTo AAC design application. This step was essential for the therapists to become aware of and define their needs, and for the researchers to understand SoKeyTo's poor usability scores. We then describe the various phases (focus group, brainstorming, prototyping) along with the co-design choices retained. An evaluation of WebSoKeyTo is in progress.
Submitted 23 November, 2022;
originally announced November 2022.
-
Participation of Stakeholder in the Design of a Conception Application of Augmentative and Alternative Communication
Authors:
Frédéric Vella,
Flavien Clastres-Babou,
Nadine Vigouroux,
Philippe Truillet,
Charline Calmels,
Caroline Mercadier,
Karine Gigaud,
Margot Issanchou,
Kristina Gourinovitch,
Anne Garaix
Abstract:
The objective of this paper is to describe the involvement of an interdisciplinary team in a user-centered design methodology to design a platform (WebSoKeyTo) that meets the needs of therapists designing augmentative and alternative communication (AAC) aids for disabled users. We describe the stages of the design process and the role of the various actors (therapists and human-computer interaction researchers) in its various phases. Finally, we analyze a satisfaction scale filled in by the therapists on their participation in the co-design process. This study demonstrates the value of extending the set of design actors to other therapists and caregivers (professional and family) involved in the daily life of people with disabilities.
Submitted 23 November, 2022;
originally announced November 2022.
-
IDEALI: intuitively localising connected devices in order to support autonomy
Authors:
Frédéric Vella,
Réjane Dalcé,
Antonio Serpa,
Thierry Val,
Adrien van Den Bossche,
Nadine Vigouroux
Abstract:
The ability to localise a smart device is very useful to visually or cognitively impaired people. Localisation-capable technologies are becoming more readily available as off-the-shelf components. In this paper, we highlight the need for such a service in the field of health and autonomy, especially for disabled people. We introduce a model for Semantic Position Description (SPD) (e.g. "The pill organiser is on the kitchen table") as well as various algorithms that transform raw distance estimations into SPDs related to proximity, alignment and room identification. Two of these algorithms are evaluated using the LocURa4IoT testbed. The results are compared to the output of a pre-experiment involving ten human participants in the Maison Intelligente de Blagnac. The two studies indicate that both approaches converge up to 90% of the time.
Submitted 23 November, 2022;
originally announced November 2022.
-
Usability Study of Tactile and Voice Interaction Modes by People with Disabilities for Home Automation Controls
Authors:
Nadine Vigouroux,
Frédéric Vella,
Gaëlle Lepage,
Eric Campo
Abstract:
This paper presents a comparative usability study of tactile and vocal interaction modes for home automation control of equipment at home for different profiles of disabled people. The study is related to the HIP HOPE project, concerning the construction of 19 inclusive housing units in the Toulouse metropolitan area in France. The experimentation took place in a living lab with 7 different disabled people who carried out realistic use cases. The USE and UEQ questionnaires were selected as usability tools. The first results show that both interfaces are easy to learn, but that the usefulness and ease-of-use dimensions need to be improved. This study shows that there is a real need for multimodality between touch and voice interaction to control the smart home. It also shows a need to adapt the interface and the environment to the person's disability.
Submitted 23 November, 2022;
originally announced November 2022.
-
ProbGraph: High-Performance and High-Accuracy Graph Mining with Probabilistic Set Representations
Authors:
Maciej Besta,
Cesare Miglioli,
Paolo Sylos Labini,
Jakub Tětek,
Patrick Iff,
Raghavendra Kanakagiri,
Saleh Ashkboos,
Kacper Janda,
Michal Podstawski,
Grzegorz Kwasniewski,
Niels Gleinig,
Flavio Vella,
Onur Mutlu,
Torsten Hoefler
Abstract:
Important graph mining problems such as Clustering are computationally demanding. To significantly accelerate these problems, we propose ProbGraph: a graph representation that enables simple and fast approximate parallel graph mining with strong theoretical guarantees on work, depth, and result accuracy. The key idea is to represent sets of vertices using probabilistic set representations such as Bloom filters. These representations are much faster to process than the original vertex sets thanks to vectorizability and small size. We use these representations as building blocks in important parallel graph mining algorithms such as Clique Counting or Clustering. When enhanced with ProbGraph, these algorithms significantly outperform tuned parallel exact baselines (up to nearly 50x on 32 cores) while ensuring accuracy of more than 90% for many input graph datasets. Our novel bounds and algorithms based on probabilistic set representations with desirable statistical properties are of separate interest for the data analytics community.
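The core idea, estimating set operations on compact Bloom filters instead of exact vertex sets, can be sketched as follows. This is a simplified illustration, not the ProbGraph implementation: the filter size `m`, the hash count `k`, and the cardinality estimator are generic textbook choices.

```python
import math

class BloomFilter:
    """Fixed-size Bloom filter over integer vertex IDs."""
    def __init__(self, m=4096, k=3):
        self.m, self.k = m, k
        self.bits = 0  # big-int used as a bitset

    def _positions(self, x):
        # k deterministic hash positions (tuple-of-int hashing is stable)
        return [hash((i, x)) % self.m for i in range(self.k)]

    def add(self, x):
        for p in self._positions(x):
            self.bits |= 1 << p

    def __contains__(self, x):
        return all(self.bits >> p & 1 for p in self._positions(x))

    def cardinality(self):
        # Standard estimator: n ~ -(m/k) * ln(1 - X/m), X = number of set bits
        x = bin(self.bits).count("1")
        return -(self.m / self.k) * math.log(1 - x / self.m)

def estimate_intersection(a, b):
    """|A & B| ~ |A| + |B| - |A | B|; the union filter is a bitwise OR."""
    union = BloomFilter(a.m, a.k)
    union.bits = a.bits | b.bits
    return a.cardinality() + b.cardinality() - union.cardinality()

# Neighbourhoods of two vertices sharing 20 common neighbours
na, nb = set(range(50)), set(range(30, 80))
fa, fb = BloomFilter(), BloomFilter()
for v in na:
    fa.add(v)
for v in nb:
    fb.add(v)
est = estimate_intersection(fa, fb)  # approximates |na & nb| = 20
```

In a triangle-counting kernel, such an estimate would stand in for the exact (and much slower) intersection of two adjacency lists, which is where the vectorizability and small size pay off.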
Submitted 21 November, 2022; v1 submitted 24 August, 2022;
originally announced August 2022.
-
Asynchronous Distributed-Memory Triangle Counting and LCC with RMA Caching
Authors:
András Strausz,
Flavio Vella,
Salvatore Di Girolamo,
Maciej Besta,
Torsten Hoefler
Abstract:
Triangle count and local clustering coefficient are two core metrics for graph analysis. They find broad application in analyses such as community detection and link recommendation. Current state-of-the-art solutions suffer from synchronization overheads or expensive pre-computations needed to distribute the graph, achieving limited scaling capabilities. We propose a fully asynchronous implementation of triangle counting and local clustering coefficient based on 1D partitioning, using remote memory accesses for transferring data and avoiding synchronization. Additionally, we show how these algorithms exhibit data reuse in remote memory accesses and how the overall communication time can be improved by caching these accesses. Finally, we extend CLaMPI, a software-layer caching system for MPI RMA, to include application-specific scores for cached entries and to influence the eviction procedure, improving caching efficiency. Our results show improvements on shared memory, and we achieve a 14x speedup from 4 to 64 nodes for the LiveJournal1 graph on distributed memory. Moreover, we demonstrate how caching remote accesses reduces total running time by up to 73% with respect to a non-cached version. Finally, we compare our implementation to TriC, the 2020 Graph Challenge champion, and achieve up to 100x faster results for scale-free graphs.
Submitted 1 March, 2022; v1 submitted 28 February, 2022;
originally announced February 2022.
-
Blocking Techniques for Sparse Matrix Multiplication on Tensor Accelerators
Authors:
Paolo Sylos Labini,
Massimo Bernaschi,
Francesco Silvestri,
Flavio Vella
Abstract:
Tensor accelerators have gained popularity because they provide a cheap and efficient solution for speeding up computationally expensive tasks in Deep Learning and, more recently, in other Scientific Computing applications. However, since their features are specifically designed for tensor algebra (typically dense matrix products), it is commonly assumed that they are not suitable for applications with sparse data. To challenge this viewpoint, we discuss methods and present solutions for accelerating sparse matrix multiplication on such architectures. In particular, we present a 1-dimensional blocking algorithm with theoretical guarantees on the density, which builds dense blocks from arbitrary sparse matrices. Experimental results show that, even for unstructured and highly sparse matrices, our block-based solution exploiting Nvidia Tensor Cores is faster than its sparse counterpart. We observed significant speed-ups of up to two orders of magnitude on real-world sparse matrices.
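The flavour of 1-dimensional blocking can be sketched as follows. This is a hypothetical greedy illustration, not the paper's algorithm or its density guarantees: rows with similar nonzero-column patterns are grouped, and each group is materialised as a dense block over the union of its columns, ready to be fed to a dense (tensor-core) matrix product.

```python
def jaccard(a, b):
    """Similarity of two nonzero-column sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def block_rows(rows, tau=0.5):
    """1-D blocking sketch: rows is {row_id: {col: val}}.  Greedily group
    rows whose column patterns have Jaccard similarity >= tau, then pack
    each group into a dense block over the union of its columns."""
    groups = []  # each entry: [row_id list, union-of-columns set]
    for r, cols in rows.items():
        pattern = set(cols)
        for g in groups:
            if jaccard(pattern, g[1]) >= tau:
                g[0].append(r)
                g[1] |= pattern
                break
        else:
            groups.append([[r], set(pattern)])
    blocks = []
    for row_ids, colset in groups:
        cols = sorted(colset)
        dense = [[rows[r].get(c, 0.0) for c in cols] for r in row_ids]
        blocks.append((row_ids, cols, dense))
    return blocks

# Two clusters of rows with matching sparsity patterns
A = {0: {0: 1.0, 1: 2.0}, 1: {0: 3.0, 1: 4.0},
     2: {7: 5.0, 8: 6.0}, 3: {7: 7.0, 8: 8.0}}
blocks = block_rows(A)
```

The resulting dense blocks trade a few explicit zeros for the regular layout that tensor units require; the threshold `tau` controls that density trade-off in this toy version.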
Submitted 11 February, 2022;
originally announced February 2022.
-
GPU-based parallelism for ASP-solving
Authors:
Agostino Dovier,
Andrea Formisano,
Flavio Vella
Abstract:
Answer Set Programming (ASP) has become the paradigm of choice in the field of logic programming and non-monotonic reasoning. Thanks to the availability of efficient solvers, ASP has been successfully employed in a large number of application domains. The term GPU-computing indicates a recent programming paradigm aimed at enabling the use of modern parallel Graphical Processing Units (GPUs) for general-purpose computing. In this paper we describe an approach to ASP-solving that exploits GPU parallelism. The design of a GPU-based solver poses various challenges due to the peculiarities of GPU software and hardware architectures and to the intrinsic nature of the satisfiability problem.
Submitted 4 September, 2019;
originally announced September 2019.
-
A Computational Model for Tensor Core Units
Authors:
Rezaul Chowdhury,
Francesco Silvestri,
Flavio Vella
Abstract:
To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model.
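The central primitive of such a model, a native multiply of fixed small size, can be mimicked in a few lines. This is a toy sketch under stated assumptions, not the paper's formal TCU model: `S`, `native_mm`, and the tiling loop are illustrative stand-ins for the hardware circuit and the algorithms built on top of it.

```python
S = 2  # native tile size of the hypothetical tensor core unit

def native_mm(a, b):
    """Stand-in for the hardware S x S multiply primitive."""
    assert len(a) <= S and len(b) <= S
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def tcu_matmul(A, B):
    """Blocked n x n matmul expressed purely through native_mm calls."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, S):
        for jj in range(0, n, S):
            for kk in range(0, n, S):
                a = [row[kk:kk + S] for row in A[ii:ii + S]]
                b = [row[jj:jj + S] for row in B[kk:kk + S]]
                c = native_mm(a, b)  # one "hardware" tile product
                for i in range(len(c)):
                    for j in range(len(c[0])):
                        C[ii + i][jj + j] += c[i][j]
    return C

A = [[float(4 * i + j) for j in range(4)] for i in range(4)]
B = [[float((i + j) % 3) for j in range(4)] for i in range(4)]
C = tcu_matmul(A, B)
```

Counting calls to `native_mm` (here (n/S)^3) is the kind of cost accounting such a model enables, analogous to counting block transfers in the external memory model.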
Submitted 9 July, 2020; v1 submitted 19 August, 2019;
originally announced August 2019.
-
A model-driven approach for a new generation of adaptive libraries
Authors:
Marco Cianfriglia,
Flavio Vella,
Cedric Nugteren,
Anton Lokhmotov,
Grigori Fursin
Abstract:
Efficient high-performance libraries often expose multiple tunable parameters to provide highly optimized routines. These can range from simple loop unroll factors or vector sizes all the way to algorithmic changes, given that some implementations can be more suitable for certain devices by exploiting hardware characteristics such as local memories and vector units. Traditionally, such parameters and algorithmic choices are tuned and then hard-coded for a specific architecture and for certain characteristics of the inputs. However, emerging applications are often data-driven, thus traditional approaches are not effective across the wide range of inputs and architectures used in practice. In this paper, we present a new adaptive framework for data-driven applications which uses a predictive model to select the optimal algorithmic parameters by training with synthetic and real datasets. We demonstrate the effectiveness of our approach on a BLAS library, specifically on its matrix multiplication routine. We present experimental results for two GPU architectures and show significant performance gains of up to 3x (on a high-end NVIDIA Pascal GPU) and 2.5x (on an embedded ARM Mali GPU) when compared to a traditionally optimized library.
Submitted 19 June, 2018;
originally announced June 2018.
-
Accelerating Energy Games Solvers on Modern Architectures
Authors:
Andrea Formisano,
Raffaella Gentilini,
Flavio Vella
Abstract:
Quantitative games, where quantitative objectives are defined on weighted game arenas, provide natural tools for designing faithful models of embedded controllers. Instances of these games that have recently gained interest are the so-called Energy Games. The fastest-known algorithm solves Energy Games in O(EVW) time, where W is the maximum weight. Starting from a sequential baseline implementation, we investigate the use of the massively parallel computation capabilities of modern Graphics Processing Units to solve the `initial credit problem' for Energy Games. We present four different parallel implementations on multi-core CPU and GPU systems. Our solution outperforms the baseline implementation by up to 36x and obtains a faster convergence time on real-world graphs.
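The classical value-iteration scheme underlying such solvers can be sketched as follows. This is a minimal sequential sketch of the standard progress-measure idea, not the paper's GPU implementation: each vertex is lifted to the minimal initial credit from which the energy player can keep the running energy non-negative, and credits above a cap (e.g. V*W) are declared infinite.

```python
def solve_energy_game(vertices, edges, owner, cap):
    """Kleene value iteration for the initial-credit problem.
    edges[v] = [(u, w), ...]: moving v -> u adds weight w to the energy.
    owner[v] == 0: the energy player moves (minimises needed credit);
    owner[v] == 1: the opponent moves (maximises it).
    Returns {v: minimal credit, or None if no finite credit suffices}."""
    INF = cap + 1
    f = {v: 0 for v in vertices}
    changed = True
    while changed:
        changed = False
        for v in vertices:
            # credit needed at v when the play continues along (v, u, w)
            need = [max(0, f[u] - w) for u, w in edges[v]]
            new = min(min(need) if owner[v] == 0 else max(need), INF)
            if new > f[v]:  # measures only ever increase -> termination
                f[v] = new
                changed = True
    return {v: (c if c <= cap else None) for v, c in f.items()}

# a -> b costs 1 unit of energy, b -> a regains 2: credit 1 at a, 0 at b
V = ["a", "b"]
E = {"a": [("b", -1)], "b": [("a", 2)]}
credits = solve_energy_game(V, E, {"a": 0, "b": 0}, cap=2)
```

The per-vertex updates are independent within a sweep, which is exactly the structure a data-parallel (GPU) formulation can exploit.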
Submitted 10 October, 2017;
originally announced October 2017.
-
Creative Robot Dance with Variational Encoder
Authors:
Agnese Augello,
Emanuele Cipolla,
Ignazio Infantino,
Adriano Manfre,
Giovanni Pilato,
Filippo Vella
Abstract:
What we appreciate in dance is the ability of people to spontaneously improvise new movements and choreographies, surrendering to the music rhythm, being inspired by the current perceptions and sensations and by previous experiences, deeply stored in their memory. Like other human abilities, this is, of course, challenging to reproduce in an artificial entity such as a robot. Recent generations of anthropomorphic robots, the so-called humanoids, however, exhibit more and more sophisticated skills and have raised the interest of robotics communities in designing and experimenting with systems devoted to automatic dance generation. In this work, we highlight the importance of modelling a computational creativity behaviour in dancing robots to avoid a mere execution of preprogrammed dances. In particular, we exploit a deep learning approach that allows a robot to generate new dancing movements in real time according to the music it listens to.
Submitted 5 July, 2017;
originally announced July 2017.
-
Scaling betweenness centrality using communication-efficient sparse matrix multiplication
Authors:
Edgar Solomonik,
Maciej Besta,
Flavio Vella,
Torsten Hoefler
Abstract:
Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths leading through it. We propose Maximal Frontier Betweenness Centrality (MFBC): a succinct BC algorithm based on novel sparse matrix multiplication routines that performs a factor of $p^{1/3}$ less communication on $p$ processors than the best known alternatives, for graphs with $n$ vertices and average degree $k=n/p^{2/3}$. We formulate, implement, and prove the correctness of MFBC for weighted graphs by leveraging monoids instead of semirings, which enables a surprisingly succinct formulation. MFBC scales well for both extremely sparse and relatively dense graphs. It automatically searches a space of distributed data decompositions and sparse matrix multiplication algorithms for the most advantageous configuration. The MFBC implementation outperforms the well-known CombBLAS library by up to 8x and shows more robust performance. Our design methodology is readily extensible to other graph problems.
Submitted 9 August, 2017; v1 submitted 22 September, 2016;
originally announced September 2016.
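The algebraic formulation this abstract builds on can be illustrated with a toy: a single-source shortest-path sweep written as repeated sparse matrix-vector products over the (min, +) algebra. This semiring-style sketch is only a conceptual stand-in; it is not MFBC's monoid formulation, nor its communication-avoiding decomposition, and the dictionary-of-dictionaries matrix format is an assumption for readability.

```python
INF = float("inf")

def minplus_spmv(A, x):
    # y[i] = min over j of (A[i][j] + x[j]); A[i][j] is the weight of edge j -> i,
    # stored sparsely as {row: {col: weight}}.
    y = [INF] * len(x)
    for i, row in A.items():
        for j, w in row.items():
            if x[j] + w < y[i]:
                y[i] = x[j] + w
    return y

def sssp(A, n, src):
    # Bellman-Ford as n-1 rounds of (min, +) matrix-vector relaxation.
    d = [INF] * n
    d[src] = 0.0
    for _ in range(n - 1):
        relaxed = minplus_spmv(A, d)
        d = [min(a, b) for a, b in zip(d, relaxed)]
    return d
```

On a 3-cycle with edges 0→1 (weight 2), 1→2 (weight 3), and 2→0 (weight 7), the sweep from vertex 0 yields distances [0, 2, 5].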
-
Algorithms and Heuristics for Scalable Betweenness Centrality Computation on Multi-GPU Systems
Authors:
Flavio Vella,
Giancarlo Carbone,
Massimo Bernaschi
Abstract:
Betweenness Centrality (BC) is steadily growing in popularity as a metric of the influence of a vertex in a graph. The BC score of a vertex is proportional to the number of all-pairs shortest paths passing through it. However, complete and exact BC computation for a large-scale graph is an extraordinary challenge that requires high-performance computing techniques to provide results in a reasonable amount of time. Our approach combines bi-dimensional (2-D) decomposition of the graph and multi-level parallelism with a suitable data-thread mapping that overcomes most of the difficulties caused by the irregularity of the computation on GPUs. Furthermore, we propose novel heuristics that exploit the topology information of the graph to reduce the time and space requirements of BC computation. Experimental results on synthetic and real-world graphs show that the proposed techniques allow the BC computation of graphs that are too large to fit in the memory of a single computational node, along with a significant reduction in computing time.
Submitted 2 February, 2016;
originally announced February 2016.
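For reference, the computation that multi-GPU BC implementations parallelize is Brandes' algorithm: a forward BFS counting shortest paths per source, then a backward sweep accumulating dependencies. The sequential sketch below shows that baseline only; the paper's 2-D decomposition, data-thread mapping, and heuristics are not represented, and the adjacency-dict encoding is an illustrative assumption.

```python
from collections import deque

def betweenness(adj):
    """Brandes betweenness centrality for an unweighted directed graph.
    adj: dict vertex -> list of out-neighbors."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward BFS: shortest-path counts (sigma) and predecessor lists.
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        # Backward sweep: accumulate pair dependencies onto predecessors.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

On the directed path 0→1→2, only vertex 1 lies on a shortest path between two other vertices, so its score is 1 and the endpoints score 0. GPU variants keep this same two-phase structure but process BFS frontiers and dependency levels in parallel.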
-
Artwork creation by a cognitive architecture integrating computational creativity and dual process approaches
Authors:
Agnese Augello,
Ignazio Infantino,
Antonio Lieto,
Giovanni Pilato,
Riccardo Rizzo,
Filippo Vella
Abstract:
The paper proposes a novel cognitive architecture (CA) for computational creativity based on the Psi model and on mechanisms inspired by dual process theories of reasoning and rationality. In recent years, many cognitive models have focused on dual process theories to better describe and implement complex cognitive skills in artificial agents, but creativity has been approached only at a descriptive level. In previous works, we have described various modules of the cognitive architecture that allows a robot to execute creative paintings. By means of dual process theories, we refine some relevant mechanisms to obtain artworks; in particular, we detail the resolution level of the CA, which deals with different strategies of access to the Long Term Memory (LTM) and manages the interaction between the S1 and S2 processes of the dual process theory. The creative process involves both divergent and convergent processes, in either an implicit or explicit manner. This leads to four activities (exploratory, reflective, tacit, and analytic) that, triggered by urges and motivations, generate creative acts. These creative acts exploit both the LTM and the Working Memory (WM) in order to make novel substitutions to a perceived image by properly mixing parts of pictures coming from different domains. The paper highlights the role of the interaction between S1 and S2 processes, modulated by the resolution level, which focuses the attention of the creative agent by broadening or narrowing the exploration of novel solutions, or even drawing the solution from a set of already-made associations. An example of an artificial painter is described in some experiments using a robotic platform.
Submitted 4 January, 2016;
originally announced January 2016.