-
Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnects
Authors:
Julian Bellavita,
Lorenzo Pichetti,
Thomas Pasquali,
Flavio Vella,
Giulia Guidi
Abstract:
The multiplication of two sparse matrices, known as SpGEMM, is a key kernel in scientific computing and large-scale data analytics, underpinning graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. The unstructured sparsity makes achieving high performance challenging because it limits both memory efficiency and scalability. In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth, resulting in unnecessary data movement and synchronization that leave the fast intra-node interconnect underexploited. To address these challenges, we introduce Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that uses communication-avoiding techniques and asynchronous communication to exploit the hierarchical and heterogeneous architecture of modern supercomputing interconnects. Central to Trident is the novel trident partitioning scheme, which enables hierarchy-aware decomposition and reduces inter-node communication by leveraging the higher bandwidth between GPUs within a node compared to across nodes. Here, we evaluate Trident on unstructured matrices, achieving up to $2.38\times$ speedup over a 2D SpGEMM with a corresponding geometric mean speedup of $1.54\times$. Trident reduces inter-node communication volume by up to $2\times$ on NERSC's Perlmutter supercomputer. Furthermore, we demonstrate the effectiveness of Trident in speeding up Markov Clustering, achieving up to $2\times$ speedup compared to competing strategies.
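The trident partitioning scheme itself is specific to the paper, but the payoff of hierarchy-aware placement can be illustrated generically. The sketch below (all function names are ours, not Trident's API) counts how many pairwise exchanges in the row phase of a 2D process grid stay inside a node under two rank-to-node placements:

```python
def comm_split(pairs, node_of):
    # Split pairwise exchanges into (intra-node, inter-node) counts
    # given a rank -> node placement.
    intra = sum(1 for a, b in pairs if node_of[a] == node_of[b])
    return intra, len(pairs) - intra

def row_pairs(rows, cols):
    # Pairwise exchanges of a row-wise phase in a 2D process grid:
    # every pair of ranks in the same grid row communicates.
    pairs = []
    for r in range(rows):
        row = [r * cols + c for c in range(cols)]
        pairs += [(a, b) for i, a in enumerate(row) for b in row[i + 1:]]
    return pairs

pairs = row_pairs(4, 4)                    # 4x4 grid, 4 GPUs per node
aligned = {p: p // 4 for p in range(16)}   # one grid row per node
scattered = {p: p % 4 for p in range(16)}  # rows striped across nodes

print(comm_split(pairs, aligned))    # (24, 0): all row traffic intra-node
print(comm_split(pairs, scattered))  # (0, 24): all row traffic inter-node
```

Aligning one grid dimension with the node boundary keeps that phase on the fast intra-node interconnect; a hierarchy-aware decomposition generalizes this idea.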
Submitted 24 March, 2026; v1 submitted 22 March, 2026;
originally announced March 2026.
-
Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization
Authors:
Francesco Pio Monaco,
Elia Cunegatti,
Flavio Vella,
Giovanni Iacca
Abstract:
Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called \emph{calibration data}) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both within and across tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce \texttt{\textbf{ZipCal}}, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive for large models and datasets, while \texttt{\textbf{ZipCal}} is on average $\sim$240$\times$ faster due to its tractable linear complexity\footnote{We make the code and the experiments available at https://github.com/FrancescoMonaco/ZipCal.}.
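As a rough, hypothetical illustration of Zipf-based curation (the scoring rule below is our guess at the flavor of the approach, not ZipCal's actual criterion): score each candidate sample by how closely its rank-frequency curve follows the ideal Zipf exponent of -1, and keep the closest ones.

```python
from collections import Counter
import math

def zipf_slope(tokens):
    # Slope of the log-log rank-frequency curve (the Zipf exponent),
    # fitted by ordinary least squares.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r + 1) for r in range(len(freqs))]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs) or 1.0
    return num / den

def select_calibration(samples, k):
    # Keep the k samples whose token statistics sit closest to the
    # ideal Zipf exponent of -1 (a proxy for lexical diversity).
    return sorted(samples, key=lambda s: abs(zipf_slope(s.split()) + 1.0))[:k]

print(select_calibration(["a a b", "x x x x"], 1))  # ['a a b']
```

Whatever the exact criterion, the cost stays linear in the token count, which is where the claimed speedup over perplexity-based selection comes from.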
Submitted 7 April, 2026; v1 submitted 17 March, 2026;
originally announced March 2026.
-
NET4EXA: Pioneering the Future of Interconnects for Supercomputing and AI
Authors:
Michele Martinelli,
Roberto Ammendola,
Andrea Biagioni,
Carlotta Chiarini,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Pier Stanislao Paolucci,
Elena Pastorelli,
Pierpaolo Perticaroli,
Luca Pontisso,
Cristian Rossi,
Francesco Simula,
Piero Vicini,
David Colin,
Grégoire Pichon,
Alexandre Louvet,
John Gliksberg,
Claire Chen,
Matteo Turisini,
Andrea Monterubbiano,
Jean-Philippe Nominé,
Denis Dutoit,
Hugo Taboada,
Lilia Zaourar
, et al. (19 additional authors not shown)
Abstract:
NET4EXA aims to develop a next-generation high-performance interconnect for HPC and AI systems, addressing the increasing demands of large-scale infrastructures, such as those required for training Large Language Models. Building upon the proven BXI (Bull eXascale Interconnect) European technology used in TOP15 supercomputers, NET4EXA will deliver the new BXI release, BXIv3, a complete hardware and software interconnect solution, including switch and network interface components. The project will integrate a fully functional pilot system at TRL 8, ready for deployment into upcoming exascale and post-exascale systems from 2025 onward. Leveraging prior research from European initiatives like RED-SEA, the previous achievements of consortium partners, and over 20 years of expertise from BULL, NET4EXA also lays the groundwork for the future generation of BXI, BXIv4, providing analysis and preliminary design. The project will use a hybrid development and co-design approach, combining commercial switch technology with custom IP and FPGA-based NICs. The performance of the NET4EXA BXIv3 interconnect will be evaluated using a broad portfolio of benchmarks, scientific scalable applications, and AI workloads.
Submitted 27 January, 2026;
originally announced January 2026.
-
Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs
Authors:
Julian Bellavita,
Matthew Rubino,
Nakul Iyer,
Andrew Chang,
Aditya Devarakonda,
Flavio Vella,
Giulia Guidi
Abstract:
Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work has accelerated Kernel K-means by formulating it using sparse linear algebra primitives and implementing it on a single GPU. However, that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory.
In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems. Our approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives uniquely tailored for Kernel K-means, enabling highly scalable implementations that efficiently cluster million-scale datasets. Central to our work is the design of partitioning schemes that enable communication-efficient composition of the linear algebra primitives that appear in Kernel K-means.
Our 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical. On 256 GPUs, it achieves a geometric mean weak scaling efficiency of $79.7\%$ and a geometric mean strong scaling speedup of $4.2\times$. Compared to our 1D algorithm, the 1.5D approach achieves up to a $3.6\times$ speedup on 256 GPUs and reduces clustering time from over an hour to under two seconds relative to a single-GPU sliding window implementation. Our results show that distributed algorithms designed with application-specific linear algebraic formulations can achieve substantial performance improvement.
Submitted 27 January, 2026; v1 submitted 23 January, 2026;
originally announced January 2026.
-
Architecture, Simulation and Software Stack to Support Post-CMOS Accelerators: The ARCHYTAS Project
Authors:
Giovanni Agosta,
Stefano Cherubin,
Derek Christ,
Francesco Conti,
Asbjørn Djupdal,
Matthias Jung,
Georgios Keramidas,
Roberto Passerone,
Paolo Rech,
Elisa Ricci,
Philippe Velha,
Flavio Vella,
Kasim Sinan Yildirim,
Nils Wilbert
Abstract:
ARCHYTAS aims to design and evaluate non-conventional hardware accelerators, in particular, optoelectronic, volatile and non-volatile processing-in-memory, and neuromorphic, to tackle the power, efficiency, and scalability bottlenecks of AI with an emphasis on defense use cases (e.g., autonomous vehicles, surveillance drones, maritime and space platforms). In this paper, we present the system architecture and software stack that ARCHYTAS will develop to integrate and support those accelerators, as well as the simulation software needed for early prototyping of the full system and its components.
Submitted 18 October, 2025;
originally announced October 2025.
-
Dual Natural Gradient Descent for Scalable Training of Physics-Informed Neural Networks
Authors:
Anas Jnini,
Flavio Vella
Abstract:
Natural-gradient methods markedly accelerate the training of Physics-Informed Neural Networks (PINNs), yet their Gauss--Newton update must be solved in the parameter space, incurring a prohibitive $O(n^3)$ time complexity, where $n$ is the number of network trainable weights. We show that exactly the same step can instead be formulated in a generally smaller residual space of size $m = \sum_\gamma N_\gamma d_\gamma$, where each residual class $\gamma$ (e.g. PDE interior, boundary, initial data) contributes $N_\gamma$ collocation points of output dimension $d_\gamma$.
Building on this insight, we introduce \textit{Dual Natural Gradient Descent} (D-NGD). D-NGD computes the Gauss--Newton step in residual space, augments it with a geodesic-acceleration correction at negligible extra cost, and provides both a dense direct solver for modest $m$ and a Nystrom-preconditioned conjugate-gradient solver for larger $m$.
Experimentally, D-NGD scales second-order PINN optimization to networks with up to 12.8 million parameters, delivers final $L^2$ errors one to three orders of magnitude lower than first-order methods (Adam, SGD) and quasi-Newton methods, and -- crucially -- enables natural-gradient training of PINNs at this scale on a single GPU.
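The core identity behind the dual step can be checked on a dense toy problem (a sketch of the idea only; the paper's solvers add geodesic acceleration and Nystrom-preconditioned CG): for a damped Gauss--Newton system, $(J^\top J + cI)^{-1} J^\top r = J^\top (J J^\top + cI)^{-1} r$, so the $n \times n$ parameter-space solve can be replaced by an $m \times m$ residual-space solve.

```python
import numpy as np

def gn_step_primal(J, r, damp=1e-3):
    # Parameter-space Gauss-Newton: solve the n x n system
    # (J^T J + damp*I) d = J^T r -- O(n^3) in the network size.
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + damp * np.eye(n), J.T @ r)

def gn_step_dual(J, r, damp=1e-3):
    # Residual-space (dual) step: solve the m x m system
    # (J J^T + damp*I) lam = r, then lift back with d = J^T lam.
    m = J.shape[0]
    lam = np.linalg.solve(J @ J.T + damp * np.eye(m), r)
    return J.T @ lam

# The two steps coincide: (J^T J + cI)^{-1} J^T = J^T (J J^T + cI)^{-1}.
rng = np.random.default_rng(0)
J, r = rng.standard_normal((5, 8)), rng.standard_normal(5)
print(np.allclose(gn_step_primal(J, r), gn_step_dual(J, r)))  # True
```

When the residual count $m$ is much smaller than the parameter count $n$, the dual solve is the cheaper of the two, which is the regime the abstract describes.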
Submitted 8 October, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
Physics-constrained DeepONet for Surrogate CFD models: a curved backward-facing step case
Authors:
Anas Jnini,
Harshinee Goordoyal,
Sujal Dave,
Flavio Vella,
Katharine H. Fraser,
Artem Korobenko
Abstract:
The Physics-Constrained DeepONet (PC-DeepONet), an architecture that incorporates fundamental physics knowledge into the data-driven DeepONet model, is presented in this study. This methodology is exemplified through surrogate modeling of fluid dynamics over a curved backward-facing step, a benchmark problem in computational fluid dynamics. The model was trained on computational fluid dynamics data generated for a range of parameterized geometries. The PC-DeepONet was able to learn the mapping from the parameters describing the geometry to the velocity and pressure fields. While the DeepONet is solely data-driven, the PC-DeepONet imposes the divergence constraint from the continuity equation onto the network. The PC-DeepONet demonstrates higher accuracy than the data-driven baseline, especially when trained on sparse data. Both models converge with a small dataset of 50 samples in only 50 iterations, highlighting the efficiency of neural operators in learning the dynamics governed by partial differential equations.
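One common way to impose the continuity constraint as a soft penalty (a generic sketch; the paper's exact discretization and loss formulation may differ) is to add the mean squared divergence of the predicted velocity field to the training loss:

```python
import numpy as np

def divergence_penalty(u, v, h):
    # Mean squared discrete divergence du/dx + dv/dy of a 2D velocity
    # field sampled on a uniform grid with spacing h (forward differences).
    du_dx = (u[1:, :-1] - u[:-1, :-1]) / h
    dv_dy = (v[:-1, 1:] - v[:-1, :-1]) / h
    return float(np.mean((du_dx + dv_dy) ** 2))

# An incompressible field, e.g. u = x, v = -y, incurs zero penalty.
h = 1.0
X, Y = np.meshgrid(np.arange(5.0), np.arange(5.0), indexing="ij")
print(divergence_penalty(X, -Y, h))  # 0.0
```

A divergence-free prediction is left untouched by the penalty, while any violation of mass conservation is pushed down during training.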
Submitted 14 March, 2025;
originally announced March 2025.
-
Riemann Tensor Neural Networks: Learning Conservative Systems with Physics-Constrained Networks
Authors:
Anas Jnini,
Lorenzo Breschi,
Flavio Vella
Abstract:
Divergence-free symmetric tensors (DFSTs) are fundamental in continuum mechanics, encoding conservation laws such as mass and momentum conservation. We introduce Riemann Tensor Neural Networks (RTNNs), a novel neural architecture that inherently satisfies the DFST condition to machine precision, providing a strong inductive bias for enforcing these conservation laws. We prove that RTNNs can approximate any sufficiently smooth DFST with arbitrary precision and demonstrate their effectiveness as surrogates for conservative PDEs, achieving improved accuracy across benchmarks. This work is the first to use DFSTs as an inductive bias in neural PDE surrogates and to explicitly enforce the conservation of both mass and momentum within a physics-constrained neural architecture.
Submitted 16 June, 2025; v1 submitted 2 March, 2025;
originally announced March 2025.
-
Popcorn: Accelerating Kernel K-means on GPUs through Sparse Linear Algebra
Authors:
Julian Bellavita,
Thomas Pasquali,
Laura Del Rio Martin,
Flavio Vella,
Giulia Guidi
Abstract:
K-means is a popular clustering algorithm with significant applications in numerous scientific and engineering areas. One drawback of K-means is its inability to identify non-linearly separable clusters, which may lead to inaccurate solutions in certain cases. Kernel K-means is a variant of classical K-means that can find non-linearly separable clusters. However, it scales quadratically with respect to the size of the dataset, taking several minutes to cluster even medium-sized datasets on traditional CPU-based machines. In this paper, we present a formulation of Kernel K-means using sparse-dense matrix multiplication (SpMM) and sparse matrix-vector multiplication (SpMV), and we show that our formulation enables the rapid implementation of a fast GPU-based version of Kernel K-means with little programming effort. Our implementation, named Popcorn, is the first open-source GPU-based implementation of Kernel K-means. Popcorn achieves a speedup of up to 123.8x over a CPU implementation of Kernel K-means and a speedup of up to 2.6x over a GPU implementation of Kernel K-means that does not use sparse matrix computations. Our results support the effectiveness of sparse matrices as tools for efficient parallel programming.
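The linear-algebraic formulation can be sketched in a few lines (dense NumPy here for clarity; Popcorn stores the membership matrix sparsely so that `K @ W` becomes an SpMM): the squared kernel distance from point $i$ to centroid $c$ is $K_{ii} - 2(KW)_{ic} + (W^\top K W)_{cc}$, where column $c$ of $W$ is $1/|C_c|$ on the members of cluster $c$.

```python
import numpy as np

def kernel_kmeans_step(K, labels, k):
    # One assignment update of Kernel K-means as pure matrix algebra.
    # W is the n x k normalized membership matrix (sparse in Popcorn,
    # dense here; no empty-cluster handling in this sketch).
    n = K.shape[0]
    W = np.zeros((n, k))
    for c in range(k):
        members = np.where(labels == c)[0]
        W[members, c] = 1.0 / len(members)
    KW = K @ W                 # the SpMM in the GPU implementation
    quad = np.diag(W.T @ KW)   # per-cluster quadratic terms
    # dist^2(i, c) = K_ii - 2 (KW)_{ic} + (W^T K W)_{cc}
    d2 = np.diag(K)[:, None] - 2 * KW + quad[None, :]
    return d2.argmin(axis=1)

# Linear kernel on two 1-D blobs: one step recovers the right clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
K = X @ X.T
print(kernel_kmeans_step(K, np.array([0, 1, 0, 1]), 2))  # [0 0 1 1]
```

Because the update is just a matrix product plus a diagonal extraction, it maps directly onto vendor SpMM/SpMV kernels, which is the observation the paper builds on.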
Submitted 9 January, 2025;
originally announced January 2025.
-
The GECo algorithm for Graph Neural Networks Explanation
Authors:
Salvatore Calderaro,
Domenico Amato,
Giosuè Lo Bosco,
Riccardo Rizzo,
Filippo Vella
Abstract:
Graph Neural Networks (GNNs) are powerful models that can manage complex data sources and their interconnection links. One of GNNs' main drawbacks is their lack of interpretability, which limits their application in sensitive fields. In this paper, we introduce a new methodology involving graph communities to address the interpretability of graph classification problems. The proposed method, called GECo, exploits the idea that if a community is a densely connected subset of graph nodes, this property should play a role in graph classification. This is reasonable, especially if we consider message passing, the basic mechanism of GNNs. GECo analyzes the contribution to the classification result of the communities in the graph, building a mask that highlights graph-relevant structures. GECo is tested for Graph Convolutional Networks on six artificial and four real-world graph datasets and is compared to the main explainability methods such as PGMExplainer, PGExplainer, GNNExplainer, and SubgraphX using four different metrics. GECo outperforms the other methods on the artificial graph datasets and on most real-world datasets.
Submitted 18 November, 2024;
originally announced November 2024.
-
The Landscape of GPU-Centric Communication
Authors:
Didem Unat,
Ilyas Turimbetov,
Mohammed Kefah Taha Issa,
Doğan Sağbili,
Flavio Vella,
Daniele De Sensi,
Ismayil Ismayilov
Abstract:
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches in multi-GPU communication and computation.
This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library supports. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. Then, it explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers insights on how to exploit multi-GPU systems at their best.
Submitted 22 February, 2026; v1 submitted 15 September, 2024;
originally announced September 2024.
-
Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
Authors:
Daniele De Sensi,
Lorenzo Pichetti,
Flavio Vella,
Tiziano De Matteis,
Zebin Ren,
Luigi Fusco,
Matteo Turisini,
Daniele Cesarini,
Kurt Lust,
Animesh Trivedi,
Duncan Roweth,
Filippo Spiga,
Salvatore Di Girolamo,
Torsten Hoefler
Abstract:
Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing their limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many optimization opportunities, ranging from the network layer to the software stack.
Submitted 15 November, 2024; v1 submitted 26 August, 2024;
originally announced August 2024.
-
High Performance Unstructured SpMM Computation Using Tensor Cores
Authors:
Patrik Okanovic,
Grzegorz Kwasniewski,
Paolo Sylos Labini,
Maciej Besta,
Flavio Vella,
Torsten Hoefler
Abstract:
High-performance sparse matrix-matrix (SpMM) multiplication is paramount for science and industry, as the ever-increasing sizes of data prohibit using dense data structures. Yet, existing hardware, such as Tensor Cores (TC), is ill-suited for SpMM, as it imposes strict constraints on data structures that cannot be met by unstructured sparsity found in many applications. To address this, we introduce (S)parse (Ma)trix Matrix (T)ensor Core-accelerated (SMaT): a novel SpMM library that utilizes TCs for unstructured sparse matrices. Our block-sparse library leverages the low-level CUDA MMA (matrix-matrix-accumulate) API, maximizing the performance offered by modern GPUs. Algorithmic optimizations such as sparse matrix permutation further improve performance by minimizing the number of non-zero blocks. The evaluation on NVIDIA A100 shows that SMaT outperforms SotA libraries (DASP, cuSPARSE, and Magicube) by up to 125x (on average 2.6x). SMaT can be used to accelerate many workloads in scientific computing, large-model training, inference, and others.
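Why row permutation helps can be seen with a toy block counter (our own sketch, not SMaT's algorithm; SMaT chooses the permutation automatically while this example picks one by hand): every $b \times b$ tile with at least one nonzero costs a full dense MMA on Tensor Cores, so gathering nonzero rows together shrinks the bill.

```python
import numpy as np

def nnz_blocks(A, b):
    # Count b x b tiles containing at least one nonzero; each such tile
    # is processed as a dense MMA fragment on Tensor Cores.
    count = 0
    for i in range(0, A.shape[0], b):
        for j in range(0, A.shape[1], b):
            if np.any(A[i:i + b, j:j + b]):
                count += 1
    return count

A = np.zeros((4, 4))
A[0, 0] = A[2, 0] = 1.0          # nonzeros in rows 0 and 2
print(nnz_blocks(A, 2))          # 2 tiles touched
perm = [0, 2, 1, 3]              # hand-picked row permutation
print(nnz_blocks(A[perm], 2))    # 1 tile: same nonzeros, half the MMAs
```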
Submitted 21 August, 2024;
originally announced August 2024.
-
Dependable Classical-Quantum Computer Systems Engineering
Authors:
Edoardo Giusto,
Santiago Nuñez-Corrales,
Phuong Cao,
Alessandro Cilardo,
Ravishankar K. Iyer,
Weiwen Jiang,
Paolo Rech,
Flavio Vella,
Bartolomeo Montrucchio,
Samudra Dasgupta,
Travis S. Humble
Abstract:
Quantum Computing (QC) offers the potential to enhance traditional High-Performance Computing (HPC) workloads by leveraging the unique properties of quantum computers, leading to the emergence of a new paradigm: HPC-QC. While this integration presents new opportunities, it also brings novel challenges, particularly in ensuring the dependability of such hybrid systems. This paper aims to identify integration challenges, anticipate failures, and foster a diverse co-design for HPC-QC systems by bringing together QC, cloud computing, HPC, and network security. The focus of this emerging inter-disciplinary effort is to develop engineering principles that ensure the dependability of hybrid systems, aiming for a more prescriptive co-design cycle. Our framework will help to prevent design pitfalls and accelerate the maturation of the QC technology ecosystem. Key aspects include building resilient HPC-QC systems, analyzing the applicability of conventional techniques to the quantum domain, and exploring the complexity of scaling in such hybrid systems. This underscores the need for performance-reliability metrics specific to this new computational paradigm.
Submitted 19 August, 2024;
originally announced August 2024.
-
cuVegas: Accelerate Multidimensional Monte Carlo Integration through a Parallelized CUDA-based Implementation of the VEGAS Enhanced Algorithm
Authors:
Emiliano Tolotti,
Anas Jnini,
Flavio Vella,
Roberto Passerone
Abstract:
This paper introduces cuVegas, a CUDA-based implementation of the Vegas Enhanced Algorithm (VEGAS+), optimized for multi-dimensional integration in GPU environments. The VEGAS+ algorithm is an advanced form of Monte Carlo integration, recognized for its adaptability and effectiveness in handling complex, high-dimensional integrands. It employs a combination of variance reduction techniques, namely adaptive importance sampling and a variant of adaptive stratified sampling, that make it particularly adept at managing integrands with multiple peaks or those aligned with the diagonals of the integration volume. Being a Monte Carlo integration method, the task is well suited for parallelization and for GPU execution. Our implementation, cuVegas, aims to harness the inherent parallelism of GPUs, addressing the challenge of workload distribution that often hampers efficiency in standard implementations. We present a comprehensive analysis comparing cuVegas with existing CPU and GPU implementations, demonstrating significant performance improvements: two to three orders of magnitude over CPU implementations, and a factor of two over the best existing GPU implementation. We also demonstrate the speedup for integrands for which VEGAS+ was designed, with multiple peaks or other significant structures aligned with diagonals of the integration volume.
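The adaptive importance sampling at the heart of VEGAS can be sketched in one dimension (a minimal sketch of the idea only; the real VEGAS+ adds the stratified-sampling component and damped grid updates, and cuVegas parallelizes all of it on the GPU):

```python
import random

def vegas_1d(f, n_iters=5, n_bins=32, n_samples=2000, seed=0):
    # Minimal 1-D flavor of VEGAS-style adaptive importance sampling
    # on [0, 1]: bins adapt so each carries roughly equal |f| mass,
    # and the integral is estimated as the mean of f(x)/p(x).
    rng = random.Random(seed)
    edges = [i / n_bins for i in range(n_bins + 1)]
    estimate = 0.0
    for _ in range(n_iters):
        weights = [0.0] * n_bins
        total = 0.0
        for _ in range(n_samples):
            b = rng.randrange(n_bins)             # pick a bin uniformly
            lo, hi = edges[b], edges[b + 1]
            x = lo + rng.random() * (hi - lo)     # uniform inside the bin
            p = 1.0 / (n_bins * (hi - lo))        # sampling density at x
            fx = f(x)
            total += fx / p                       # importance-sampling sum
            weights[b] += abs(fx) * (hi - lo)     # |f| mass seen per bin
        estimate = total / n_samples
        if sum(weights) == 0.0:
            continue
        # Redraw bin edges so each new bin carries equal |f| mass,
        # concentrating future samples where the integrand is large.
        target = sum(weights) / n_bins
        new_edges, acc, b = [0.0], 0.0, 0
        for cut in range(1, n_bins):
            goal = cut * target
            while acc + weights[b] < goal:        # consume whole bins
                acc += weights[b]
                b += 1
            frac = (goal - acc) / weights[b] if weights[b] > 0 else 0.5
            new_edges.append(edges[b] + frac * (edges[b + 1] - edges[b]))
        new_edges.append(1.0)
        edges = new_edges
    return estimate

print(vegas_1d(lambda x: 3 * x * x))  # close to the exact integral, 1.0
```

Every sample in the inner loop is independent, which is why the method parallelizes so naturally across GPU threads once the per-bin accumulation is handled carefully.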
Submitted 17 August, 2024;
originally announced August 2024.
-
Leveraging Large Language Models for enhanced personalised user experience in Smart Homes
Authors:
Jordan Rey-Jouanchicot,
André Bottaro,
Eric Campo,
Jean-Léon Bouraoui,
Nadine Vigouroux,
Frédéric Vella
Abstract:
Smart home automation systems aim to improve the comfort and convenience of users in their living environment. However, adapting automation to user needs remains a challenge. Indeed, many systems still rely on hand-crafted routines for each smart object. This paper presents an original smart home architecture leveraging Large Language Models (LLMs) and user preferences to push the boundaries of personalisation and intuitiveness in the home environment. This article explores a human-centred approach that uses the general knowledge provided by LLMs to learn and facilitate interactions with the environment. The advantages of the proposed model are demonstrated on a set of scenarios, as well as a comparative analysis with various LLM implementations. Some metrics are assessed to determine the system's ability to maintain comfort, safety, and user preferences. The paper details the approach to real-world implementation and evaluation. The proposed approach of using preferences yields up to a 52.3% increase in average grade, with average processing time reduced by 35.6% on the Starling 7B Alpha LLM. In addition, performance is 26.4% better than the results of larger models without preferences, with processing time almost 20 times faster.
Submitted 28 June, 2024;
originally announced July 2024.
-
On the Efficacy of Surface Codes in Compensating for Radiation Events in Superconducting Devices
Authors:
Marzio Vallero,
Gioele Casagranda,
Flavio Vella,
Paolo Rech
Abstract:
Reliability is fundamental for developing large-scale quantum computers. Since the benefit of technological advancements to the qubit's stability is saturating, algorithmic solutions, such as quantum error correction (QEC) codes, are needed to bridge the gap to reliable computation. Unfortunately, the deployment of the first quantum computers has identified faults induced by natural radiation as an additional threat to qubit reliability. The high sensitivity of qubits to radiation hinders the large-scale adoption of quantum computers, since the persistence and area-of-effect of the fault can potentially undermine the efficacy of even the most advanced QEC codes.
In this paper, we investigate the resilience of various implementations of state-of-the-art QEC codes to radiation-induced faults. We report data from over 400 million fault injections and correlate hardware faults with the logical error observed after decoding the code output, extrapolating physical-to-logical error rates. We compare the code's radiation-induced logical error rate over the code distance, the number and role in the QEC of physical qubits, the underlying quantum computer topology, and particle energy spread in the chip. We show that, by simply selecting and tuning properly the surface code, thus without introducing any overhead, the probability of correcting a radiation-induced fault is increased by up to 10\%. Finally, we provide indications and guidelines for the design of future QEC codes to further increase their effectiveness against radiation-induced events.
Submitted 15 July, 2024;
originally announced July 2024.
-
State of practice: evaluating GPU performance of state vector and tensor network methods
Authors:
Marzio Vallero,
Flavio Vella,
Paolo Rech
Abstract:
The frontier of quantum computing (QC) simulation on classical hardware is quickly reaching the hard scalability limits of computational feasibility. Nonetheless, there is still a need to simulate large quantum systems classically, as Noisy Intermediate Scale Quantum (NISQ) devices are not yet considered fault tolerant or performant enough in terms of operations per second. Each of the two main exact simulation techniques, state vector and tensor network simulators, has specific limitations. The exponential memory requirement of state vector simulation, when compared to the qubit register sizes of currently available quantum computers, quickly saturates the capacity of the top HPC machines currently available. Tensor network contraction approaches, which encode quantum circuits into tensor networks and then contract them over an output bit string to obtain its probability amplitude, are still hampered by the inherent complexity of finding an optimal contraction path, which maps to a max-cut problem on a dense mesh, a notoriously NP-hard problem.
This article investigates the limits of current state-of-the-art simulation techniques on a test bench made of eight widely used quantum subroutines, each in 31 different configurations, with special emphasis on performance. We then correlate the performance measures of the simulators with the metrics that characterise the benchmark circuits, identifying the main reasons behind the observed performance trends. From our observations, given the structure of a quantum circuit and the number of qubits, we highlight how to select the best simulation strategy, obtaining a speedup of up to an order of magnitude.
Submitted 3 February, 2025; v1 submitted 11 January, 2024;
originally announced January 2024.
-
HandiMathKey-Device
Authors:
Frédéric Vella,
Nathalie Dubus,
Eloise Grolleau,
Marjorie Deleau,
Cécile Malet,
Christine Gallard,
Véronique Ades,
Nadine Vigouroux
Abstract:
Typing mathematics with standard text editor functions is sometimes difficult for students with motor impairments and other associated impairments (visual, cognitive). Based on the HandiMathKey software keyboard, a user-centred design method involving the ecosystem of disabled students was applied to design the HMK-D physical keyboard for mathematical input. We opted for the Stream Deck device because of its multimedia features and its appeal to young students. Preliminary tests with 8 students (5 in secondary school and 3 in high school) show that HMK-D is highly accepted, accessible and fun for mathematical input by students with impairments. A longitudinal study of the usability and acceptability of HMK-D is planned for the 2023-2024 school year.
Submitted 23 November, 2023;
originally announced November 2023.
-
A first step towards an ecosystem meta-model for human-centered design in case of disabled users
Authors:
Christophe Kolski,
Nadine Vigouroux,
Yohan Guerrier,
Frédéric Vella,
Marine Guffroy
Abstract:
The involvement of the ecosystem or social environment of the disabled user is considered very useful, and even essential, for the human-centered design of assistive technologies. In the era of model-based approaches, the modeling of this ecosystem must therefore be considered. A first version of an ecosystem metamodel is proposed. It is illustrated through a first case study, which concerns a project aiming at a communication aid for people with cerebral palsy. A conclusion and research perspectives end this paper.
Submitted 23 November, 2023;
originally announced November 2023.
-
Design Recommendations Based on Speech Analysis for Disability-Friendly Interfaces for the Control of a Home Automation Environment
Authors:
Nadine Vigouroux,
Frédéric Vella,
Gaëlle Lepage,
Éric Campo
Abstract:
The objective of this paper is to describe a study of the speech interaction mode for home automation control of equipment by impaired people in inclusive housing. The study is related to the HIP HOPE project, concerning a building of 19 inclusive housing units. 7 participants with different types of disabilities were invited to carry out use cases using voice and touch control. Only the results obtained for the voice interaction mode, through the Amazon voice assistant, are reported here. The results show, according to the type of handicap, the success rates of speech recognition for the commands issued to the equipment, and highlight the errors related to formulation, the noisy environment, speech intelligibility, speech segmentation and poor synchronization of the audio channel opening.
Submitted 22 November, 2023;
originally announced November 2023.
-
Scaling Expected Force: Efficient Identification of Key Nodes in Network-based Epidemic Models
Authors:
Paolo Sylos Labini,
Andrej Jurco,
Matteo Ceccarello,
Stefano Guarino,
Enrico Mastrostefano,
Flavio Vella
Abstract:
Centrality measures are fundamental tools of network analysis as they highlight the key actors within the network. This study focuses on a newly proposed centrality measure, Expected Force (EF), and its use in identifying spreaders in network-based epidemic models. We found that EF effectively predicts the spreading power of nodes and identifies key nodes and immunization targets. However, its high computational cost presents a challenge for its use in large networks. To overcome this limitation, we propose two parallel scalable algorithms for computing EF scores: the first algorithm is based on the original formulation, while the second one focuses on a cluster-centric approach to improve efficiency and scalability. Our implementations significantly reduce computation time, allowing for the detection of key nodes at large scales. Performance analysis on synthetic and real-world networks demonstrates that the GPU implementation of our algorithm can efficiently scale to networks with up to 44 million edges by exploiting modern parallel architectures, achieving speed-ups of up to 300x, and 50x on average, compared to the simple parallel solution.
Submitted 1 June, 2023;
originally announced June 2023.
-
Multi-GPU aggregation-based AMG preconditioner for iterative linear solvers
Authors:
Massimo Bernaschi,
Alessandro Celestini,
Pasqua D'Ambra,
Flavio Vella
Abstract:
We present and release in open source format a sparse linear solver which efficiently exploits heterogeneous parallel computers. The solver can be easily integrated into scientific applications that need to solve large and sparse linear systems on modern parallel computers made of hybrid nodes hosting NVIDIA Graphics Processing Unit (GPU) accelerators.
The work extends our previous efforts in the exploitation of a single GPU accelerator and proposes an implementation, based on the hybrid MPI-CUDA software environment, of a Krylov-type linear solver relying on an efficient Algebraic MultiGrid (AMG) preconditioner already available in the BootCMatchG library. Our design for the hybrid implementation has been driven by the best practices for minimizing data communication overhead when multiple GPUs are employed, while preserving the efficiency of the single-GPU kernels. Strong and weak scalability results for the new version of the library on well-known benchmark test cases are discussed. Comparisons with the Nvidia AmgX solution show an improvement of up to 2.0x in the solve phase.
Submitted 4 March, 2023;
originally announced March 2023.
-
Towards a learning-based performance modeling for accelerating Deep Neural Networks
Authors:
Damiano Perri,
Paolo Sylos Labini,
Osvaldo Gervasi,
Sergio Tasso,
Flavio Vella
Abstract:
Emerging applications such as Deep Learning are often data-driven, thus traditional approaches based on auto-tuners are not performance effective across the wide range of inputs used in practice. In the present paper, we start an investigation of predictive models based on machine learning techniques in order to optimize Convolutional Neural Networks (CNNs). As a use-case, we focus on the ARM Compute Library, which provides three different implementations of the convolution operator at different numeric precisions. Starting from a collection of benchmarks, we build and validate models learned by a Decision Tree and a naive Bayesian classifier. Preliminary experiments on a Midgard-based ARM Mali GPU show that our predictive model outperforms all the convolution operators manually selected by the library.
Submitted 9 December, 2022;
originally announced December 2022.
-
User Centred Method to Design a Platform to Design Augmentative and Alternative Communication Assistive Technologies
Authors:
Frédéric Vella,
Flavien Clastres-Babou,
Nadine Vigouroux,
Philippe Truillet,
Charline Calmels,
Caroline Mercadier,
Karine Gigaud,
Margot Issanchou,
Kristina Gourinovitch,
Anne Garaix
Abstract:
We describe a co-design approach for the online WebSoKeyTo application used to design AAC aids. This co-design was carried out between a team of therapists and a team of human-computer interaction researchers. Our approach begins with the use and evaluation of the existing SoKeyTo AAC design application. This step was essential for the therapists to become aware of and define their needs, and for the researchers to understand SoKeyTo's poor usability scores. We then describe the various phases (focus group, brainstorming, prototyping) along with the co-design choices retained. An evaluation of WebSoKeyTo is in progress.
Submitted 23 November, 2022;
originally announced November 2022.
-
Participation of Stakeholder in the Design of a Conception Application of Augmentative and Alternative Communication
Authors:
Frédéric Vella,
Flavien Clastres-Babou,
Nadine Vigouroux,
Philippe Truillet,
Charline Calmels,
Caroline Mercadier,
Karine Gigaud,
Margot Issanchou,
Kristina Gourinovitch,
Anne Garaix
Abstract:
The objective of this paper is to describe the involvement of an interdisciplinary team in a user-centered design methodology to design a platform (WebSoKeyTo) that meets the needs of therapists designing augmentative and alternative communication (AAC) aids for disabled users. We describe the stages of the design process and the role of the various actors (therapists and human-computer interaction researchers) in its various phases. Finally, we analyze a satisfaction scale filled in by the therapists on their participation in the co-design process. This study demonstrates the value of extending the set of design actors to other therapists and caregivers (professional and family) involved in the daily life of people with disabilities.
Submitted 23 November, 2022;
originally announced November 2022.
-
IDEALI: intuitively localising connected devices in order to support autonomy
Authors:
Frédéric Vella,
Réjane Dalcé,
Antonio Serpa,
Thierry Val,
Adrien van Den Bossche,
Nadine Vigouroux
Abstract:
The ability to localise a smart device is very useful to visually or cognitively impaired people. Localisation-capable technologies are becoming more readily available as off-the-shelf components. In this paper, we highlight the need for such a service in the field of health and autonomy, especially for disabled people. We introduce a model for Semantic Position Description (SPD) (e.g. "The pill organiser is on the kitchen table") as well as various algorithms that transform raw distance estimations into SPDs related to proximity, alignment and room identification. Two of these algorithms are evaluated using the LocURa4IoT testbed. The results are compared to the output of a pre-experiment involving ten human participants in the Maison Intelligente de Blagnac. The two studies indicate that both approaches converge up to 90% of the time.
Submitted 23 November, 2022;
originally announced November 2022.
-
Usability Study of Tactile and Voice Interaction Modes by People with Disabilities for Home Automation Controls
Authors:
Nadine Vigouroux,
Frédéric Vella,
Gaëlle Lepage,
Eric Campo
Abstract:
This paper presents a comparative usability study of tactile and vocal interaction modes for home automation control of equipment at home for different profiles of disabled people. The study is related to the HIP HOPE project, concerning the construction of 19 inclusive housing units in the Toulouse metropolitan area in France. The experimentation took place in a living lab with 7 different disabled people who carried out realistic use cases. The USE and UEQ questionnaires were selected as usability tools. The first results show that both interfaces are easy to learn, but that the usefulness and ease-of-use dimensions need to be improved. This study shows that there is a real need for multimodality between touch and voice interaction to control the smart home. It also shows a need to adapt the interface and the environment to the person's disability.
Submitted 23 November, 2022;
originally announced November 2022.
-
ProbGraph: High-Performance and High-Accuracy Graph Mining with Probabilistic Set Representations
Authors:
Maciej Besta,
Cesare Miglioli,
Paolo Sylos Labini,
Jakub Tětek,
Patrick Iff,
Raghavendra Kanakagiri,
Saleh Ashkboos,
Kacper Janda,
Michal Podstawski,
Grzegorz Kwasniewski,
Niels Gleinig,
Flavio Vella,
Onur Mutlu,
Torsten Hoefler
Abstract:
Important graph mining problems such as Clustering are computationally demanding. To significantly accelerate these problems, we propose ProbGraph: a graph representation that enables simple and fast approximate parallel graph mining with strong theoretical guarantees on work, depth, and result accuracy. The key idea is to represent sets of vertices using probabilistic set representations such as Bloom filters. These representations are much faster to process than the original vertex sets thanks to vectorizability and small size. We use these representations as building blocks in important parallel graph mining algorithms such as Clique Counting or Clustering. When enhanced with ProbGraph, these algorithms significantly outperform tuned parallel exact baselines (up to nearly 50x on 32 cores) while ensuring accuracy of more than 90% for many input graph datasets. Our novel bounds and algorithms based on probabilistic set representations with desirable statistical properties are of separate interest for the data analytics community.
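The core idea, estimating set operations on compact Bloom filters instead of exact vertex sets, can be sketched as follows. This is a simplified illustration, not the ProbGraph implementation: the filter size `m`, the hash count `k`, and the cardinality estimator are generic textbook choices.

```python
import math

class BloomFilter:
    """Fixed-size Bloom filter over integer vertex IDs."""
    def __init__(self, m=4096, k=3):
        self.m, self.k = m, k
        self.bits = 0  # big-int used as a bitset

    def _positions(self, x):
        # k deterministic hash positions (tuple-of-int hashing is stable)
        return [hash((i, x)) % self.m for i in range(self.k)]

    def add(self, x):
        for p in self._positions(x):
            self.bits |= 1 << p

    def __contains__(self, x):
        return all(self.bits >> p & 1 for p in self._positions(x))

    def cardinality(self):
        # Standard estimator: n ~ -(m/k) * ln(1 - X/m), X = number of set bits
        x = bin(self.bits).count("1")
        return -(self.m / self.k) * math.log(1 - x / self.m)

def estimate_intersection(a, b):
    """|A & B| ~ |A| + |B| - |A | B|; the union filter is a bitwise OR."""
    union = BloomFilter(a.m, a.k)
    union.bits = a.bits | b.bits
    return a.cardinality() + b.cardinality() - union.cardinality()

# Neighbourhoods of two vertices sharing 20 common neighbours
na, nb = set(range(50)), set(range(30, 80))
fa, fb = BloomFilter(), BloomFilter()
for v in na:
    fa.add(v)
for v in nb:
    fb.add(v)
est = estimate_intersection(fa, fb)  # approximates |na & nb| = 20
```

In a triangle-counting kernel, such an estimate would stand in for the exact (and much slower) intersection of two adjacency lists, which is where the vectorizability and small size pay off.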
Submitted 21 November, 2022; v1 submitted 24 August, 2022;
originally announced August 2022.
-
Asynchronous Distributed-Memory Triangle Counting and LCC with RMA Caching
Authors:
András Strausz,
Flavio Vella,
Salvatore Di Girolamo,
Maciej Besta,
Torsten Hoefler
Abstract:
Triangle count and local clustering coefficient are two core metrics for graph analysis. They find broad application in analyses such as community detection and link recommendation. Current state-of-the-art solutions suffer from synchronization overheads or expensive pre-computations needed to distribute the graph, achieving limited scaling capabilities. We propose a fully asynchronous implementation of triangle counting and local clustering coefficient based on 1D partitioning, using remote memory accesses for transferring data and avoiding synchronization. Additionally, we show how these algorithms exhibit data reuse in remote memory accesses and how the overall communication time can be improved by caching these accesses. Finally, we extend CLaMPI, a software-layer caching system for MPI RMA, to include application-specific scores for cached entries and to influence the eviction procedure, improving caching efficiency. Our results show improvements on shared memory, and we achieve a 14x speedup from 4 to 64 nodes for the LiveJournal1 graph on distributed memory. Moreover, we demonstrate how caching remote accesses reduces total running time by up to 73% with respect to a non-cached version. Finally, we compare our implementation to TriC, the 2020 Graph Challenge champion, and achieve up to 100x faster results for scale-free graphs.
Submitted 1 March, 2022; v1 submitted 28 February, 2022;
originally announced February 2022.
-
Blocking Techniques for Sparse Matrix Multiplication on Tensor Accelerators
Authors:
Paolo Sylos Labini,
Massimo Bernaschi,
Francesco Silvestri,
Flavio Vella
Abstract:
Tensor accelerators have gained popularity because they provide a cheap and efficient solution for speeding up computationally expensive tasks in Deep Learning and, more recently, in other Scientific Computing applications. However, since their features are specifically designed for tensor algebra (typically dense matrix products), it is commonly assumed that they are not suitable for applications with sparse data. To challenge this viewpoint, we discuss methods and present solutions for accelerating sparse matrix multiplication on such architectures. In particular, we present a 1-dimensional blocking algorithm with theoretical guarantees on the density, which builds dense blocks from arbitrary sparse matrices. Experimental results show that, even for unstructured and highly sparse matrices, our block-based solution exploiting Nvidia Tensor Cores is faster than its sparse counterpart. We observed significant speed-ups of up to two orders of magnitude on real-world sparse matrices.
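The flavour of 1-dimensional blocking can be sketched as follows. This is a hypothetical greedy illustration, not the paper's algorithm or its density guarantees: rows with similar nonzero-column patterns are grouped, and each group is materialised as a dense block over the union of its columns, ready to be fed to a dense (tensor-core) matrix product.

```python
def jaccard(a, b):
    """Similarity of two nonzero-column sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def block_rows(rows, tau=0.5):
    """1-D blocking sketch: rows is {row_id: {col: val}}.  Greedily group
    rows whose column patterns have Jaccard similarity >= tau, then pack
    each group into a dense block over the union of its columns."""
    groups = []  # each entry: [row_id list, union-of-columns set]
    for r, cols in rows.items():
        pattern = set(cols)
        for g in groups:
            if jaccard(pattern, g[1]) >= tau:
                g[0].append(r)
                g[1] |= pattern
                break
        else:
            groups.append([[r], set(pattern)])
    blocks = []
    for row_ids, colset in groups:
        cols = sorted(colset)
        dense = [[rows[r].get(c, 0.0) for c in cols] for r in row_ids]
        blocks.append((row_ids, cols, dense))
    return blocks

# Two clusters of rows with matching sparsity patterns
A = {0: {0: 1.0, 1: 2.0}, 1: {0: 3.0, 1: 4.0},
     2: {7: 5.0, 8: 6.0}, 3: {7: 7.0, 8: 8.0}}
blocks = block_rows(A)
```

The resulting dense blocks trade a few explicit zeros for the regular layout that tensor units require; the threshold `tau` controls that density trade-off in this toy version.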
Submitted 11 February, 2022;
originally announced February 2022.
-
GPU-based parallelism for ASP-solving
Authors:
Agostino Dovier,
Andrea Formisano,
Flavio Vella
Abstract:
Answer Set Programming (ASP) has become the paradigm of choice in the field of logic programming and non-monotonic reasoning. Thanks to the availability of efficient solvers, ASP has been successfully employed in a large number of application domains. The term GPU-computing indicates a recent programming paradigm aimed at enabling the use of modern parallel Graphical Processing Units (GPUs) for general-purpose computing. In this paper we describe an approach to ASP-solving that exploits GPU parallelism. The design of a GPU-based solver poses various challenges due to the peculiarities of GPU software and hardware architectures and to the intrinsic nature of the satisfiability problem.
Submitted 4 September, 2019;
originally announced September 2019.
-
A Computational Model for Tensor Core Units
Authors:
Rezaul Chowdhury,
Francesco Silvestri,
Flavio Vella
Abstract:
To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model.
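The central primitive of such a model, a native multiply of fixed small size, can be mimicked in a few lines. This is a toy sketch under stated assumptions, not the paper's formal TCU model: `S`, `native_mm`, and the tiling loop are illustrative stand-ins for the hardware circuit and the algorithms built on top of it.

```python
S = 2  # native tile size of the hypothetical tensor core unit

def native_mm(a, b):
    """Stand-in for the hardware S x S multiply primitive."""
    assert len(a) <= S and len(b) <= S
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def tcu_matmul(A, B):
    """Blocked n x n matmul expressed purely through native_mm calls."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, S):
        for jj in range(0, n, S):
            for kk in range(0, n, S):
                a = [row[kk:kk + S] for row in A[ii:ii + S]]
                b = [row[jj:jj + S] for row in B[kk:kk + S]]
                c = native_mm(a, b)  # one "hardware" tile product
                for i in range(len(c)):
                    for j in range(len(c[0])):
                        C[ii + i][jj + j] += c[i][j]
    return C

A = [[float(4 * i + j) for j in range(4)] for i in range(4)]
B = [[float((i + j) % 3) for j in range(4)] for i in range(4)]
C = tcu_matmul(A, B)
```

Counting calls to `native_mm` (here (n/S)^3) is the kind of cost accounting such a model enables, analogous to counting block transfers in the external memory model.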
Submitted 9 July, 2020; v1 submitted 19 August, 2019;
originally announced August 2019.
-
A model-driven approach for a new generation of adaptive libraries
Authors:
Marco Cianfriglia,
Flavio Vella,
Cedric Nugteren,
Anton Lokhmotov,
Grigori Fursin
Abstract:
Efficient high-performance libraries often expose multiple tunable parameters to provide highly optimized routines. These can range from simple loop unroll factors or vector sizes all the way to algorithmic changes, given that some implementations can be more suitable for certain devices by exploiting hardware characteristics such as local memories and vector units. Traditionally, such parameters and algorithmic choices are tuned and then hard-coded for a specific architecture and for certain characteristics of the inputs. However, emerging applications are often data-driven, thus traditional approaches are not effective across the wide range of inputs and architectures used in practice. In this paper, we present a new adaptive framework for data-driven applications which uses a predictive model to select the optimal algorithmic parameters by training with synthetic and real datasets. We demonstrate the effectiveness of our approach on a BLAS library, specifically on its matrix multiplication routine. We present experimental results for two GPU architectures and show significant performance gains of up to 3x (on a high-end NVIDIA Pascal GPU) and 2.5x (on an embedded ARM Mali GPU) when compared to a traditionally optimized library.
Submitted 19 June, 2018;
originally announced June 2018.
-
Accelerating Energy Games Solvers on Modern Architectures
Authors:
Andrea Formisano,
Raffaella Gentilini,
Flavio Vella
Abstract:
Quantitative games, where quantitative objectives are defined on weighted game arenas, provide natural tools for designing faithful models of embedded controllers. Instances of these games that have recently gained interest are the so-called Energy Games. The fastest-known algorithm solves Energy Games in O(EVW) time, where W is the maximum weight. Starting from a sequential baseline implementation, we investigate the use of the massively parallel computation capabilities of modern Graphics Processing Units to solve the `initial credit problem' for Energy Games. We present four different parallel implementations on multi-core CPU and GPU systems. Our solution outperforms the baseline implementation by up to 36x and obtains a faster convergence time on real-world graphs.
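The classical value-iteration scheme underlying such solvers can be sketched as follows. This is a minimal sequential sketch of the standard progress-measure idea, not the paper's GPU implementation: each vertex is lifted to the minimal initial credit from which the energy player can keep the running energy non-negative, and credits above a cap (e.g. V*W) are declared infinite.

```python
def solve_energy_game(vertices, edges, owner, cap):
    """Kleene value iteration for the initial-credit problem.
    edges[v] = [(u, w), ...]: moving v -> u adds weight w to the energy.
    owner[v] == 0: the energy player moves (minimises needed credit);
    owner[v] == 1: the opponent moves (maximises it).
    Returns {v: minimal credit, or None if no finite credit suffices}."""
    INF = cap + 1
    f = {v: 0 for v in vertices}
    changed = True
    while changed:
        changed = False
        for v in vertices:
            # credit needed at v when the play continues along (v, u, w)
            need = [max(0, f[u] - w) for u, w in edges[v]]
            new = min(min(need) if owner[v] == 0 else max(need), INF)
            if new > f[v]:  # measures only ever increase -> termination
                f[v] = new
                changed = True
    return {v: (c if c <= cap else None) for v, c in f.items()}

# a -> b costs 1 unit of energy, b -> a regains 2: credit 1 at a, 0 at b
V = ["a", "b"]
E = {"a": [("b", -1)], "b": [("a", 2)]}
credits = solve_energy_game(V, E, {"a": 0, "b": 0}, cap=2)
```

The per-vertex updates are independent within a sweep, which is exactly the structure a data-parallel (GPU) formulation can exploit.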
Submitted 10 October, 2017;
originally announced October 2017.
-
Creative Robot Dance with Variational Encoder
Authors:
Agnese Augello,
Emanuele Cipolla,
Ignazio Infantino,
Adriano Manfre,
Giovanni Pilato,
Filippo Vella
Abstract:
What we appreciate in dance is the ability of people to spontaneously improvise new movements and choreographies, surrendering to the music rhythm, being inspired by the current perceptions and sensations and by previous experiences, deeply stored in their memory. Like other human abilities, this is, of course, challenging to reproduce in an artificial entity such as a robot. Recent generations of anthropomorphic robots, the so-called humanoids, however, exhibit more and more sophisticated skills and have raised the interest of robotics communities in designing and experimenting with systems devoted to automatic dance generation. In this work, we highlight the importance of modelling a computational creativity behaviour in dancing robots to avoid a mere execution of preprogrammed dances. In particular, we exploit a deep learning approach that allows a robot to generate new dancing movements in real time according to the music it listens to.
Submitted 5 July, 2017;
originally announced July 2017.
-
Scaling betweenness centrality using communication-efficient sparse matrix multiplication
Authors:
Edgar Solomonik,
Maciej Besta,
Flavio Vella,
Torsten Hoefler
Abstract:
Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths leading through it. We propose Maximal Frontier Betweenness Centrality (MFBC): a succinct BC algorithm based on novel sparse matrix multiplication routines that performs a factor of $p^{1/3}$ less communication on $p$ processors than the best known alternatives, for graphs with $n$ vertices and average degree $k=n/p^{2/3}$. We formulate, implement, and prove the correctness of MFBC for weighted graphs by leveraging monoids instead of semirings, which enables a surprisingly succinct formulation. MFBC scales well for both extremely sparse and relatively dense graphs. It automatically searches a space of distributed data decompositions and sparse matrix multiplication algorithms for the most advantageous configuration. The MFBC implementation outperforms the well-known CombBLAS library by up to 8x and shows more robust performance. Our design methodology is readily extensible to other graph problems.
Submitted 9 August, 2017; v1 submitted 22 September, 2016;
originally announced September 2016.
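The algebraic formulation this abstract builds on can be illustrated with a toy: a single-source shortest-path sweep written as repeated sparse matrix-vector products over the (min, +) algebra. This semiring-style sketch is only a conceptual stand-in; it is not MFBC's monoid formulation, nor its communication-avoiding decomposition, and the dictionary-of-dictionaries matrix format is an assumption for readability.

```python
INF = float("inf")

def minplus_spmv(A, x):
    # y[i] = min over j of (A[i][j] + x[j]); A[i][j] is the weight of edge j -> i,
    # stored sparsely as {row: {col: weight}}.
    y = [INF] * len(x)
    for i, row in A.items():
        for j, w in row.items():
            if x[j] + w < y[i]:
                y[i] = x[j] + w
    return y

def sssp(A, n, src):
    # Bellman-Ford as n-1 rounds of (min, +) matrix-vector relaxation.
    d = [INF] * n
    d[src] = 0.0
    for _ in range(n - 1):
        relaxed = minplus_spmv(A, d)
        d = [min(a, b) for a, b in zip(d, relaxed)]
    return d
```

On a 3-cycle with edges 0→1 (weight 2), 1→2 (weight 3), and 2→0 (weight 7), the sweep from vertex 0 yields distances [0, 2, 5].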
-
Algorithms and Heuristics for Scalable Betweenness Centrality Computation on Multi-GPU Systems
Authors:
Flavio Vella,
Giancarlo Carbone,
Massimo Bernaschi
Abstract:
Betweenness Centrality (BC) is steadily growing in popularity as a metric of the influence of a vertex in a graph. The BC score of a vertex is proportional to the number of all-pairs shortest paths passing through it. However, complete and exact BC computation for a large-scale graph is an extraordinary challenge that requires high-performance computing techniques to provide results in a reasonable amount of time. Our approach combines bi-dimensional (2-D) decomposition of the graph and multi-level parallelism with a suitable data-thread mapping that overcomes most of the difficulties caused by the irregularity of the computation on GPUs. Furthermore, we propose novel heuristics that exploit the topology information of the graph to reduce the time and space requirements of BC computation. Experimental results on synthetic and real-world graphs show that the proposed techniques allow the BC computation of graphs that are too large to fit in the memory of a single computational node, along with a significant reduction in computing time.
Submitted 2 February, 2016;
originally announced February 2016.
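For reference, the computation that multi-GPU BC implementations parallelize is Brandes' algorithm: a forward BFS counting shortest paths per source, then a backward sweep accumulating dependencies. The sequential sketch below shows that baseline only; the paper's 2-D decomposition, data-thread mapping, and heuristics are not represented, and the adjacency-dict encoding is an illustrative assumption.

```python
from collections import deque

def betweenness(adj):
    """Brandes betweenness centrality for an unweighted directed graph.
    adj: dict vertex -> list of out-neighbors."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward BFS: shortest-path counts (sigma) and predecessor lists.
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        # Backward sweep: accumulate pair dependencies onto predecessors.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

On the directed path 0→1→2, only vertex 1 lies on a shortest path between two other vertices, so its score is 1 and the endpoints score 0. GPU variants keep this same two-phase structure but process BFS frontiers and dependency levels in parallel.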
-
Artwork creation by a cognitive architecture integrating computational creativity and dual process approaches
Authors:
Agnese Augello,
Ignazio Infantino,
Antonio Lieto,
Giovanni Pilato,
Riccardo Rizzo,
Filippo Vella
Abstract:
The paper proposes a novel cognitive architecture (CA) for computational creativity based on the Psi model and on mechanisms inspired by dual process theories of reasoning and rationality. In recent years, many cognitive models have focused on dual process theories to better describe and implement complex cognitive skills in artificial agents, but creativity has been approached only at a descriptive level. In previous works, we have described various modules of the cognitive architecture that allows a robot to execute creative paintings. By means of dual process theories, we refine some relevant mechanisms to obtain artworks; in particular, we detail the resolution level of the CA, which deals with different strategies of access to the Long Term Memory (LTM) and manages the interaction between the S1 and S2 processes of the dual process theory. The creative process involves both divergent and convergent processes, in either an implicit or explicit manner. This leads to four activities (exploratory, reflective, tacit, and analytic) that, triggered by urges and motivations, generate creative acts. These creative acts exploit both the LTM and the Working Memory (WM) in order to make novel substitutions to a perceived image by properly mixing parts of pictures coming from different domains. The paper highlights the role of the interaction between S1 and S2 processes, modulated by the resolution level, which focuses the attention of the creative agent by broadening or narrowing the exploration of novel solutions, or even drawing the solution from a set of already-made associations. An example of an artificial painter is described in some experiments using a robotic platform.
Submitted 4 January, 2016;
originally announced January 2016.