Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2506.15114 (cs)

[Submitted on 18 Jun 2025]

Title: Parallel Data Object Creation: Towards Scalable Metadata Management in High-Performance I/O Library

Authors: Youjia Li, Robert Latham, Robert Ross, Ankit Agrawal, Alok Choudhary, Wei-Keng Liao

Abstract: High-level I/O libraries, such as HDF5 and PnetCDF, are commonly used by large-scale scientific applications to perform I/O tasks in parallel. These I/O libraries store metadata, such as data types and dimensionality, along with the raw data in the same files. While these libraries are well-optimized for concurrent access to the raw data, they are designed neither to handle a large number of data objects efficiently nor to create different data objects independently by multiple processes, as they require applications to call data object creation APIs collectively with consistent metadata among all processes. Applications that process data gathered from remote sensors, such as particle collision experiments in high-energy physics, may generate data of different sizes from different sensors and desire to store them as separate data objects. For such applications, the I/O library's requirement of collective data object creation can become very expensive, as the cost of the metadata consistency check increases with the metadata volume as well as the number of processes. To address this limitation, using PnetCDF as an experimental platform, we investigate solutions in this paper that abide by the netCDF file format, and propose a new file header format that enables independent data object creation. The proposed file header consists of two sections: an index table and a list of metadata blocks. The index table contains references to the metadata blocks, and each block stores metadata of objects that can be created collectively or independently. The new design achieves scalable performance, cutting data object creation times by up to 582x when running on 4096 MPI processes to create 5,684,800 data objects in parallel. Additionally, the new method reduces the memory footprint, with each process requiring an amount of memory space inversely proportional to the number of processes.
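The two-section header described in the abstract (an index table that references a list of metadata blocks) can be illustrated with a small sketch. This is not the paper's on-disk format or PnetCDF's API: the byte layout, the JSON encoding of each block, and the names `build_header` and `read_block` are all assumptions made purely for illustration. Each simulated process contributes its own block, and a reader can locate any one block through the index table without parsing the others.

```python
import json
import struct

# Hypothetical layout (an assumption, not the paper's actual format):
#   [block count: uint32]
#   [index table: one (offset uint64, size uint64) entry per block]
#   [metadata blocks, back to back]
COUNT = struct.Struct("<I")
INDEX_ENTRY = struct.Struct("<QQ")


def build_header(blocks):
    """Pack metadata blocks behind an index table of (offset, size) references."""
    payloads = [json.dumps(b, sort_keys=True).encode("utf-8") for b in blocks]
    # The first block starts right after the count and the index table.
    offset = COUNT.size + INDEX_ENTRY.size * len(payloads)
    index = []
    for payload in payloads:
        index.append((offset, len(payload)))
        offset += len(payload)
    out = bytearray(COUNT.pack(len(payloads)))
    for off, size in index:
        out += INDEX_ENTRY.pack(off, size)
    for payload in payloads:
        out += payload
    return bytes(out)


def read_block(header, i):
    """Read block i through the index table, without parsing other blocks."""
    (count,) = COUNT.unpack_from(header, 0)
    if not 0 <= i < count:
        raise IndexError(i)
    off, size = INDEX_ENTRY.unpack_from(header, COUNT.size + i * INDEX_ENTRY.size)
    return json.loads(header[off:off + size].decode("utf-8"))


# Each simulated process defines its own objects, with differing sizes,
# without agreeing on the other process's metadata.
rank_blocks = [
    [{"name": "sensor0_hits", "dtype": "float32", "shape": [1024]}],
    [{"name": "sensor1_hits", "dtype": "float32", "shape": [37]},
     {"name": "sensor1_time", "dtype": "int64", "shape": [37]}],
]
header = build_header(rank_blocks)
```

In this sketch only the small index table needs to be shared by all readers, while each block can be held by whichever process owns it. That is one plausible reading of how per-process memory could shrink as the process count grows, though the abstract does not spell out the mechanism.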

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2506.15114 [cs.DC]
	(or arXiv:2506.15114v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2506.15114

Submission history

From: Youjia Li
[v1] Wed, 18 Jun 2025 03:33:47 UTC (3,012 KB)
