wyx619/UltraGBIF

An ultra-fast and user-friendly R package for compiling plant occurrence records from GBIF

Introduction

Mapping plant distributions is fundamental to understanding biodiversity patterns, and accurate distribution data are necessary for research on plant diversity. The Global Biodiversity Information Facility (GBIF) is a major repository of plant occurrence records worldwide. As of August 2025, it has supported over 18,200 peer-reviewed journal articles, with ecology (3,769 articles), climate change (2,953), conservation (1,915), and invasive species management (1,840) among the leading topics, and it informs global policy frameworks such as the Kunming-Montreal Global Biodiversity Framework.

Researchers working with GBIF occurrence records usually rely on a suite of packages and scripts, such as rgbif, TNRS, CoordinateCleaner, bdc, plantR, NSR, and GVS, which together consume considerable runtime. Moreover, for datasets with millions of records, manually chaining these disparate packages incurs substantial computational overhead and often requires high-performance infrastructure, despite advances in computing capability.

To address this, we introduce UltraGBIF, an efficient R package that unifies taxonomic resolution, spatial validation, duplicate consolidation, and botanical region annotation within a high-performance framework. Its optimized C/C++ backend and intelligent parallelization make it possible to compile one million GBIF occurrence records on a laptop within 15 minutes. In short, UltraGBIF addresses challenges in reproducibility, scalability, and spatial-taxonomic integrity without raising adoption barriers for biodiversity researchers.

Workflow

Figure: Three main stages and seven modules of UltraGBIF. After all stages, typically about 35% of the initial occurrence records are retained.

UltraGBIF provides a reproducible, plant-optimized, and computationally efficient framework for transforming raw GBIF occurrence records into analysis-ready datasets. The package functions are categorized into three main stages and seven distinct modules.

Stage 1: Data Ingestion

This stage ensures data accuracy and consistency through three modules:

  1. Import Data: This module receives a user-provided Darwin Core Archive (DwC-A) that adheres to GBIF data conventions. The DwC-A is loaded locally (e.g., occurrences.csv/zip), along with any extensions described by meta.xml. GBIF-reported issue flags are automatically extracted for downstream quality assessment.

  2. Check Taxon Name: This module implements taxonomic name standardization to resolve and validate plant names. Users may select between the World Checklist of Vascular Plants (WCVP; Govaerts et al., 2021) and the Taxonomic Name Resolution Service (TNRS; Boyle et al., 2013). This step unifies synonyms and corrects misspellings.

  3. Check Collector Name: This module standardizes collector names to reduce inconsistencies (e.g., "Smith, J." versus "J. Smith") that can fragment single collection events. By preparing a standardized dictionary of primary collector surnames, this step reduces identification errors by over 80% and improves the accuracy of subsequent duplication checks.
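To make the collector-name step concrete, here is a minimal base R sketch of how variant spellings might be reduced to a shared primary surname. The helper `primary_surname()` and its parsing rules are hypothetical and written for illustration only; they are not UltraGBIF's API or its actual dictionary-building logic.

```r
# Hypothetical helper (not part of UltraGBIF): reduce a raw collector
# string to an upper-case primary surname.
primary_surname <- function(x) {
  # Keep only the first collector when several are listed
  first <- trimws(sub("[;&|].*$", "", x))
  # "Surname, Initials" form: the surname precedes the comma
  has_comma <- grepl(",", first, fixed = TRUE)
  surname <- ifelse(
    has_comma,
    sub(",.*$", "", first),
    # "Initials Surname" form: the surname is the last token
    sub("^.*[[:space:]]", "", first)
  )
  toupper(trimws(surname))
}

collectors <- c("Smith, J.", "J. Smith", "J. Smith; A. Jones")
primary_surname(collectors)
```

All three example strings collapse to the same key, which is what allows records from one collection event to be grouped in the next stage. Real collector strings are far messier (particles, diacritics, team names), so a production dictionary would need manual review.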

Stage 2: Deduplication and Reliability Filtering

This stage improves data reliability by identifying high-quality, non-redundant occurrence records.

  1. Generate Unique Collection Mark: This module identifies and consolidates duplicates into unique collection events. A collection event represents a distinct sampling instance (a specific collector at a specific time and place).

  2. Set Digital Voucher: For duplicate entries sharing a collection mark, the record with the highest metadata quality is retained as the "digital voucher." This approach preserves the most geographically informative data while minimizing redundancy, thereby improving spatial reliability.
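The two modules above can be sketched in base R as follows. This is an illustration under stated assumptions, not UltraGBIF's implementation: the collection mark is assumed to be the collector surname plus collector number, the quality score is assumed to be a simple count of non-missing key fields, and the toy values are invented.

```r
# Toy records; column names follow Darwin Core terms, values are invented
occ <- data.frame(
  gbifID           = c("1", "2", "3", "4"),
  surname          = c("SMITH", "SMITH", "JONES", "SMITH"),
  recordNumber     = c("123", "123", "45", "124"),
  decimalLatitude  = c(NA, -15.2, 3.1, NA),
  decimalLongitude = c(NA, -47.9, 101.7, NA),
  eventDate        = c("1998-05-01", "1998-05-01", NA, "1998-05-02"),
  stringsAsFactors = FALSE
)

# Assumed collection mark: primary collector surname + collector number
occ$mark <- paste(occ$surname, occ$recordNumber, sep = "_")

# Assumed quality score: number of non-missing key fields per record
key_fields <- c("decimalLatitude", "decimalLongitude", "eventDate")
occ$score <- rowSums(!is.na(occ[, key_fields]))

# Within each mark, keep the highest-scoring record as the digital voucher
occ <- occ[order(occ$mark, -occ$score), ]
vouchers <- occ[!duplicated(occ$mark), ]
vouchers[, c("gbifID", "mark", "score")]
```

Records 1 and 2 share a mark, and record 2 is kept because it carries coordinates. Sorting by score before dropping duplicates is a simple way to express "best record per group" without extra dependencies.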

Stage 3: Refine Records

The final stage restores key information, enhances geospatial accuracy, and enables visualization.

  1. Refine Records: This module validates spatial information and restores detailed metadata for usable vouchers. It performs automated coordinate validation using CoordinateCleaner (Zizka et al., 2019) to flag spatial errors (e.g., centroids, capitals, institutions). It also extracts information from WCVP to annotate records as 'native', 'introduced', or 'doubtful'. The optional occTest package (Serra-Diaz et al., 2024) can also be used here for advanced quality control checks.

  2. Map Records: An optional visualization module that renders verified records onto customizable, dynamic maps, providing an intuitive interface for viewing spatial distributions and data density.
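As a rough illustration of what coordinate validation involves, the base R sketch below flags only the most basic problems: missing values, out-of-range values, and the 0/0 point. The function `flag_basic_coords()` is hypothetical; it is not how CoordinateCleaner or UltraGBIF work internally, and tests such as centroids, capitals, and institutions need reference gazetteers that are out of scope here.

```r
# Hypothetical helper: flag basic coordinate problems in decimal degrees
flag_basic_coords <- function(lat, lon) {
  missing_xy   <- is.na(lat) | is.na(lon)
  out_of_range <- !missing_xy & (abs(lat) > 90 | abs(lon) > 180)
  zero_zero    <- !missing_xy & lat == 0 & lon == 0
  data.frame(
    missing_xy, out_of_range, zero_zero,
    # A record is usable only if no flag is raised
    usable = !(missing_xy | out_of_range | zero_zero)
  )
}

flags <- flag_basic_coords(
  lat = c(-15.2, 0, 95, NA),
  lon = c(-47.9, 0, 10, 20)
)
flags
```

Keeping one logical column per test, rather than a single pass/fail value, mirrors the flag-then-filter style common in occurrence cleaning and lets users decide which flags to act on.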

Focused exclusively on GBIF plant occurrence records, UltraGBIF can clean one million records within 15 minutes on a laptop, with a 60% reduction in memory use. In short, UltraGBIF integrates these components into a unified, automated workflow that improves data standardization, accuracy, and usability, enabling robust, reproducible, and scalable compilation of GBIF occurrence records for advanced biodiversity research.

UltraGBIF is under development. If you encounter any bugs, please feel free to submit an issue. Your feedback is greatly appreciated!

Installation

You can install UltraGBIF from GitHub. UltraGBIF depends on rWCVPdata, so install that first (we recommend rWCVPdata version 0.6.0 with WCVP version 14). The initial installation takes some time.

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes", dependencies = TRUE)
}
remotes::install_github("matildabrown/rWCVPdata", upgrade = FALSE) # install rWCVPdata first
remotes::install_github("wyx619/UltraGBIF", upgrade = FALSE)       # then install UltraGBIF

If you encounter a network error, download the rWCVPdata source archive and install it manually. This installation also takes some time.

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes", dependencies = TRUE)
}
remotes::install_local("path/to/your/rWCVPdata_0.6.0.tar.gz", upgrade = FALSE) # install rWCVPdata manually
remotes::install_github("wyx619/UltraGBIF", upgrade = FALSE)                   # then install UltraGBIF
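After either route, it can be useful to confirm that both packages are present and which versions were installed. The sketch below uses only base R; `pkg_status()` is a hypothetical helper written for this check, not a function from UltraGBIF.

```r
# Hypothetical helper: report whether a package is installed and its version
pkg_status <- function(pkg) {
  installed <- requireNamespace(pkg, quietly = TRUE)
  version <- if (installed) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
  data.frame(package = pkg, installed = installed, version = version)
}

# One row per package; version is NA when the package is missing
do.call(rbind, lapply(c("rWCVPdata", "UltraGBIF"), pkg_status))
```

If rWCVPdata is reported with a version other than the recommended 0.6.0, reinstalling it from the matching source archive is the simplest fix.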

Tutorial of UltraGBIF

Tutorial of UltraGBIF: wiki page

Tutorial of UltraGBIF: pkgdown page

References

Boyle, Brad, Nicole Hopkins, Zhenyuan Lu, Juan Antonio Raygoza Garay, Dmitry Mozzherin, Tony Rees, Naim Matasci, et al. 2013. “The Taxonomic Name Resolution Service: An Online Tool for Automated Standardization of Plant Names.” BMC Bioinformatics 14 (1): 16. https://doi.org/10.1186/1471-2105-14-16.

De Melo, Pablo Hendrigo Alves, Nadia Bystriakova, Eve Lucas, and Alexandre K. Monro. 2024. “A New R Package to Parse Plant Species Occurrence Records into Unique Collection Events Efficiently Reduces Data Redundancy.” Scientific Reports 14 (1): 5450. https://doi.org/10.1038/s41598-024-56158-3.

Govaerts, Rafaël, Eimear Nic Lughadha, Nicholas Black, Robert Turner, and Alan Paton. 2021. “The World Checklist of Vascular Plants, a Continuously Updated Resource for Exploring Global Plant Diversity.” Scientific Data 8 (1): 215. https://doi.org/10.1038/s41597-021-00997-6.

Maitner, Brian, and Brad Boyle. 2024. “TNRS: Taxonomic Name Resolution Service.” https://CRAN.R-project.org/package=TNRS.

Zizka, Alexander, Daniele Silvestro, Tobias Andermann, Josué Azevedo, Camila Duarte Ritter, Daniel Edler, Harith Farooq, et al. 2019. “CoordinateCleaner: Standardized Cleaning of Occurrence Records from Biological Collection Databases.” Edited by Tiago Quental. Methods in Ecology and Evolution 10 (5): 744–51. https://doi.org/10.1111/2041-210X.13152.