Tutorial of UltraGBIF
Yuxuan Wang
Sichuan University (wyx1743@gmail.com)
2025-11-19
Source: vignettes/Tutorial_of_UltraGBIF.Rmd
Description
This tutorial parses and merges Saxifraga occurrence records from GBIF.
Installation
You can install UltraGBIF from GitHub. UltraGBIF depends on rWCVPdata, so install it first (we recommend rWCVPdata version 0.6.0 with WCVP version 14). The initial installation takes some time.
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes", dependencies = TRUE)
}
remotes::install_github("matildabrown/rWCVPdata", upgrade = FALSE) ## install rWCVPdata
remotes::install_github("wyx619/UltraGBIF", upgrade = FALSE)       ## install UltraGBIF

If you meet any internet error, download rWCVPdata and install it manually. The initial installation also takes some time.
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes", dependencies = TRUE)
}
remotes::install_local("path/to/your/rWCVPdata_0.6.0.tar.gz", upgrade = FALSE) ## install rWCVPdata manually
remotes::install_github("wyx619/UltraGBIF", upgrade = FALSE)                   ## install UltraGBIF

Import records
First, download occurrence records from GBIF. Apply filters online, such as Basis of record = Preserved specimen, Occurrence status = Present, and Scientific name = Saxifraga. Download the DwC-A file and note its absolute path.
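If you prefer to request the download programmatically instead of through the GBIF web portal, a minimal sketch with the rgbif package (not part of UltraGBIF; it assumes your GBIF credentials are set, e.g. via the GBIF_USER, GBIF_PWD and GBIF_EMAIL environment variables) could look like this:

library(rgbif)

saxifraga_key <- name_backbone("Saxifraga")$usageKey   ## look up the GBIF taxon key

## request a Darwin Core Archive matching the filters described above
key <- occ_download(
  pred("taxonKey", saxifraga_key),
  pred("basisOfRecord", "PRESERVED_SPECIMEN"),
  pred("occurrenceStatus", "PRESENT"),
  format = "DWCA"
)
occ_download_wait(key)        ## poll until GBIF has prepared the archive
zip <- occ_download_get(key)  ## download the DwC-A; its path is what UltraGBIF needs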
This module also utilizes GBIF's EnumOccurrenceIssue values, transforming these issues into a structured format that enables systematic quality assessment through individualized issue tracking and standardized classification. See the GBIF API documentation for more details.
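To make the idea concrete, here is a minimal sketch (not the package's internal code) of how the semicolon-separated issue field of a DwC occurrence table can be split into individual flags and tabulated; occ is assumed to be a data.frame with GBIF's "issue" column:

issue_list <- strsplit(occ$issue, ";", fixed = TRUE)   ## one character vector of issues per record
issue_counts <- sort(table(unlist(issue_list)), decreasing = TRUE)
head(issue_counts)                                     ## most frequent occurrence issues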
library(UltraGBIF)

gbif_occurrence_file <- "path/to/the/DwC-A file"

occ_import <- import_records(GBIF_file = gbif_occurrence_file,
                             only_PRESERVED_SPECIMEN = TRUE)

occ_import includes occ for the initial occurrence records, occ_gbif_issue for GBIF issues and their summary, and the time cost.
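You can inspect the components of the returned object with base R, for example:

str(occ_import, max.level = 1)  ## list the top-level components of occ_import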
Check taxon name
UltraGBIF uses WCVP (Govaerts et al. 2021) as a high-authority taxonomic backbone, systematically resolving taxon names in GBIF occurrence records. Alternatively, TNRS (Boyle et al. 2013) is supported by setting the parameter local_taxon_resolution = FALSE.
To optimize computational efficiency, this module supports
parallelization through multi-threading by specifying the desired
concurrency level via the threads parameter.
names.checked <- check_occ_name(occ_import = occ_import,
                                threads = 4,
                                local_taxon_resolution = TRUE,
                                if_author_fails_try_without_combinations = TRUE)

names.checked includes occ_wcvp_check_name for records with resolved taxon names and their summary, and the time cost.
Check collectors dictionary
This module extracts the primary collector's surname from the recordedBy field in occurrence records and generates an association list linking standardized surnames to the corresponding records. It helps resolve collector identity ambiguities and enforce nomenclatural consistency. The reference CollectorDictionary used in this module comes from ParseGBIF (De Melo et al. 2024).
dictionary <- prepare_collectors_dictionary(occ_import = occ_import,
                                            surname_selection_type = "largest_string",
                                            max_words_name = 6,
                                            min_characters_in_name = 2)

dictionary includes my_dictionary for the checked collector names and ref_dictionary for the reference dictionary, and the time cost.
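As an illustration of the surname-extraction idea (a simplified sketch, not the package's implementation), the primary collector can be taken as the first name in a recordedBy string and, with surname_selection_type = "largest_string", the longest word of that name kept as the surname:

## simplified sketch: extract a surname from a recordedBy string
extract_surname <- function(recorded_by) {
  first_collector <- trimws(strsplit(recorded_by, "[|;]")[[1]][1])    ## keep the primary collector
  words <- strsplit(gsub("[.,]", " ", first_collector), "\\s+")[[1]]  ## split the name into words
  words[which.max(nchar(words))]                                      ## "largest_string": longest word
}

extract_surname("Smith, J.; Doe, R.")  ## returns "Smith"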
Generate unique collection mark
To resolve duplicate records in biodiversity datasets, UltraGBIF generates a standardized composite key for each unique collection event, as ParseGBIF (De Melo et al. 2024) does. This key is constructed by concatenating three core metadata fields:
Plant family (family)
Standardized collector surname (recordedByStandardized)
Numeric-only collection number (recordNumber_Standard)
The resulting identifier,
key_family_recordedBy_recordNumber, groups specimens
sharing identical collection events. An optional temporal variant,
key_year_recordedBy_recordNumber, incorporates collection
year where available to enhance resolution.
Standardization ensures robust matching: non-numeric characters are stripped from collection numbers, and collector names are normalized using predefined dictionaries to reconcile spelling variants.
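A minimal sketch of this key construction (illustrative only; field names follow the description above):

## illustrative sketch of the composite collection-event key
occ$recordNumber_Standard <- gsub("[^0-9]", "", occ$recordNumber)  ## strip non-numeric characters
occ$key_family_recordedBy_recordNumber <- paste(occ$family,
                                                occ$recordedByStandardized,
                                                occ$recordNumber_Standard,
                                                sep = "_")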
This module addresses two key challenges:
Disambiguating common collector surnames through taxonomic context.
Enabling automated duplicate detection without reliance on often-missing temporal metadata.
collection_key <- generate_collection_mark(occ_import = occ_import,
                                           dictionary = dictionary)

collection_key includes occ_collectorsDictionary for the input records updated with standardized fields and generated keys, CollectorsDictionary_add for newly encountered collectors for dictionary expansion and reuse, and the time cost.
Set digital voucher
Following generation of a unique collection event identifier (collection_key), this module resolves duplicate records and selects a master digital voucher, with associated data integration, to represent each event.
Duplicate Grouping and Voucher Selection
For records possessing complete collection event keys, duplicates are identified and grouped based on identical collection_key values. Within each group, a master digital voucher is selected by evaluating a composite score that combines two dimensions: metadata completeness and geospatial reliability. Metadata completeness is quantified as the sum of binary presence indicators across 10 fields, including recordedBy, locality, and catalogNumber, where presence is scored as 1 and absence as 0. Geospatial reliability is assigned according to GBIF's EnumOccurrenceIssue classification, with values of −1 indicating unaffected coordinates, −3 indicating potential positional error, and −9 indicating coordinates unsuitable for spatial analysis. The record achieving the highest combined score is designated the master voucher. Any missing metadata in the master record are supplemented using non-redundant values drawn from other records within the same group.

For records with incomplete keys (e.g., missing recordedBy or recordNumber), no grouping is performed, and each is treated as a singleton event. Nevertheless, metadata completeness and geospatial scores are computed identically to ensure consistent, system-wide data quality assessment.
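The scoring logic can be sketched as follows (a simplified illustration under the rules above; the field subset and the geospatial_score column are assumptions made for the example):

## simplified sketch of the composite voucher score
completeness_fields <- c("recordedBy", "recordNumber", "locality",
                         "catalogNumber", "year", "institutionCode")  ## subset of the 10 fields, for illustration
completeness <- rowSums(!is.na(occ[, completeness_fields]) &
                          occ[, completeness_fields] != "")           ## 1 per present field, 0 per absent
voucher_score <- completeness + occ$geospatial_score                  ## geospatial_score is -1, -3 or -9

## within one collection-event group, the highest-scoring record wins
master <- which.max(voucher_score)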
Taxonomic Assignment
For groupable events, the accepted taxon name (UltraGBIF_sample_taxon_name) is assigned as the most frequently occurring species- or subspecies-level identification among duplicates; ties are resolved alphabetically. Unidentified events are flagged (UltraGBIF_unidentified_sample = TRUE). Non-groupable records are processed individually with identical taxonomic logic.
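In base R, "most frequent name with alphabetical tie-breaking" can be sketched like this (illustrative only):

## pick the most frequent name; sort() breaks ties alphabetically
assign_taxon_name <- function(names_in_group) {
  tab <- table(names_in_group)
  sort(names(tab[tab == max(tab)]))[1]
}

assign_taxon_name(c("Saxifraga cernua", "Saxifraga cernua", "Saxifraga hirculus"))
## returns "Saxifraga cernua"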
Geospatial Integration
If the master voucher has unusable coordinates (score = −9) or lacks coordinates entirely, usable coordinates from duplicates (score ≥ −3) are preferentially retained.
Standardized Output Fields
Key output variables include:
UltraGBIF_digital_voucher: identifies the master record (logical);
UltraGBIF_duplicates: presence of duplicates (logical);
UltraGBIF_num_duplicates: count of grouped records;
UltraGBIF_non_groupable_duplicates: records with incomplete keys (logical);
UltraGBIF_duplicates_grouping_status: grouping feasibility (e.g., "groupable", "not groupable: no recordedBy");
UltraGBIF_sample_taxon_name_status: taxonomic resolution ("identified", "divergent", "not identified");
UltraGBIF_useful_for_spatial_analysis: coordinate usability (logical).
This module ensures robust duplicate resolution, metadata enrichment, and interoperability with global biodiversity standards.
voucher <- set_digital_voucher(occ_import = occ_import,
                               names.checked = names.checked,
                               collection_key = collection_key)

voucher includes the full flagged digital voucher of occurrence records and its summary, and the time cost.
Refine records
When a voucher is flagged as usable, missing values in its fields are restored by consolidating data from duplicate records belonging to identical collection events, creating one clean record per unique event, whether the event key is complete or not. It picks the most common species-level name (using alphabetical order to break ties) or labels the event "unidentified" if no such name exists; incomplete keys are treated as separate events. Missing coordinates in the main record are filled in from duplicates.
For finally complete usable vouchers, the associated occurrence records are automatically subjected to geospatial validation and precise WGSRPD attribution. Leveraging the R packages CoordinateCleaner (Zizka et al. 2019) and terra, this module efficiently performs coordinate validation and cleaning, targeting common geospatial anomalies such as misassigned capitals, geographic centroids, duplicate coordinates, GBIF headquarters artifacts, institution-proximate records, distributional outliers, marine points in terrestrial datasets, and zero/placeholder coordinates.
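For reference, the same battery of tests can be run directly with CoordinateCleaner; a minimal sketch (not the module's internal call) on a data.frame df with decimal coordinates and a species column:

library(CoordinateCleaner)

## flag suspicious coordinates with the tests listed above
flags <- clean_coordinates(x = df,
                           lon = "decimalLongitude",
                           lat = "decimalLatitude",
                           species = "species",
                           tests = c("capitals", "centroids", "equal", "gbif",
                                     "institutions", "outliers", "seas", "zeros"))
clean_df <- df[flags$.summary, ]  ## keep records that pass all tests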
Each validated record is then matched to its corresponding WGSRPD Level-3 Botanical Country classification, with distribution status (native/introduced/extinct/location_doubtful), which minimizes misclassification risks in cross-regional analyses.
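Continuing the sketch above, the point-in-polygon step can be illustrated with terra (wgsrpd3_file stands in for any source of WGSRPD Level-3 polygons and is an assumption here; LEVEL3_COD is the code field used in the TDWG shapefiles):

library(terra)

l3 <- vect(wgsrpd3_file)                            ## WGSRPD Level-3 polygons (assumed available)
pts <- vect(clean_df,
            geom = c("decimalLongitude", "decimalLatitude"),
            crs = "EPSG:4326")                      ## occurrence points
clean_df$LEVEL3_COD <- extract(l3, pts)$LEVEL3_COD  ## Botanical Country code per record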
To optimize computational efficiency, this module supports
parallelization through multi-threading by specifying the desired
concurrency level via the threads parameter.
refine <- refine_records(voucher = voucher,
                         threads = 4,
                         save_path = dirname(gbif_occurrence_file),
                         tests = c("capitals", "centroids", "equal", "gbif",
                                   "institutions", "outliers", "seas", "zeros"))

refine includes all_records for all cleaned records and native_records for the native subset, and the time cost. If you set save_path, then all_records and native_records are exported as usable_refined_records.csv.gz and native_refined_records.csv.gz, both of which are potentially valuable for your future research.
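The exported files are plain gzip-compressed CSVs, so they can be reloaded later with, for example, data.table (reading .csv.gz with fread requires the R.utils package):

library(data.table)

usable <- fread(file.path(dirname(gbif_occurrence_file), "usable_refined_records.csv.gz"))
native <- fread(file.path(dirname(gbif_occurrence_file), "native_refined_records.csv.gz"))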
Map records
This module is optional. It renders the output of refine_records, employing geohashTools (Chirico 2023) to perform spatial clustering of geo-coordinates. The precision parameter determines grid resolution (e.g., 4 for 20 km × 20 km grids, 3 for 156 km × 156 km grids). It groups records by UltraGBIF_wcvp_taxon_name, WGSRPD information, and geohash grid, and then deduplicates.
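The clustering idea can be sketched with geohashTools directly (illustrative only; it assumes the cleaned records keep the DwC decimalLatitude/decimalLongitude columns):

library(geohashTools)

## assign each record to a geohash grid cell at the chosen precision
recs <- refine$all_records
recs$geohash <- gh_encode(recs$decimalLatitude, recs$decimalLongitude, precision = 4L)

## keep one record per taxon x grid cell before mapping
dedup <- recs[!duplicated(recs[, c("UltraGBIF_wcvp_taxon_name", "geohash")]), ]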
After simplification, dynamic maps with different layers are rendered by mapview (Appelhans et al. 2023) with a WGSRPD information legend. Symbol size is controlled by the cex parameter. The maps also support zoom functionality for spatial resolution adjustment and enable interactive selection of individual coordinates to display their full attributes.
map_records(records_refined = refine,
precision = 4,
cex = 4)