
Part 2: GBIF Retrieval, Taxonomy, and Filtering
Source: vignettes/gbif-retrieval-and-taxonomy.Rmd
Scope
This vignette covers the package components that sit upstream of range inference:
- get_status() to inspect the GBIF backbone taxon concept and synonym set,
- get_gbif_count() to gauge likely download size,
- get_gbif() to retrieve occurrences without GBIF credentials,
- obs_filter() to thin or aggregate occurrences for downstream work,
- make_tiles() to build explicit GBIF geometry tiles,
- get_doi() to turn one or more downloads into a citable GBIF-derived DOI.
The examples below follow a practical sequence that works well in real analyses: resolve taxonomy first, estimate likely download size second, and only then choose the most appropriate retrieval strategy for the spatial scale and record volume of the analysis.
Why taxonomy comes first
In gbif.range, taxonomy is not an afterthought. The
package uses the GBIF backbone to decide which accepted taxon concept a
query refers to and which synonyms belong to that concept.
This is why get_status() is often the best first
step:
tax <- get_status("Cypripedium calceolus", level = "children")
tax
When level = "children", the output is deliberately
close to the logic used by get_gbif(): it returns the
accepted name and the synonyms that are actually used for occurrence
retrieval, including subspecies and varieties. By contrast,
level = "accepted" is only exploratory and strictly keeps
species-level names. When level = "all", additional related
names are also included for taxonomic inspection, but those extra names
are not used to download occurrences.
The practical consequence is important: get_gbif()
harmonizes the query to the accepted GBIF taxon key, but the returned
occurrence table still preserves both record-level and accepted-name
fields.
Strict versus permissive name matching
An important distinction is controlled by search. With
search = TRUE, get_gbif() expects a GBIF
backbone match for the focal taxon concept. With
search = FALSE, the function becomes more permissive and
can return records for fuzzy matches or higher-rank names.
get_gbif("Panthera tigris", search = TRUE) # records for the tiger concept
get_gbif("Panthera tigriiis", search = TRUE) # typically no records
get_gbif("Panthera tigriiis", search = FALSE) # permissive fuzzy retrieval
get_gbif("Acer", search = FALSE) # higher-rank retrieval
For most single-species analyses the strict setting is preferable, because it keeps the biological interpretation clear. The permissive mode is more appropriate when the aim is exploratory retrieval, manual inspection, or broader higher-rank data collection.
Count before download
For broad extents or common taxa, a quick count helps you decide whether a direct retrieval is appropriate:
get_gbif_count("Panthera tigris", search = TRUE)
This is particularly useful if you need to decide between:
- a direct get_gbif() call,
- a more strongly filtered call with a stricter grain or occ_samp,
- or a downloaded GBIF export that will later be processed with the disk-based batch workflow described in Part 3.
In practice, this is the quickest way to decide whether a direct
get_gbif() call remains convenient or whether the project
has crossed into the “download first, process on disk later” regime.
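The decision described above can be sketched as a small helper. This is only an illustration: choose_strategy() and both thresholds are hypothetical, not part of gbif.range, and the sketch assumes the record count is already available as a single number.

```r
# Hypothetical helper: map a record count to a retrieval strategy.
# The thresholds are illustrative assumptions, not package defaults.
choose_strategy <- function(n_records,
                            direct_max = 1e5,
                            sampled_max = 1e6) {
  stopifnot(is.numeric(n_records), length(n_records) == 1, n_records >= 0)
  if (n_records <= direct_max) {
    "direct get_gbif() call"
  } else if (n_records <= sampled_max) {
    "get_gbif() with stricter grain or occ_samp"
  } else {
    "GBIF export processed on disk (Part 3)"
  }
}

choose_strategy(2500)
choose_strategy(5e6)
```

Sensible thresholds depend on the available bandwidth, memory, and the number of taxa in the project, so they are best tuned per analysis.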
Credential-free GBIF retrieval
get_gbif() is a credential-free wrapper around
rgbif::occ_search(). The main package contribution is that
it combines taxonomic harmonization, geographic tiling of large extents,
and a practical sequence of post-download filters in one function.
# Download global tiger occurrences with the default filters.
obs_tiger <- get_gbif("Panthera tigris", grain = 100)
# Inspect the accepted name and synonym mapping used internally.
get_status("Panthera tigris", level = "children")
# Retrieve a terrestrial ecoregion layer and build the range.
eco_terra <- read_ecoreg("eco_terra")
range_tiger <- get_range(
  occ_coord = obs_tiger,
  ecoreg = eco_terra,
  ecoreg_name = "ECO_NAME"
)
The arguments most often worth adjusting are:
- grain, which filters by coordinate uncertainty and decimal precision,
- basis and establishment, which control the allowed record types,
- time_period, if the analysis targets a defined time window,
- occ_samp, if the geographic extent is so large that a full retrieval is unnecessary or impractical.
These arguments have different roles. grain is the main
spatial-quality filter. basis, establishment,
and related arguments define which kinds of records are biologically
acceptable. occ_samp is the pragmatic scaling argument when
a full credential-free retrieval would be too slow or too large for the
immediate task.
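To make the idea of a spatial-quality filter concrete, here is a minimal base R sketch of an uncertainty filter on a toy table. It is not the implementation used by get_gbif(): filter_by_grain() is a hypothetical helper, the sketch assumes the grain is expressed in kilometres, it keeps records with no stated uncertainty, and it leaves out the decimal-precision check. The column name coordinateUncertaintyInMeters follows the GBIF/Darwin Core field.

```r
# Toy records; coordinateUncertaintyInMeters is the GBIF/Darwin Core field name.
occ <- data.frame(
  decimalLongitude = c(8.54, 8.5, 9.123456, 7.1),
  decimalLatitude = c(47.37, 47.4, 46.654321, 46.2),
  coordinateUncertaintyInMeters = c(50, 250000, NA, 99000)
)

# Hypothetical helper: keep records whose uncertainty fits within the grain.
# Assumptions: grain is in kilometres; records without a stated
# uncertainty are kept rather than dropped.
filter_by_grain <- function(occ, grain_km) {
  unc_km <- occ$coordinateUncertaintyInMeters / 1000
  keep <- is.na(unc_km) | unc_km <= grain_km
  occ[keep, , drop = FALSE]
}

occ_100 <- filter_by_grain(occ, grain_km = 100)
nrow(occ_100)
```

With the toy data, only the record with a 250 km uncertainty is removed at a 100 km grain. A finer grain removes more records, which is the trade-off to weigh against the resolution of the downstream analysis.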
A marine large-extent example
Delphinus delphis is a good example of a case where record volume becomes a practical constraint. The point is not that marine workflows are special inside the package, but that large extents can require more deliberate choices about sampling and ecoregions.
# Retrieve a manageable subsample per internally generated tile.
obs_delphis <- get_gbif(
  sp_name = "Delphinus delphis",
  occ_samp = 1000
)
# Inspect the internal get_gbif() logic and all related names.
get_status("Delphinus delphis", level = "all")
# Build a marine range from the packaged ecoregions.
eco_marine <- read_ecoreg("eco_marine")
range_delphis <- get_range(
  occ_coord = obs_delphis,
  ecoreg = eco_marine,
  ecoreg_name = "ECOREGION"
)
This is exactly the kind of analysis where
get_gbif_count() is useful upstream. If the expected volume
is extremely large and the project targets many taxa, the disk-based
workflow in Part 3 is usually the more scalable choice.
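The idea of drawing a capped subsample per tile can be illustrated on a toy table in base R. This is a sketch of the general technique only, not of how get_gbif() applies occ_samp internally; sample_per_tile() and the tile column are hypothetical.

```r
set.seed(1)

# Toy records spread very unevenly over three tiles.
occ <- data.frame(
  tile = rep(c("A", "B", "C"), times = c(5, 50, 500)),
  id = seq_len(555)
)

# Hypothetical helper: draw at most n_max rows from each tile.
sample_per_tile <- function(occ, n_max) {
  parts <- lapply(split(occ, occ$tile), function(d) {
    d[sample.int(nrow(d), min(n_max, nrow(d))), , drop = FALSE]
  })
  do.call(rbind, parts)
}

occ_sub <- sample_per_tile(occ, n_max = 20)
table(occ_sub$tile)
```

Sparse tiles keep all of their records while dense tiles are capped, so the subsample preserves geographic coverage better than a single global random draw of the same size.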
Post-download thinning with obs_filter()
The package also includes a lightweight grid-based thinning helper. This is useful when many records fall into the same grid cell and you want one retained observation per species per cell before a downstream analysis.
The example below uses the bundled offline GBIF-style example table rather than a live web query.
occ_raw <- utils::read.delim(ext_file("occ_example_4sps.csv"), sep = "\t", stringsAsFactors = FALSE)
occ_raw$input_search <- occ_raw$species
occ_gbif <- gbif.range:::getGBIF(occ_raw)
# Build a coarse grid that spans the example records.
grid <- terra::rast(
  xmin = min(occ_gbif$decimalLongitude) - 1,
  xmax = max(occ_gbif$decimalLongitude) + 1,
  ymin = min(occ_gbif$decimalLatitude) - 1,
  ymax = max(occ_gbif$decimalLatitude) + 1,
  resolution = 10,
  crs = "EPSG:4326"
)
# Keep at most one record per species and grid cell.
obs_thin <- obs_filter(occ_gbif, grid)
head(obs_thin)
#> Species x y
#> 1 Crocuta crocuta 34.23704 -0.99994
#> 2 Crocuta crocuta 24.23704 -20.99994
#> 3 Crocuta crocuta 44.23704 9.00006
#> 4 Crocuta crocuta 24.23704 -10.99994
#> 5 Crocuta crocuta 34.23704 -20.99994
#> 6 Crocuta crocuta 14.23704 -20.99994
This kind of thinning does not replace ecological cleaning or taxonomic checks, but it can be a useful way to reduce local clustering before plotting or model calibration.
Dense clusters of records can easily dominate visual inspection or downstream calibration even when they add little new geographic information, so this kind of lightweight aggregation is often worth doing early.
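The logic of grid-based thinning can be sketched without any spatial package. The helper below, thin_by_cell(), is hypothetical and is not how obs_filter() is implemented: it assumes square cells aligned to the origin and keeps the original coordinates of the first record in each cell, which may differ from what obs_filter() returns.

```r
# Toy records for two species; several fall into the same 10-degree cell.
occ <- data.frame(
  species = c("sp1", "sp1", "sp1", "sp2", "sp2"),
  x = c(1.2, 3.8, 12.5, 2.0, 2.1),
  y = c(0.5, 9.9, 0.5, 4.0, 4.2)
)

# Hypothetical helper: keep the first record per species and square cell.
thin_by_cell <- function(occ, res) {
  cell_x <- floor(occ$x / res)
  cell_y <- floor(occ$y / res)
  key <- paste(occ$species, cell_x, cell_y, sep = "_")
  occ[!duplicated(key), , drop = FALSE]
}

occ_thin <- thin_by_cell(occ, res = 10)
occ_thin
```

Because the key includes the species name, two species sharing a cell each keep one record, which matches the "one retained observation per species per cell" behaviour described above.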
Explicit tiling workflows
get_gbif() handles tiling internally, but
make_tiles() is available when you want to create
GBIF-ready geometry tiles yourself:
tiles <- make_tiles(terra::ext(-20, 40, 0, 60), nsize = 5)
tiles[[1]]
This is mostly useful for custom rgbif workflows or
explicit diagnostics of how an extent is being subdivided.
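For readers who want to see what subdividing an extent into geometry tiles involves, here is a base R sketch that splits a bounding box into n by n rectangles written as WKT polygons. The helper tile_extent_wkt() is hypothetical; it is not the make_tiles() implementation, and its n argument is not necessarily interpreted the same way as nsize.

```r
# Hypothetical helper: split an extent into n x n rectangular WKT polygons.
tile_extent_wkt <- function(xmin, xmax, ymin, ymax, n) {
  xs <- seq(xmin, xmax, length.out = n + 1)
  ys <- seq(ymin, ymax, length.out = n + 1)
  tiles <- character(0)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Counter-clockwise ring, closed on the first vertex.
      tiles <- c(tiles, sprintf(
        "POLYGON((%s %s,%s %s,%s %s,%s %s,%s %s))",
        xs[i], ys[j], xs[i + 1], ys[j], xs[i + 1], ys[j + 1],
        xs[i], ys[j + 1], xs[i], ys[j]
      ))
    }
  }
  tiles
}

tiles_wkt <- tile_extent_wkt(-20, 40, 0, 60, n = 2)
length(tiles_wkt)
tiles_wkt[1]
```

Each string describes one rectangle, so a custom workflow could loop over the tiles and query them one at a time.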
Reproducible GBIF citation
Once one or more get_gbif() calls have been made,
get_doi() can be used to generate a GBIF-derived DOI
reference for those records:
doi_url <- get_doi(obs_tiger)
doi_url
This is a small function, but it is scientifically important because it improves traceability and citation of the exact GBIF-derived datasets used in an analysis.
Take-home message
The GBIF-facing side of gbif.range is built around a
clear sequence:
- inspect the GBIF backbone taxon concept with get_status(),
- count likely record volume with get_gbif_count(),
- download filtered occurrences with get_gbif(),
- optionally thin or aggregate them with obs_filter(),
- cite the resulting GBIF-derived datasets with get_doi().
That sequence keeps taxonomic interpretation, record selection, and reproducibility visible before range inference begins.