Part 2: GBIF Retrieval, Taxonomy, and Filtering

Scope

This vignette covers the package components that sit upstream of range inference:

get_status() to inspect the GBIF backbone taxon concept and synonym set,
get_gbif_count() to gauge likely download size,
get_gbif() to retrieve occurrences without GBIF credentials,
obs_filter() to thin or aggregate occurrences for downstream work,
make_tiles() to build explicit GBIF geometry tiles,
get_doi() to turn one or more downloads into a citable GBIF-derived DOI.

The examples below follow a sequence that works well in practice: resolve taxonomy first, estimate the likely download size second, and only then choose the retrieval strategy that suits the spatial scale and record volume of the analysis.

Why taxonomy comes first

In gbif.range, taxonomy is not an afterthought. The package uses the GBIF backbone to decide which accepted taxon concept a query refers to and which synonyms belong to that concept.

This is why get_status() is often the best first step:

tax <- get_status("Cypripedium calceolus", level = "children")
tax

When level = "children", the output deliberately mirrors the logic used by get_gbif(): it returns the accepted name and the synonyms that are actually used for occurrence retrieval, including subspecies and varieties. By contrast, level = "accepted" is exploratory only and strictly keeps species-level names. When level = "all", additional related names are also included for taxonomic inspection, but those extra names are not used to download occurrences.

The practical consequence is important: get_gbif() harmonizes the query to the accepted GBIF taxon key, but the returned occurrence table still preserves both record-level and accepted-name fields.

Strict versus permissive name matching

The search argument controls how strictly names are matched. With search = TRUE, get_gbif() expects a GBIF backbone match for the focal taxon concept. With search = FALSE, the function becomes more permissive and can return records for fuzzy matches or higher-rank names.

get_gbif("Panthera tigris", search = TRUE)      # records for the tiger concept
get_gbif("Panthera tigriiis", search = TRUE)    # typically no records
get_gbif("Panthera tigriiis", search = FALSE)   # permissive fuzzy retrieval
get_gbif("Acer", search = FALSE)                # higher-rank retrieval

For most single-species analyses the strict setting is preferable, because it keeps the biological interpretation clear. The permissive mode is more appropriate when the aim is exploratory retrieval, manual inspection, or broader higher-rank data collection.

Count before download

For broad extents or common taxa, a quick count helps you decide whether a direct retrieval is appropriate:

get_gbif_count("Panthera tigris", search = TRUE)

This is particularly useful if you need to decide between:

a direct get_gbif() call,
a more strongly filtered call with a stricter grain or occ_samp,
or a downloaded GBIF export that will later be processed with the disk-based batch workflow described in Part 3.

The count tells you quickly whether a direct get_gbif() call remains convenient or whether the project has crossed into the “download first, process on disk later” regime.
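The decision described above can be written down as a small rule. The sketch below is only an illustration: choose_retrieval() is a hypothetical helper, not part of gbif.range, and the two thresholds are arbitrary placeholders that you would tune to your own bandwidth, memory, and number of taxa.

```r
# Hypothetical helper (not a gbif.range function).
# The default thresholds are illustrative assumptions, not package recommendations.
choose_retrieval <- function(n_records, direct_max = 1e5, sampled_max = 1e6) {
  stopifnot(is.numeric(n_records), length(n_records) == 1, n_records >= 0)
  if (n_records <= direct_max) {
    "direct get_gbif() call"
  } else if (n_records <= sampled_max) {
    "get_gbif() with stricter grain or occ_samp"
  } else {
    "GBIF export processed on disk (Part 3)"
  }
}

# In a real session the input would be a count such as the one
# returned by get_gbif_count(); plain numbers are used here.
choose_retrieval(2500)
choose_retrieval(5e6)
```

Keeping the thresholds as arguments makes the rule explicit and easy to report alongside the analysis.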

Credential-free GBIF retrieval

get_gbif() is a credential-free wrapper around rgbif::occ_search(). The main package contribution is that it combines taxonomic harmonization, geographic tiling of large extents, and a practical sequence of post-download filters in one function.

# Download global tiger occurrences with the default filters.
obs_tiger <- get_gbif("Panthera tigris", grain = 100)

# Inspect the accepted name and synonym mapping used internally.
get_status("Panthera tigris", level = "children")

# Retrieve a terrestrial ecoregion layer and build the range.
eco_terra <- read_ecoreg("eco_terra")
range_tiger <- get_range(
  occ_coord = obs_tiger,
  ecoreg = eco_terra,
  ecoreg_name = "ECO_NAME"
)

The arguments most often worth adjusting are:

grain, which filters by coordinate uncertainty and decimal precision,
basis and establishment, which control the allowed record types,
time_period, if the analysis targets a defined time window,
occ_samp, if the geographic extent is so large that a full retrieval is unnecessary or impractical.

These arguments have different roles. grain is the main spatial-quality filter. basis, establishment, and related arguments define which kinds of records are biologically acceptable. occ_samp is the pragmatic scaling argument when a full credential-free retrieval would be too slow or too large for the immediate task.
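To make these roles concrete, the sketch below mimics three of the filters on a toy table in base R. It is not the package's implementation: sketch_filter() is a hypothetical function, the toy values are invented, and the rule "keep records whose coordinate uncertainty is unknown or at most half the grain" is an assumption chosen for illustration. Only the column names follow standard GBIF fields.

```r
# Toy records with GBIF-style column names; all values are invented.
toy <- data.frame(
  basisOfRecord = c("HUMAN_OBSERVATION", "FOSSIL_SPECIMEN",
                    "PRESERVED_SPECIMEN", "HUMAN_OBSERVATION"),
  year = c(2015, 1900, 1950, 2020),
  coordinateUncertaintyInMeters = c(500, 100, NA, 80000),
  stringsAsFactors = FALSE
)

# Hypothetical filter illustrating the roles of basis, time_period, and grain.
sketch_filter <- function(occ, basis, time_period, grain_km) {
  # Record-type filter: which kinds of records are acceptable.
  ok_basis <- occ$basisOfRecord %in% basis
  # Time-window filter: keep records inside the closed interval.
  ok_time <- !is.na(occ$year) &
    occ$year >= time_period[1] & occ$year <= time_period[2]
  # Spatial-quality filter (assumed rule): uncertainty unknown or <= grain / 2.
  max_unc <- grain_km * 1000 / 2
  unc <- occ$coordinateUncertaintyInMeters
  ok_unc <- is.na(unc) | unc <= max_unc
  occ[ok_basis & ok_time & ok_unc, , drop = FALSE]
}

kept <- sketch_filter(
  toy,
  basis = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN"),
  time_period = c(1940, 2025),
  grain_km = 100
)
kept
```

The fossil record fails the record-type filter and the last record fails the spatial-quality filter, which shows that the filters act independently and that any one of them can remove a record.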

A marine large-extent example

Delphinus delphis is a good example of a case where record volume becomes a practical constraint. The point is not that marine workflows are special inside the package, but that large extents can require more deliberate choices about sampling and ecoregions.

# Retrieve a manageable subsample per internally generated tile.
obs_delphis <- get_gbif(
  sp_name = "Delphinus delphis",
  occ_samp = 1000
)

# List all related names for taxonomic inspection (not all are used for retrieval).
get_status("Delphinus delphis", level = "all")

# Build a marine range from the packaged ecoregions.
eco_marine <- read_ecoreg("eco_marine")
range_delphis <- get_range(
  occ_coord = obs_delphis,
  ecoreg = eco_marine,
  ecoreg_name = "ECOREGION"
)

This is exactly the kind of analysis where get_gbif_count() is useful upstream. If the expected volume is extremely large and the project targets many taxa, the disk-based workflow in Part 3 is usually the more scalable choice.

Post-download thinning with obs_filter()

The package also includes a lightweight grid-based thinning helper. This is useful when many records fall into the same grid cell and you want one retained observation per species per cell before a downstream analysis.

The example below uses the bundled offline GBIF-style example table rather than a live web query.

# Read the bundled tab-delimited example table.
occ_raw <- utils::read.delim(
  ext_file("occ_example_4sps.csv"),
  sep = "\t",
  stringsAsFactors = FALSE
)
occ_raw$input_search <- occ_raw$species
occ_gbif <- gbif.range:::getGBIF(occ_raw)

# Build a coarse grid that spans the example records.
grid <- terra::rast(
  xmin = min(occ_gbif$decimalLongitude) - 1,
  xmax = max(occ_gbif$decimalLongitude) + 1,
  ymin = min(occ_gbif$decimalLatitude) - 1,
  ymax = max(occ_gbif$decimalLatitude) + 1,
  resolution = 10,
  crs = "EPSG:4326"
)

# Keep at most one record per species and grid cell.
obs_thin <- obs_filter(occ_gbif, grid)
head(obs_thin)
#>           Species        x         y
#> 1 Crocuta crocuta 34.23704  -0.99994
#> 2 Crocuta crocuta 24.23704 -20.99994
#> 3 Crocuta crocuta 44.23704   9.00006
#> 4 Crocuta crocuta 24.23704 -10.99994
#> 5 Crocuta crocuta 34.23704 -20.99994
#> 6 Crocuta crocuta 14.23704 -20.99994

This kind of thinning does not replace ecological cleaning or taxonomic checks, but it can be a useful way to reduce local clustering before plotting or model calibration.

Dense clusters of records can dominate visual inspection or model calibration while adding little new geographic information, so it is often worth thinning early.
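For readers who want to see the idea without any dependencies, here is a minimal base R sketch of grid-based thinning. It is an illustration of the general technique, not the code behind obs_filter(): thin_by_cell() is a hypothetical function, the toy coordinates are invented, and reporting the cell centre for each retained record is an assumption suggested by the regularly spaced coordinates in the output above.

```r
# Hypothetical helper: keep one record per species and grid cell,
# reported at the cell centre. The grid origin (x0, y0) and the
# resolution are in the same units as the coordinates.
thin_by_cell <- function(occ, res, x0 = -180, y0 = -90) {
  # Integer column and row index of the cell containing each point.
  col <- floor((occ$x - x0) / res)
  row <- floor((occ$y - y0) / res)
  # A record is a duplicate if species, column, and row all match.
  key <- paste(occ$species, col, row, sep = "|")
  keep <- !duplicated(key)
  data.frame(
    species = occ$species[keep],
    x = x0 + (col[keep] + 0.5) * res,
    y = y0 + (row[keep] + 0.5) * res,
    stringsAsFactors = FALSE
  )
}

# Invented toy records: two sp_a points share a 10-degree cell.
toy_occ <- data.frame(
  species = c("sp_a", "sp_a", "sp_a", "sp_b"),
  x = c(1.2, 3.9, 14.0, 2.0),
  y = c(0.5, 7.1, 0.5, 2.0),
  stringsAsFactors = FALSE
)
thinned <- thin_by_cell(toy_occ, res = 10)
thinned
```

Because the species name is part of the key, two species that share a cell are both retained, which matches the "one record per species and grid cell" behaviour described above.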

Explicit tiling workflows

get_gbif() handles tiling internally, but make_tiles() is available when you want to create GBIF-ready geometry tiles yourself:

tiles <- make_tiles(terra::ext(-20, 40, 0, 60), nsize = 5)
tiles[[1]]

This is mostly useful for custom rgbif workflows or explicit diagnostics of how an extent is being subdivided.
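The general idea behind tiling can be sketched in base R as follows. This is not how make_tiles() is implemented, and its return format may differ: sketch_tiles() is a hypothetical function, and its n argument (tiles per side) is an assumption of this sketch rather than the meaning of nsize. The polygons are written as WKT rings in counter-clockwise order, which is the vertex order GBIF expects for geometry filters.

```r
# Hypothetical helper: split an extent into n x n rectangular tiles
# and return each tile as a WKT polygon string.
sketch_tiles <- function(xmin, xmax, ymin, ymax, n) {
  xs <- seq(xmin, xmax, length.out = n + 1)
  ys <- seq(ymin, ymax, length.out = n + 1)
  tiles <- character(0)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Counter-clockwise ring, closed by repeating the first vertex.
      tiles <- c(tiles, sprintf(
        "POLYGON((%s %s,%s %s,%s %s,%s %s,%s %s))",
        xs[i], ys[j],
        xs[i + 1], ys[j],
        xs[i + 1], ys[j + 1],
        xs[i], ys[j + 1],
        xs[i], ys[j]
      ))
    }
  }
  tiles
}

tiles_wkt <- sketch_tiles(-20, 40, 0, 60, n = 2)
tiles_wkt[1]
```

Strings of this kind could be passed one at a time to a geometry-aware rgbif query, which is the type of custom workflow explicit tiles are meant to support.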

Reproducible GBIF citation

Once one or more get_gbif() calls have been made, get_doi() can be used to generate a GBIF-derived DOI reference for those records:

doi_url <- get_doi(obs_tiger)
doi_url

This is a small function, but it is scientifically important because it improves traceability and citation of the exact GBIF-derived datasets used in an analysis.

Take-home message

The GBIF-facing side of gbif.range is built around a clear sequence:

inspect the GBIF backbone taxon concept with get_status(),
count likely record volume with get_gbif_count(),
download filtered occurrences with get_gbif(),
optionally thin or aggregate them with obs_filter(),
cite the resulting GBIF-derived datasets with get_doi().

That sequence keeps taxonomic interpretation, record selection, and reproducibility visible before range inference begins.