
Part 3: Large Downloaded GBIF Tables
Source:vignettes/large-downloaded-gbif-tables.Rmd
large-downloaded-gbif-tables.RmdScope
This vignette documents the disk-based workflow for large downloaded GBIF tables. It is intended for situations where occurrences have already been exported from GBIF and the full table is too large or too inconvenient to keep in memory.
The workflow is built around three functions:
-
split_gbif_by_species()to split a large GBIF table into one file per species or GBIF key, -
species_csvs_to_ranges()to read those files sequentially and build one range per species, -
read_range_rds()to read the saved range outputs back from disk.
This workflow is meant for the point where direct credential-free
GBIF retrieval is no longer the most practical option. For many
single-species or moderate-volume tasks, get_gbif() is
enough. For very large downloaded exports, working from files on disk is
usually the better choice.
Why the workflow is split into two steps
get_range() is intentionally a single-species function.
That is scientifically sensible, because range inference depends on one
focal set of occurrences and one associated taxon concept at a time.
For large multi-species downloads, the package therefore separates file handling from range inference:
- first create one species file per GBIF key,
- then process those files one by one.
This design keeps peak memory use low and makes intermediate files easy to inspect, clean, archive, or parallelize outside R if needed.
It also keeps the biological interpretation clean. Each file still
corresponds to one focal GBIF taxon key, and each range is then inferred
with the same get_range() machinery described in Part
1.
Example data
The package includes a small GBIF-style offline example under
inst/extdata. The file is much smaller than a real GBIF
export, but it exercises the same workflow.
gbif_file <- ext_file("occ_example_4sps.csv")
utils::head(utils::read.delim(gbif_file, sep = "\t", stringsAsFactors = FALSE))
#> speciesKey species scientificName decimalLongitude
#> 1 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> 2 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> 3 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> 4 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> 5 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> 6 5218777 Hyaena hyaena Hyaena hyaena (Linnaeus, 1758) 36.35000
#> decimalLatitude basisOfRecord
#> 1 -1.41667 PRESERVED_SPECIMEN
#> 2 -1.41667 PRESERVED_SPECIMEN
#> 3 -1.41667 PRESERVED_SPECIMEN
#> 4 -1.41667 PRESERVED_SPECIMEN
#> 5 -1.41667 PRESERVED_SPECIMEN
#> 6 -1.18333 PRESERVED_SPECIMENStep 1: split the downloaded table by species
The first step is to stream the input file in chunks and write one species file per GBIF key.
batch_root <- file.path(tempdir(), "gbif-range-batch-vignette")
split_dir <- file.path(batch_root, "split")
occ_dir <- file.path(batch_root, "occ_min")
range_dir <- file.path(batch_root, "ranges")
# Start from a clean temporary workspace so the vignette is reproducible.
unlink(batch_root, recursive = TRUE)
split_summary <- split_gbif_by_species(
input_file = gbif_file,
outdir = split_dir,
chunk_size = 100,
sep_in = "\t",
sep_out = "\t",
overwrite = TRUE,
verbose = FALSE
)
knitr::kable(
transform(split_summary,
species_file = basename(species_file)
)[, c("species_name", "n_records", "species_file")],
align = "l"
)| species_name | n_records | species_file | |
|---|---|---|---|
| 4 | Crocuta crocuta | 7217 | occurrences_speciesKey_5218781_Crocuta_crocuta.csv |
| 3 | Hyaena brunnea | 495 | occurrences_speciesKey_5218780_Hyaena_brunnea.csv |
| 2 | Hyaena hyaena | 106 | occurrences_speciesKey_5218777_Hyaena_hyaena.csv |
| 1 | Proteles cristata | 281 | occurrences_speciesKey_2433502_Proteles_cristata.csv |
A few implementation details are worth highlighting.
The splitter reads only the requested columns, not the full source
table. It keeps the file extension as .csv for convenience,
but the files remain tab-delimited by default. This preserves GBIF-style
exports while still producing species-specific files that are easy to
browse or reuse.
This is also the right place to simplify the table before any range-building starts. In a production workflow, this step can be used to retain only the columns needed for the later analysis and to create a species-level file archive that is easier to inspect than one monolithic GBIF export.
Step 2: build ranges sequentially from disk
For this vignette, the ecoregion layer is a single enclosing polygon built from the extent of the example records. This keeps the example fully offline and lightweight while still exercising the file-based range workflow.
In ordinary use, you would typically replace this with
ecoreg = "eco_terra" or with a spatial object returned by
read_ecoreg("eco_terra").
occ_example <- utils::read.delim(gbif_file, sep = "\t", stringsAsFactors = FALSE)
eco_batch <- sf::st_sf(
data.frame(ECO_NAME = "batch_demo"),
geometry = sf::st_sfc(
sf::st_polygon(list(matrix(
c(
min(occ_example$decimalLongitude) - 5, min(occ_example$decimalLatitude) - 5,
max(occ_example$decimalLongitude) + 5, min(occ_example$decimalLatitude) - 5,
max(occ_example$decimalLongitude) + 5, max(occ_example$decimalLatitude) + 5,
min(occ_example$decimalLongitude) - 5, max(occ_example$decimalLatitude) + 5,
min(occ_example$decimalLongitude) - 5, min(occ_example$decimalLatitude) - 5
),
ncol = 2,
byrow = TRUE
))),
crs = 4326
)
)
range_summary <- species_csvs_to_ranges(
species_dir = split_dir,
ecoreg = eco_batch,
ecoreg_name = "ECO_NAME",
outdir = range_dir,
occ_outdir = occ_dir,
occ_save_as = "tsv",
range_save_as = "rds",
sep_in = "\t",
overwrite = TRUE,
degrees_outlier = 30,
clust_pts_outlier = 2,
buff_width_point = 1,
buff_incrmt_pts_line = 0.1,
buff_width_polygon = 1,
format = "SpatVector",
verbose = FALSE
)
knitr::kable(
transform(range_summary,
occ_file = sub(".*speciesKey_", "", basename(occ_file)),
range_file = sub(".*speciesKey_", "", basename(range_file))
)[, c("species_name", "n_points", "occ_file", "range_file")],
align = "l"
)| species_name | n_points | occ_file | range_file |
|---|---|---|---|
| Proteles cristatus septentrionalis Rothschild, 1902 | 232 | 2433502_Proteles_cristatus_septentrionalis_Rothschild_1902.tsv | 2433502_Proteles_cristatus_septentrionalis_Rothschild_1902.rds |
| Hyaena hyaena (Linnaeus, 1758) | 98 | 5218777_Hyaena_hyaena_Linnaeus_1758.tsv | 5218777_Hyaena_hyaena_Linnaeus_1758.rds |
| Parahyaena brunnea (Thunberg, 1820) | 413 | 5218780_Parahyaena_brunnea_Thunberg_1820.tsv | 5218780_Parahyaena_brunnea_Thunberg_1820.rds |
| Crocuta crocuta fisi Heller, 1914 | 5828 | 5218781_Crocuta_crocuta_fisi_Heller_1914.tsv | 5218781_Crocuta_crocuta_fisi_Heller_1914.rds |
Internally, species_csvs_to_ranges() reduces each
species file to the minimal structure expected by
get_range(): one focal species plus decimal longitude and
latitude. It can also save those minimal occurrence tables to disk,
which is useful if you want to inspect exactly what was passed to
get_range() after deduplication.
Once the occurrences are split to species level, it becomes straightforward to rerun the exact same batch with different ecoregions, different range arguments, or stricter deduplication while keeping the raw downloaded export unchanged.
The built-in ecoregion shortcut
The same workflow can resolve packaged ecoregions internally:
range_summary <- species_csvs_to_ranges(
species_dir = split_dir,
ecoreg = "eco_terra",
ecoreg_name = "ECO_NAME",
outdir = range_dir,
overwrite = TRUE
)This is the most convenient option when your downloaded table and your intended range product both target a broad terrestrial workflow.
It is also the most natural way to move from a large terrestrial GBIF export to a batch of ecoregion-constrained range maps without having to pre-load the ecoregion object yourself.
A production-style terrestrial batch
The offline vignette uses a simple polygon because it must build quickly and without network access. A real downloaded terrestrial workflow would normally look more like this:
#split_summary <- split_gbif_by_species(
# input_file = "gbif_download.tsv",
# outdir = "species_occurrences",
# chunk_size = 1e5,
# sep_in = "\t",
# sep_out = "\t",
# overwrite = TRUE
#)
#range_summary <- species_csvs_to_ranges(
# species_dir = "species_occurrences",
# ecoreg = "eco_terra",
# ecoreg_name = "ECO_NAME",
# outdir = "species_ranges",
# occ_outdir = "species_occurrences_min",
# occ_save_as = "tsv",
# range_save_as = "rds",
# overwrite = TRUE
#)This is the clearest batch version of the main terrestrial workflow: one GBIF export in, one species file per taxon, one saved range per taxon.
Inspect one species file before building ranges
One advantage of the split-first design is that you can inspect or clean individual species files before generating ranges. This is often useful when testing a new workflow on a few taxa before launching a full batch.
first_species_file <- split_summary$species_file[1]
utils::head(utils::read.delim(first_species_file, sep = "\t", stringsAsFactors = FALSE))
#> speciesKey species scientificName decimalLongitude
#> 1 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> 2 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> 3 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> 4 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> 5 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> 6 5218781 Crocuta crocuta Crocuta crocuta (Erxleben, 1777) 34.91667
#> decimalLatitude
#> 1 -1.41667
#> 2 -1.41667
#> 3 -1.41667
#> 4 -1.41667
#> 5 -1.41667
#> 6 -1.41667That inspection step is simple, but it is often the moment where issues such as missing coordinates, duplicated points, or unexpected names become obvious.
Step 3: read and inspect the saved ranges
If the range outputs are saved as .rds,
read_range_rds() restores them to a simple list with
init.args and rangeOutput.
# Read the first saved range to inspect its structure.
first_range <- read_range_rds(range_summary$range_file[1])
names(first_range)
#> [1] "init.args" "rangeOutput"
class(first_range$rangeOutput)
#> [1] "SpatVector"
#> attr(,"package")
#> [1] "terra"The example below overlays every saved range in the batch summary. This is a useful diagnostic pattern when checking whether a batch run produced geometries with the expected extent and shape.
range_colors <- c(
rgb(0.10, 0.40, 0.75, 0.30),
rgb(0.85, 0.35, 0.10, 0.30),
rgb(0.20, 0.65, 0.30, 0.30),
rgb(0.70, 0.65, 0.10, 0.30)
)
combined_ext <- terra::ext(first_range$rangeOutput)
if (nrow(range_summary) > 1) {
for (i in 2:nrow(range_summary)) {
range_i <- read_range_rds(range_summary$range_file[i])
combined_ext <- terra::ext(
min(terra::xmin(combined_ext), terra::xmin(range_i$rangeOutput)),
max(terra::xmax(combined_ext), terra::xmax(range_i$rangeOutput)),
min(terra::ymin(combined_ext), terra::ymin(range_i$rangeOutput)),
max(terra::ymax(combined_ext), terra::ymax(range_i$rangeOutput))
)
}
}
terra::plot(combined_ext, col = NA, legend = FALSE, main = "Batch-generated ranges")
terra::plot(first_range$rangeOutput, add = TRUE, col = range_colors[1])
if (nrow(range_summary) > 1) {
for (i in 2:nrow(range_summary)) {
range_i <- read_range_rds(range_summary$range_file[i])
terra::plot(range_i$rangeOutput, add = TRUE, col = range_colors[i])
}
}
Practical advice
The disk-based workflow is most useful in three situations.
First, when a direct in-memory call to get_gbif() is not
appropriate because the project already relies on downloaded GBIF
exports.
Second, when a large multi-species table needs to be checked, cleaned, or archived at the species level before range inference.
Third, when the same species-specific occurrence files need to be
processed repeatedly with different get_range() settings,
ecoregions, or evaluation schemes.
The third point matters scientifically, not just computationally. It makes parameter sensitivity analyses much cleaner, because you can rerun the same archived species files under alternative ecoregion layers or buffering choices without changing the underlying occurrence input.
Take-home message
The disk-based workflow turns gbif.range into a
practical bridge between very large GBIF exports and species-level range
inference. Instead of treating file splitting and range building as an
external preprocessing step, Gbif.range provides a coherent, testable,
and documented path from a downloaded GBIF table to saved species ranges
on disk.