
Conversation

@roliveros-ramos

I ran into some species with millions of records and no way to cap the number of records to download; being able to do so is very useful for an initial exploration in particular. This pull request adds a new argument to the function 'occurrence': limit, controlling the maximum number of records to download. The default value (NULL) reproduces the previous behaviour of the function, downloading all records, so nothing is expected to break with this change.
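For example, an initial exploration could be capped like this (illustrative call, reusing a species from the package examples):

records <- occurrence(scientificname = "Abra sibogai", limit = 100)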

@sformel
Collaborator

@roliveros-ramos apologies that it has taken us a while to respond to this PR. The idea seems sound to me; I have reviewed the code and the implementation works.

@pieterprovoost you can approve this PR, but feel free to review and offer feedback if you would like.

@pieterprovoost
Member

@sformel There is a potential problem with this implementation: I believe the total returned by the API is currently exact (I have to double-check the API implementation), but it may become an estimate as the database grows; that's why the message says "Retrieved x records of approximately x". So I would prefer that this not be used as a hard limit to stop fetching data. Also, I don't think there's a need for a separate count query, as the first page already includes the total. I agree with adding the parameter, but I think the implementation can be simpler: just adjust the page size if the limit is smaller, and stop iterating when the requested number of records has been fetched.
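Something like this minimal sketch of the paging logic (fetch_page() is a hypothetical stand-in for the package's http_request call; page_size() is the existing helper):

after <- "-1"   # paging cursor used by the OBIS API
fetched <- 0
repeat {
  # request no more than the number of records still needed
  size <- if (is.null(limit)) page_size() else min(limit - fetched, page_size())
  page <- fetch_page(after = after, size = size)  # hypothetical helper wrapping the occurrence request
  if (nrow(page) == 0) break                      # results exhausted
  fetched <- fetched + nrow(page)
  after <- tail(page$id, 1)                       # advance the cursor
  if (!is.null(limit) && fetched >= limit) break  # requested number of records fetched
}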

@sformel
Collaborator

commented Jun 18, 2025

Good points. I've worked with ChatGPT to address these concerns and I think I did it correctly:

#' Find occurrences.
#'
#' @usage occurrence(scientificname = NULL, taxonid = NULL, datasetid = NULL,
#'   nodeid = NULL, instituteid = NULL, areaid = NULL, startdate = NULL, enddate = NULL,
#'   startdepth = NULL, enddepth = NULL, geometry = NULL,
#'   measurementtype = NULL, measurementtypeid = NULL, measurementvalue = NULL,
#'   measurementvalueid = NULL, measurementunit = NULL, measurementunitid = NULL,
#'   redlist = NULL, hab = NULL, wrims = NULL, extensions = NULL, hasextensions = NULL,
#'   mof = NULL, dna = NULL, absence = NULL, event = NULL, dropped = NULL,
#'   flags = NULL, exclude = NULL, fields = NULL, qcfields = NULL, verbose = FALSE,
#'   limit = NULL)
#' @param scientificname the scientific name.
#' @param taxonid the taxon identifier (WoRMS AphiaID).
#' @param datasetid the dataset identifier.
#' @param nodeid the OBIS node identifier.
#' @param instituteid the OBIS institute identifier.
#' @param areaid the OBIS area identifier.
#' @param startdate the earliest date on which the occurrence took place.
#' @param enddate the latest date on which the occurrence took place.
#' @param startdepth the minimum depth below the sea surface.
#' @param enddepth the maximum depth below the sea surface.
#' @param measurementtype the measurement type to be included in the measurements data.
#' @param measurementtypeid the measurement type ID to be included in the measurements data.
#' @param measurementvalue the measurement value to be included in the measurements data.
#' @param measurementvalueid the measurement value ID to be included in the measurements data.
#' @param measurementunit the measurement unit to be included in the measurements data.
#' @param measurementunitid the measurement unit ID to be included in the measurements data.
#' @param geometry a WKT geometry string.
#' @param redlist include only IUCN Red List species.
#' @param hab include only IOC-UNESCO HAB species.
#' @param wrims include only WRiMS species.
#' @param extensions which extensions to include (e.g. MeasurementOrFact, DNADerivedData, default = \code{NULL}).
#' @param hasextensions which extensions need to be present (e.g. MeasurementOrFact, DNADerivedData, default = \code{NULL}).
#' @param mof include measurements data (default = \code{NULL}).
#' @param dna include DNA data (default = \code{NULL}).
#' @param absence only include absence records (\code{TRUE}), exclude absence records (\code{NULL}), or include both presence and absence records (\code{"include"}).
#' @param event only include pure event records (\code{TRUE}), exclude pure event records (\code{NULL}), or include both occurrence and pure event records (\code{"include"}).
#' @param dropped only include dropped records (\code{TRUE}), exclude dropped records (\code{NULL}), or include both dropped and non-dropped records (\code{"include"}).
#' @param flags quality flags which need to be set.
#' @param exclude quality flags to be excluded from the results.
#' @param fields fields to be included in the results.
#' @param qcfields include lists of missing and invalid fields (default = \code{NULL}).
#' @param verbose logical. Optional parameter to enable verbose logging (default = \code{FALSE}).
#' @param limit the maximum number of records to return (default = \code{NULL}, which downloads all available records).
#' @return The occurrence records.
#' @examples
#' records <- occurrence(scientificname = "Abra sibogai")
#' records <- occurrence(taxonid = 141438, startdate = as.Date("2007-10-10"))
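#' # cap the number of downloaded records (illustrative use of the new argument)
#' records <- occurrence(taxonid = 141438, limit = 100)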
#' @export
occurrence <- function(
  scientificname = NULL,
  taxonid = NULL,
  datasetid = NULL,
  nodeid = NULL,
  instituteid = NULL,
  areaid = NULL,
  startdate = NULL,
  enddate = NULL,
  startdepth = NULL,
  enddepth = NULL,
  geometry = NULL,
  measurementtype = NULL,
  measurementtypeid = NULL,
  measurementvalue = NULL,
  measurementvalueid = NULL,
  measurementunit = NULL,
  measurementunitid = NULL,
  redlist = NULL,
  hab = NULL,
  wrims = NULL,
  extensions = NULL,
  hasextensions = NULL,
  mof = NULL,
  dna = NULL,
  absence = NULL,
  event = NULL,
  dropped = NULL,
  flags = NULL,
  exclude = NULL,
  fields = NULL,
  qcfields = NULL,
  verbose = FALSE,
  limit = NULL
) {

  after <- "-1"
  result_list <- list()
  last_page <- FALSE
  i <- 1
  fetched <- 0

  # Construct query
  query <- list(
    scientificname = handle_vector(scientificname),
    taxonid = handle_vector(taxonid),
    datasetid = handle_vector(datasetid),
    nodeid = handle_vector(nodeid),
    instituteid = handle_vector(instituteid),
    areaid = handle_vector(areaid),
    startdate = handle_date(startdate),
    enddate = handle_date(enddate),
    startdepth = startdepth,
    enddepth = enddepth,
    geometry = geometry,
    measurementtype = measurementtype,
    measurementtypeid = measurementtypeid,
    measurementvalue = measurementvalue,
    measurementvalueid = measurementvalueid,
    measurementunit = measurementunit,
    measurementunitid = measurementunitid,
    redlist = handle_logical(redlist),
    hab = handle_logical(hab),
    wrims = handle_logical(wrims),
    extensions = handle_vector(extensions),
    hasextensions = handle_vector(hasextensions),
    mof = handle_logical(mof),
    dna = handle_logical(dna),
    absence = absence,
    event = event,
    dropped = dropped,
    flags = handle_vector(flags),
    exclude = handle_vector(exclude),
    fields = handle_fields(fields),
    qcfields = handle_logical(qcfields)
  )

  # Optional usage tracking
  if (getOption("robis_log_usage", TRUE)) {
    http_request("GET", "metrics/logusage", c(query, list(agent = "robis")), verbose)
  }

  total <- NULL  # Use NULL to indicate "not yet assigned", safer than NA

  # Set initial page size, capped by limit if provided
  psize <- if (is.null(limit)) page_size() else min(limit, page_size())

  # Fetch pages until the results are exhausted or the limit is reached
  repeat {
    # Request a page of occurrence data
    result <- http_request("GET", "occurrence", c(query, list(
      after = after,
      size = psize,
      total = FALSE  # avoid triggering extra count logic on backend
    )), verbose)

    if (is.null(result)) return(invisible(NULL))

    res <- fromJSON(content(result, "text", encoding = "UTF-8"), simplifyVector = TRUE)

    # If the results are empty or not a data frame, we're done
    if (is.null(res$results) || !is.data.frame(res$results) || nrow(res$results) == 0) {
      break
    }

    # Extract page results once for clarity
    results <- res$results

    # Assign estimated total for progress logging (only on first page)
    if (is.null(total) && !is.null(res$total)) {
      total <- res$total
    }

    # Flatten any array-type fields
    if ("node_id" %in% names(results)) {
      results$node_id <- sapply(results$node_id, paste0, collapse = ",")
    }
    if ("flags" %in% names(results)) {
      results$flags <- sapply(results$flags, paste0, collapse = ",")
      results$flags[results$flags == ""] <- NA
    }
    if ("invalid" %in% names(results)) {
      results$invalid <- sapply(results$invalid, paste0, collapse = ",")
      results$invalid[results$invalid == ""] <- NA
    }
    if ("missing" %in% names(results)) {
      results$missing <- sapply(results$missing, paste0, collapse = ",")
      results$missing[results$missing == ""] <- NA
    }

    # Force certain columns to character to avoid type conflicts
    character_cols <- c("sex", "testing")
    results <- results %>%
      mutate_at(intersect(names(results), character_cols), as.character)

    # Store this page
    result_list[[i]] <- results
    fetched <- fetched + nrow(results)
    i <- i + 1

    # Update paging cursor
    after <- tail(results$id, 1)

    # Log progress using the estimated total (if known)
    log_progress(fetched, total)

    # Stop if we've reached or exceeded the requested limit
    if (!is.null(limit) && fetched >= limit) {
      break
    }

    # Shrink the next page size so the final request does not overfetch past the limit
    if (!is.null(limit)) {
      psize <- min(limit - fetched, page_size())
    }
  }

  # Combine all pages into one data frame
  data <- bind_rows(result_list)

  # If more records than needed were fetched (very rare), truncate
  if (!is.null(limit) && nrow(data) > limit) {
    data <- data[seq_len(limit), ]
  }

  # Optionally compute average depth if relevant columns are available
  depthFields <- intersect(c("minimumDepthInMeters", "maximumDepthInMeters"), names(data))
  if (length(depthFields) > 0) {
    data$depth <- rowMeans(data[depthFields], na.rm = TRUE)
    data$depth[is.nan(data$depth)] <- NA
  }

  return(as_tibble(data))
}
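A quick check of the new behaviour (illustrative; an exact count of 100 assumes at least 100 matching records exist on the server):

records <- occurrence(taxonid = 141438, limit = 100)
nrow(records)  # at most 100, guaranteed by the final truncation step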

@roliveros-ramos I can submit this as a new PR, or you could modify yours; whichever you prefer.
