1 change: 1 addition & 0 deletions DESCRIPTION
@@ -80,6 +80,7 @@ Collate:
'dataset-places365.R'
'dataset-plankton.R'
'dataset-rf100-peixos.R'
'dataset-vggface2.R'
'extension.R'
'globals.R'
'imagenet.R'
1 change: 1 addition & 0 deletions NAMESPACE
@@ -214,6 +214,7 @@ export(transform_rotate)
export(transform_ten_crop)
export(transform_to_tensor)
export(transform_vflip)
export(vggface2_dataset)
export(vision_make_grid)
export(whoi_plankton_dataset)
export(whoi_small_coralnet_dataset)
2 changes: 2 additions & 0 deletions NEWS.md
@@ -5,11 +5,13 @@
* Added `lfw_people_dataset()` and `lfw_pairs_dataset()` for loading Labelled Faces in the Wild (LFW) datasets (@DerrickUnleashed, #203).
* Added `places365_dataset()` for loading the Places365 dataset (@koshtiakanksha, #196).
* Added `pascal_segmentation_dataset()` and `pascal_detection_dataset()` for loading the Pascal Visual Object Classes datasets (@DerrickUnleashed, #209).
* Added `whoi_plankton_dataset()`, and `whoi_small_plankton_dataset()` (@cregouby, #236).
* Added `whoi_plankton_dataset()`, `whoi_small_plankton_dataset()`, and `whoi_small_coral_dataset()` (@cregouby, #236).
* Added `rf100_document_collection()`, `rf100_medical_collection()`, `rf100_biology_collection()`, `rf100_damage_collection()`, `rf100_infrared_collection()`,
  and `rf100_underwater_collection()`. These are collections of Roboflow 100 datasets grouped by theme, for a total of 35 datasets (@koshtiakanksha, @cregouby, #239).
* Added `rf100_peixos_segmentation_dataset()` (@koshtiakanksha, @cregouby, #250).
* Added `vggface2_dataset()` for loading the VGGFace2 dataset (@DerrickUnleashed, #238).

## New models

185 changes: 185 additions & 0 deletions R/dataset-vggface2.R
@@ -0,0 +1,185 @@
#' VGGFace2 Dataset
#'
#' The VGGFace2 dataset is a large-scale face recognition dataset containing images
#' of celebrities from a wide range of ethnicities, professions, and ages.
#' Each identity has multiple images with large variations in pose, age, illumination,
#' ethnicity, and profession.
#'
#' @inheritParams oxfordiiitpet_dataset
#' @param root Character. Root directory where the dataset will be stored under `root/vggface2`.
#'
#' @return A torch dataset object `vggface2_dataset`:
#' - `x`: RGB image array.
#' - `y`: Integer label (1…N) for the identity.
Collaborator:
todo (blocking): `y` integer label is the output of a classification dataset, not an instance segmentation dataset. Either change one or the other.

#'
#' `ds$classes` is a named list mapping integer labels to a list with:
#' - `name`: Character name of the person.
#' - `gender`: "Male" or "Female".
#'
#' @examples
#' \dontrun{
#' # Load the training set
#' ds <- vggface2_dataset(download = TRUE)
#' item <- ds[1]
#' item$x # RGB image array
#' item$y # integer label
#' ds$classes[item$y] # list(name = ..., gender = ...)
#'
#' # Load the test set
#' ds <- vggface2_dataset(download = TRUE, train = FALSE)
#' item <- ds[1]
#' item$x # RGB image array
#' item$y # integer label
#' ds$classes[item$y] # list(name = ..., gender = ...)
#' }
#'
#' @family segmentation_dataset
Collaborator:
question: this is the first instance segmentation dataset in the repo. Shall we mix it with object segmentation datasets (`@family segmentation_dataset`), or shall we define a new family? Defining a new family would require an update of the website (see the dataset categories in the `_pkgdown.yml` file) and a test/update for proper handling by downstream functions (`draw_segmentation_mask()`, ...).
suggestion: leave it like this, but create an issue "vggface2 is an instance segmentation dataset" with a todo list on that.

#' @export
vggface2_dataset <- torch::dataset(
name = "vggface2",
resources = data.frame(
split = c("train_images", "test_images", "train_list", "test_list", "identity"),
url = c(
"https://huggingface.co/datasets/ProgramComputer/VGGFace2/resolve/main/data/vggface2_train.tar.gz",
"https://huggingface.co/datasets/ProgramComputer/VGGFace2/resolve/main/data/vggface2_test.tar.gz",
"https://huggingface.co/datasets/ProgramComputer/VGGFace2/resolve/main/meta/train_list.txt",
"https://huggingface.co/datasets/ProgramComputer/VGGFace2/resolve/main/meta/test_list.txt",
"https://huggingface.co/datasets/ProgramComputer/VGGFace2/raw/main/meta/identity_meta.csv"
),
md5 = c(
"88813c6b15de58afc8fa75ea83361d7f",
"bb7a323824d1004e14e00c23974facd3",
"4cfbab4a839163f454d7ecef28b68669",
"d08b10f12bc9889509364ef56d73c621",
"d315386c7e8e166c4f60e27d9cc61acc"
)
),
training_file = "train.rds",
test_file = "test.rds",

initialize = function(
root = tempdir(),
train = TRUE,
transform = NULL,
target_transform = NULL,
download = FALSE
) {
self$root_path <- root
self$transform <- transform
self$target_transform <- target_transform
if (train) {
self$split <- "train"
self$archive_size <- "36 GB"
} else {
self$split <- "test"
self$archive_size <- "2 GB"
}

if (download) {
cli_inform("Dataset {.cls {class(self)[[1]]}} (~{.emph {self$archive_size}}) will be downloaded and processed if not already available.")
self$download()
}

if (!self$check_exists()) {
cli_abort("Dataset not found. You can use `download = TRUE` to download it.")
}

if (train) {
data_file <- self$training_file
} else {
data_file <- self$test_file
}
data <- readRDS(file.path(self$processed_folder, data_file))

self$img_path <- data$img_path
self$labels <- data$labels
self$classes <- data$classes

cli_inform("{.cls {class(self)[[1]]}} dataset loaded with {self$.length()} images across {length(self$classes)} classes.")
},

download = function() {
if (self$check_exists()) {
return()
}

fs::dir_create(self$raw_folder)
fs::dir_create(self$processed_folder)

cli_inform("Downloading {.cls {class(self)[[1]]}}...")

for (i in seq_len(nrow(self$resources))) {
Collaborator:
question: do we need to download 36 GB of data (the train set) when the end user requests `train = FALSE`?
suggestion: save user time and disk space by limiting the download to the requested split.
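A minimal base-R sketch of that suggestion (the helper name and filtering rule are assumptions, not the merged implementation): keep only the requested split's archive and list file, plus the shared identity metadata.

```r
# Mirror of the resources table defined in vggface2_dataset()
# (URLs shortened to placeholders for the sketch)
resources <- data.frame(
  split = c("train_images", "test_images", "train_list", "test_list", "identity"),
  url   = paste0("https://example.org/", c(
    "vggface2_train.tar.gz", "vggface2_test.tar.gz",
    "train_list.txt", "test_list.txt", "identity_meta.csv"
  )),
  stringsAsFactors = FALSE
)

# Keep the rows for the requested split, plus the identity metadata
# that both splits need
needed_resources <- function(resources, split = c("train", "test")) {
  split <- match.arg(split)
  keep <- startsWith(resources$split, split) | resources$split == "identity"
  resources[keep, , drop = FALSE]
}

needed_resources(resources, "test")$split
# "test_images" "test_list" "identity"
```

The download loop would then iterate over `needed_resources(self$resources, self$split)` instead of the full table, skipping the 36 GB train archive when `train = FALSE`.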

row <- self$resources[i, ]
archive <- download_and_cache(row$url, prefix = row$split)
Collaborator:
todo: could we prepend `class(self)[[1]]` to `row$split` for `prefix =`? (This avoids five different files, hard to identify as belonging to vggface2, spread across the root cache folder.)
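A one-line sketch of the suggested prefix (the helper is hypothetical, shown only to illustrate the naming):

```r
# Prepend the dataset class name so cached files group together in the
# cache folder, e.g. "vggface2-train_images" instead of a bare "train_images"
prefix_for <- function(class_name, split) paste0(class_name, "-", split)
prefix_for("vggface2", "train_images")
# "vggface2-train_images"
```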

if (tools::md5sum(archive) != row$md5) {
cli_abort("Corrupt file! Delete the file at {.path {archive}} and try again.")
}
if (tools::file_ext(row$url) == "gz") {
utils::untar(archive, exdir = self$raw_folder)
} else {
fs::file_move(archive, self$raw_folder)
}
}

identity_file <- file.path(self$raw_folder, "identity_meta.csv")
identity_df <- read.csv(identity_file, sep = ",", stringsAsFactors = FALSE, strip.white = TRUE)
identity_df$Class_ID <- trimws(identity_df$Class_ID)
identity_df$Gender <- factor(identity_df$Gender, levels = c("f", "m"), labels = c("Female", "Male"))

for (split in c("train", "test")) {
if (split == "train") {
list_file <- file.path(self$raw_folder, "train_list.txt")
} else {
list_file <- file.path(self$raw_folder, "test_list.txt")
}

split_df <- read.delim(
list_file,
sep = "/",
col.names = c("Class_ID", "img_path"),
header = FALSE,
stringsAsFactors = FALSE
)

merged_df <- merge(split_df, identity_df, by = "Class_ID", all.x = TRUE)
merged_df$Label <- as.integer(factor(merged_df$Class_ID, levels = unique(merged_df$Class_ID)))
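As a side note on the idiom above: `factor(x, levels = unique(x))` numbers the identities in order of first appearance, so the labels are stable integers 1..N. A self-contained illustration (the class IDs are made up):

```r
class_ids <- c("n000002", "n000002", "n000010", "n000002", "n000005")

# levels = unique(...) fixes the level order to first appearance,
# so the first identity seen always maps to label 1
labels <- as.integer(factor(class_ids, levels = unique(class_ids)))
labels
# 1 1 2 1 3
```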

saveRDS(
merged_df,
file.path(self$processed_folder, paste0(split, ".rds"))
)
}
cli_inform("Dataset {.cls {class(self)[[1]]}} downloaded and extracted successfully.")
},

check_exists = function() {
fs::file_exists(file.path(self$processed_folder, self$training_file)) &&
fs::file_exists(file.path(self$processed_folder, self$test_file))
},

.getitem = function(index) {
x <- jpeg::readJPEG(self$img_path[index])
y <- self$labels[index]

if (!is.null(self$transform)) {
x <- self$transform(x)
}
if (!is.null(self$target_transform)) {
y <- self$target_transform(y)
}
list(x = x, y = y)
Collaborator:
todo: please add the segmentation class to the output.
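A hedged sketch of one way to honor that request (the `class` field name and lookup shape are assumptions; `ds$classes` is the named list documented in the roxygen block above):

```r
# Stand-ins for an item and the classes metadata held by the dataset
item <- list(x = array(0, dim = c(2, 2, 3)), y = 1L)
classes <- list(list(name = "Some Person", gender = "Female"))

# Attach the human-readable identity alongside x and y
item$class <- classes[[item$y]]$name
item$class
# "Some Person"
```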

},

.length = function() {
length(self$img_path)
},

active = list(
raw_folder = function() {
file.path(self$root_path, "vggface2", "raw")
},
processed_folder = function() {
file.path(self$root_path, "vggface2", "processed")
}
)
)
1 change: 1 addition & 0 deletions man/oxfordiiitpet_segmentation_dataset.Rd


1 change: 1 addition & 0 deletions man/pascal_voc_datasets.Rd


68 changes: 68 additions & 0 deletions man/vggface2_dataset.Rd


32 changes: 32 additions & 0 deletions tests/testthat/test-dataset-vggface2.R
Collaborator:
todo (missing): as there may be an impact on `draw_segmentation_mask()`, could we add a check in one of the tests for drawing the segmentation mask of the item?

@@ -0,0 +1,32 @@
context('dataset-vggface2')

t <- withr::local_tempdir()
options(timeout = 60000)

test_that("VGGFace2 dataset works correctly for train split", {

skip_if(Sys.getenv("TEST_LARGE_DATASETS", unset = 0) != 1,
"Skipping test: set TEST_LARGE_DATASETS=1 to enable tests requiring large downloads.")

vgg <- vggface2_dataset(root = t, download = TRUE)
expect_length(vgg, 3141890)
first_item <- vgg[1]
expect_named(first_item, c("x", "y"))
expect_type(first_item$x, "double")
expect_type(first_item$y, "integer")
expect_equal(first_item$y, 1)
})

test_that("VGGFace2 dataset works correctly for test split", {

skip_if(Sys.getenv("TEST_LARGE_DATASETS", unset = 0) != 1,
"Skipping test: set TEST_LARGE_DATASETS=1 to enable tests requiring large downloads.")

vgg <- vggface2_dataset(root = t, train = FALSE)
expect_length(vgg, 169396)
first_item <- vgg[1]
expect_named(first_item, c("x", "y"))
expect_type(first_item$x, "double")
Collaborator:
todo (missing): as you know the first item, it would be nice to add a check on `dim()` of the `first_item$x` object. This gives folks a hint of the image size.
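A sketch of such a check with a synthetic array (the real VGGFace2 item dimensions are not filled in here; `jpeg::readJPEG()` returns an H x W x 3 array):

```r
x <- array(runif(4 * 6 * 3), dim = c(4, 6, 3))  # stand-in for first_item$x

# An RGB image array should have three dimensions, the last of size 3
stopifnot(length(dim(x)) == 3, dim(x)[3] == 3)
dim(x)
# 4 6 3
```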

expect_type(first_item$y, "integer")
expect_equal(first_item$y, 1)
})