
serializedSize(): Does it allocate memory? #126

@HenrikBengtsson

Description

Background

Over at futureverse/future#760, @fproske reports that serializedSize() consumes a lot of memory. They detected this because their containers/VMs were getting killed by OOM after upgrading to a future version that relies on serializedSize().

I think they used the profvis package to show that roughly 100 MB of memory is allocated by serializedSize(). Indeed, if I run something like:

prof <- profvis::profvis({
  for (kk in 1:1e6) parallelly::serializedSize(NULL)
})

I see lots of memory being reported, e.g.

Code                                                File    Memory (MB): deallocated / allocated
parallelly::serializedSize                          <expr>  -4246.6 / 4492.0
for (kk in 1:1e6) parallelly::serializedSize(NULL)  <expr>   -445.3 / 221.1

but that looks odd to me.

Troubleshooting

I'm not sure how this happens, but it could be that the internal serialization code of R that we rely on materializes each intermediate object, which we never make use of; we are only interested in the byte count. Our code is in https://github.com/futureverse/parallelly/blob/develop/src/calc-serialized-size.c.

It could also be that something else is going on here. To better inspect the memory allocations, I went low-level with base::Rprof(), which profvis uses internally. With this, I get:

library(parallelly)
R <- 1e7  ## results below are for R = 1e5, 1e6, and 1e7

ns <- c(0, 1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7)
data <- data.frame(n = ns, size = double(length(ns)), bytes_per_call = double(length(ns)))

for (kk in seq_len(nrow(data))) {
  n <- data$n[kk]
  x <- rnorm(n)
  size <- object.size(x)
  message(sprintf("Object size: %d bytes", size))
  data[kk, "size"] <- size

  Rprof(memory.profiling = TRUE)
  for (rr in 1:R) { serializedSize(x) }
  Rprof(NULL)
  prof <- summaryRprof(memory = "both")
  mem_avg <- prof$by.total[['"serializedSize"', "mem.total"]] * 1024^2 / R
  data[kk, "bytes_per_call"] <- mem_avg
}

print(data)

With R = 1e5, I get:

      n     size bytes_per_call
1 0e+00       48       907.0182
2 1e+00       56       717.2260
3 1e+02      848       959.4470
4 1e+03     8048       761.2662
5 1e+04    80048      2690.6460
6 1e+05   800048      2552.2340
7 1e+06  8000048      2403.3362
8 1e+07 80000048      2819.6209

With R = 1e6, I get:

> data
      n     size bytes_per_call
1 0e+00       48      3794.3771
2 1e+00       56      4143.5529
3 1e+02      848      2741.6068
4 1e+03     8048      2570.6889
5 1e+04    80048      1286.8125
6 1e+05   800048       298.8442
7 1e+06  8000048       294.0207
8 1e+07 80000048      2794.4550

With R = 1e7, I get:

      n     size bytes_per_call
1 0e+00       48       1587.072
2 1e+00       56       1433.246
3 1e+02      848       1556.359
4 1e+03     8048       1615.164
5 1e+04    80048       2918.145
6 1e+05   800048       1345.313
7 1e+06  8000048       1154.765
8 1e+07 80000048       5260.957

I'm not sure what to make of this, because it says that only 2-5 kB is allocated per serializedSize() call, regardless of the size of the object being sized.
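
To cross-check this with exact allocation logging, something like the following could be used. This is only an untested sketch that assumes the profmem package is installed and R was built with memory profiling support (capabilities("profmem")); serialize(x, connection = NULL) serves as a baseline that definitely materializes the whole serialized object as a raw vector:

library(profmem)
library(parallelly)

x <- rnorm(1e6)   ## roughly 8 MB of doubles

## Allocations made by a single serializedSize() call
p1 <- profmem(serializedSize(x))
total(p1)

## Baseline: serialize() to NULL materializes the full serialized raw vector
p2 <- profmem({ bytes <- serialize(x, connection = NULL) })
total(p2)

If the serializedSize() allocations stay flat while the serialize() baseline grows with the object, that would confirm the internals do not materialize the serialized bytes.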

@coolbutuseless, as an expert on serialization and the one who came up with serializedSize(), do you know if the internals materialize the different objects as they are being serialized? If so, do you know if the R API allows us to avoid that? For instance, if I use:

con <- file(nullfile(), open = "wb")
void <- serialize(x, connection = con)
close(con)

I think the objects being serialized are immediately streamed to the null file, avoiding any materialization in memory. I wonder if that strategy could be used in serializedSize().
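
For comparison, a hypothetical R-level variant of that strategy (the helper name serializedSizeViaFile below is made up) would stream to a temporary file and read the byte count off the file size, so the serialized representation is never held in memory as a raw vector. It is only a sketch, since it trades memory for disk I/O:

## Hypothetical helper: stream the serialization to a temporary file and
## read the byte count from the file size, so the serialized bytes are
## never held in memory as a raw vector.
serializedSizeViaFile <- function(x) {
  tf <- tempfile()
  con <- file(tf, open = "wb")
  serialize(x, connection = con)
  close(con)
  size <- file.size(tf)
  unlink(tf)
  size
}

serializedSizeViaFile(rnorm(1e6))  ## roughly 8e6 bytes plus a small serialization header

Whether this actually avoids the per-call allocations seen above would of course need to be profiled the same way.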
