Background
Over at futureverse/future#760, @fproske reports that serializedSize() consumes a lot of memory. They detected this because their containers/VMs are getting killed by OOM after upgrading to a future version that relies on serializedSize().
I think they used the profvis package to show that roughly 100 MB of memory is allocated by serializedSize(). Indeed, if I run something like:
```r
prof <- profvis::profvis({
  for (kk in 1:1e6) parallelly::serializedSize(NULL)
})
```

I see lots of memory being reported, e.g.
| Code | File | Memory (MB) [deallocated/allocated] |
|---|---|---|
| `parallelly::serializedSize` | `<expr>` | -4246.6 / 4492.0 |
| `for (kk in 1:1e6) parallelly::serializedSize(NULL)` | `<expr>` | -445.3 / 221.1 |
but that looks odd to me.
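As a sanity check independent of profvis, one could also look at allocations with the bench package (a rough sketch; it assumes bench is installed and that its memory tracking, which builds on R's allocation profiling, is available on the platform):

```r
## Sketch: have bench::mark() report the memory allocated while
## evaluating the expression; its mem_alloc column gives the total
## R-level allocation it observed for the expression.
bench::mark(
  parallelly::serializedSize(NULL),
  iterations = 1000
)
```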
Troubleshooting
I'm not sure how this happens, but it could be that the internal serialization code of R that we rely on materializes each intermediate object, which we never make use of - we are only interested in the byte counts. Our code is in https://github.com/futureverse/parallelly/blob/develop/src/calc-serialized-size.c.
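For comparison, the naive R-level way to get the size would materialize the full serialized payload as a raw vector, which is exactly the kind of allocation serializedSize() is meant to avoid:

```r
## Naive approach for reference: serialize() with connection = NULL
## returns the complete serialized payload as a raw vector, so sizing
## an ~8 MB numeric vector this way allocates roughly 8 MB just to
## take length() of the result.
x <- rnorm(1e6)
length(serialize(x, connection = NULL))
```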
It could be that something else is going on here. To better inspect the memory allocations, I went low-level with base::Rprof(), which profvis uses internally. With this, I get:
```r
library(parallelly)

R <- 1e7
ns <- c(0, 1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7)
data <- data.frame(n = ns, size = double(length(ns)), bytes_per_call = double(length(ns)))

for (kk in seq_len(nrow(data))) {
  n <- data$n[kk]
  x <- rnorm(n)
  size <- object.size(x)
  message(sprintf("Object size: %d bytes", size))
  data[kk, "size"] <- size
  Rprof(memory.profiling = TRUE)
  for (rr in 1:R) { serializedSize(x) }
  Rprof(NULL)
  prof <- summaryRprof(memory = "both")
  ## Average memory reported for serializedSize(), in bytes per call
  mem_avg <- prof$by.total[['"serializedSize"', "mem.total"]] * 1024^2 / R
  data[kk, "bytes_per_call"] <- mem_avg
}

print(data)
```

With R = 1e5, I get:
n size bytes_per_call
1 0e+00 48 907.0182
2 1e+00 56 717.2260
3 1e+02 848 959.4470
4 1e+03 8048 761.2662
5 1e+04 80048 2690.6460
6 1e+05 800048 2552.2340
7 1e+06 8000048 2403.3362
8 1e+07 80000048 2819.6209
With R = 1e6, I get:
> data
n size bytes_per_call
1 0e+00 48 3794.3771
2 1e+00 56 4143.5529
3 1e+02 848 2741.6068
4 1e+03 8048 2570.6889
5 1e+04 80048 1286.8125
6 1e+05 800048 298.8442
7 1e+06 8000048 294.0207
8 1e+07 80000048 2794.4550
With R = 1e7, I get:
n size bytes_per_call
1 0e+00 48 1587.072
2 1e+00 56 1433.246
3 1e+02 848 1556.359
4 1e+03 8048 1615.164
5 1e+04 80048 2918.145
6 1e+05 800048 1345.313
7 1e+06 8000048 1154.765
8 1e+07 80000048 5260.957
I'm not sure what to make of this, because it says that only a few kilobytes are allocated per serializedSize() call, regardless of the size of the object being sized.
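To dig a bit deeper, one could log individual allocations with utils::Rprofmem() (a sketch; this requires an R build with memory profiling enabled, which the CRAN binaries have):

```r
## Sketch: log each allocation made through R's allocator while calling
## serializedSize() repeatedly on a large object; any sizable per-call
## allocation should show up as large entries in the log.
library(parallelly)
x <- rnorm(1e6)                 # ~8 MB numeric vector
logfile <- tempfile()
utils::Rprofmem(logfile)
for (kk in 1:100) serializedSize(x)
utils::Rprofmem(NULL)
writeLines(head(readLines(logfile), n = 20))
```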
@coolbutuseless, as an expert on serialization and the one who came up with serializedSize(), do you know if the internals materialize the different objects as they are being serialized? If so, do you know if the R API allows us to avoid that? For instance, if I use:
```r
con <- file(nullfile(), open = "wb")
void <- serialize(x, connection = con)
close(con)
```

I think the objects being serialized are immediately streamed to the null file, avoiding any materialization in memory. I wonder if that strategy could be used in serializedSize().
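Just to illustrate that "stream instead of materialize" idea at the R level, here is a rough sketch, where serializedSizeViaFile() is a hypothetical helper that streams to a temporary file and reads back the file size; it is obviously much slower than a pure byte counter, but it never holds the serialized payload in memory:

```r
## Hypothetical helper (illustration only): stream the serialization to
## a temporary file and return the resulting file size in bytes, so no
## large raw vector is ever kept in memory.
serializedSizeViaFile <- function(x) {
  tf <- tempfile()
  on.exit(unlink(tf), add = TRUE)
  con <- file(tf, open = "wb")
  serialize(x, connection = con)
  close(con)
  file.size(tf)
}

serializedSizeViaFile(rnorm(1e6))
```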