block/file: do pread/pwrite from Propolis heap instead of VM memory #985
Conversation
```rust
// worker threads for all file-backed disks, times this threshold. It works out
// to up to 128 MiB (8 worker threads) of buffers per disk by default as of
// writing.
const MAPPING_IO_LIMIT_BYTES: usize = 16 * crate::common::MB;
```
with a Linux guest, dtrace says this is a wildly large limit: when I have fio doing 16 MiB I/Os, that turns into block_begin_write firing with 256 KiB of data to move (even though the controller tells the guest mdts=0 -> "no limit" on data transfer size). Presumably Linux is breaking up the I/Os before making NVMe commands, but I don't plan on looking too closely.
As described in the comment on `MAPPING_IO_LIMIT_BYTES`, this is a gross
hack: if VM memory is used as the iovec for `p{read,write}v` to a block
device or zvol, we'll end up contending on `svmd->svmd_lock` pretty
heavily in higher IOPS situations.
This might arise when doing fewer, larger I/Os to zvols or block devices
as well, since we'll be forwarding each of the 4k entries in guest PRP
lists as a distinct iovec, but we haven't tested that.
Either way, operating from Propolis heap avoids the issue for now. One
could imagine this "should" be a configurable behavior on file block
backends, since many files (like disk images in propolis-standalone)
would not need this. The hope is that we can fix this in the kernel so
the hack is no longer needed for zvols or block devices, rather than
leaning further into properly supporting this copy hack.
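The bounce-buffer idea can be sketched in a few lines. Everything below is a stand-in for illustration: the `write_via_heap` name is made up, and `std::os::unix::fs::FileExt::write_at` stands in for the raw `pwritev` call in the real backend.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

// Rough sketch of the bounce-buffer write path: gather the guest regions into
// one Propolis-heap buffer, then issue a single positioned write from that
// buffer instead of handing VM memory to the kernel as iovecs.
fn write_via_heap(file: &File, regions: &[&[u8]], offset: u64) -> io::Result<usize> {
    // Bounce: copy every guest region into our own heap allocation.
    let mut buf = Vec::with_capacity(regions.iter().map(|r| r.len()).sum());
    for region in regions {
        buf.extend_from_slice(region);
    }
    // The kernel now only ever touches Propolis' heap, not the guest mapping,
    // so zvol/block-device writes don't contend on the segvmm lock.
    file.write_at(&buf, offset)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("propolis_bounce_sketch.bin");
    let file = File::create(&path)?;
    let wrote = write_via_heap(&file, &[b"abc".as_slice(), b"def".as_slice()], 0)?;
    assert_eq!(wrote, 6);
    assert_eq!(std::fs::read(&path)?, b"abcdef");
    std::fs::remove_file(&path)?;
    Ok(())
}
```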
Force-pushed c1bfd48 to 62fba95
Interesting, this sure seems like it should be worse than actually writing straight into the guest address space, but the contention making that worse makes sense. I agree that fixing the kernel seems like a better long term solution.
```rust
// Beyond this, either fall back to using iovecs directly (and trust the kernel
// to handle bizarre vecs) or error. This is an especially gross hack because we
```
"bizarre" here just meaning "in excess of 16 megabytes"?
worse: imagine a list of iovec where the ptr/len pair is just repeated 10000 times. if we did range coalescing in Propolis (i should do an issue about that actually), an evil driver could submit a write using 1MB (256 pages) but repeat that several hundred times to produce GiB+ writes kinda easily
actually since IOV_MAX is just 1024 this is less of a problem than i'd thought (... so we really ought to set MDTS to 10 lol)
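A quick sanity check on that arithmetic. The numbers here are assumptions for illustration (4 KiB pages and the common IOV_MAX of 1024), not values read out of Propolis:

```rust
// Back-of-envelope check for the IOV_MAX observation above.
const PAGE_SIZE: usize = 4096; // assumed guest/controller page size
const IOV_MAX: usize = 1024; // common limit on iovec entries per call

fn main() {
    // Worst case for a single preadv/pwritev: every iovec entry is one page.
    let max_syscall_bytes = IOV_MAX * PAGE_SIZE;
    assert_eq!(max_syscall_bytes, 4 * 1024 * 1024); // 4 MiB

    // NVMe's MDTS is a power of two of the controller's minimum page size
    // (CAP.MPSMIN). MDTS = 10 with 4 KiB pages advertises the same 4 MiB
    // ceiling, which is why "set MDTS to 10" lines up with IOV_MAX here.
    let mdts = 10u32;
    let mdts_bytes = (1usize << mdts) * PAGE_SIZE;
    assert_eq!(mdts_bytes, max_syscall_bytes);
}
```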
> worse: imagine a list of iovec where the ptr/len pair is just repeated 10000 times. if we did range coalescing in Propolis (i should do an issue about that actually), an evil driver could submit a write using 1MB (256 pages) but repeat that several hundred times to produce GiB+ writes kinda easily
oh, that's ghastly. ew. okay.
```rust
// If we're motivated to avoid the zero-fill via
// `Layout::with_size_align` + `GlobalAlloc::alloc`, we should
// probably avoid this gross hack entirely (see comment on
// MAPPING_IO_LIMIT_BYTES).
```
fair!
```rust
let read = unsafe {
    libc::preadv(fd, iov.as_ptr(), iov.len() as libc::c_int, offset)
};
if read == -1 {
    return Err(Error::last_os_error());
}
let read: usize = read.try_into().expect("read is positive");
```
i thought we had someplace a thingy for doing a libc call and checking errno if the return value is -1, but after looking closer, it looks like this is just for ioctl; i wonder if it's worth having something like that for other libc functions like preadv (since we've got a bunch of similar things).
i'm imagining a thingy like
```rust
fn syscall_result(x: isize) -> Result<usize, io::Error> {
    usize::try_from(x).map_err(|_| io::Error::last_os_error())
}
```

not a blocker, naturally!
lib/propolis/src/vmm/mem.rs (outdated)
```rust
// Sketchy: nothing guarantees the guest is not concurrently
// modifying this memory. Also, nothing prevents the guest from
// submitting the same ranges as part of another I/O where we
// might be doing this same operation on another thread on the
// same ptr/len pair. Because we are just moving u8's, it should
// be OK.
```
um.
okay, so. one, regardless of the fact that we are "just moving u8s", i think that technically rustc is permitted to emit nasal demons in this case unless we use atomic memcpy, which doesn't exist yet. but, since the other mutable accesses to this region would all be coming from the guest and thus operating on accesses rustc doesn't know are aliased, i can't imagine it being able to figure out that it's allowed to nasal demons us. so. sure, whatever, i guess it's fine.
with that said, of course, this would arbitrarily scramble the guest's I/O, but if they submit the same region as part of multiple iovecs, they deserve it...
> okay, so. one, regardless of the fact that we are "just moving u8s", i think that technically rustc is permitted to emit nasal demons in this case unless we use rust-lang/rust#58599

What if you `unsafe { libc::memcpy(...`?
oh, this should be std::ptr::copy_nonoverlapping actually. i forgot that exists at first.
alright, reworking this in terms of copy_nonoverlapping nudged me towards adjusting the copy both in read and write a little more substantially, but i like where it ended up a lot better.
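Roughly, the shape of such a copy looks like the following. This is a hypothetical sketch with made-up names, not the actual Propolis code:

```rust
use std::ptr;

// Hypothetical sketch: copy `len` bytes out of a guest mapping into a
// Propolis-heap buffer. The guest may be racing us, but `copy_nonoverlapping`
// only requires that source and destination not alias each other, which holds
// because the destination is our own freshly allocated Vec.
unsafe fn copy_from_guest(src: *const u8, len: usize) -> Vec<u8> {
    let mut buf = vec![0u8; len];
    // Safety (assumed by the caller): `src` is valid for `len` bytes of reads
    // and does not overlap `buf`.
    ptr::copy_nonoverlapping(src, buf.as_mut_ptr(), len);
    buf
}

fn main() {
    // Stand-in for a guest mapping: any readable region works for the sketch.
    let guest = [0xabu8; 16];
    let copied = unsafe { copy_from_guest(guest.as_ptr(), guest.len()) };
    assert_eq!(copied, vec![0xab; 16]);
}
```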
```rust
// Gross hack alert: since the mappings below are memory regions backed by
// segvmm_ops, `zvol_{read,write}` and similar will end up contending on
// `svmd->svmd_lock`. Instead, as long as the I/O is small enough we'll tolerate
// it, copy from guest memory to Propolis heap. The segment backing Propolis'
// heap has an `as_page{,un}lock` impl that avoids the more
// expensive/contentious `as_fault()` fallback.
```
wow, i hate this a lot, but your rationale makes sense. i'll let you have it. :)
hawkw left a comment
great, yeah, i think the use of copy_nonoverlapping is legit here. couple little nits but no blocking worries on my end.
```rust
total_capacity.checked_add(mapping.len).ok_or_else(|| {
    Error::new(
        ErrorKind::InvalidInput,
        "Total mapping too large",
```
nit, take it or leave it: perhaps
```diff
-        "Total mapping too large",
+        "Total length of all mappings would overflow a `usize`",
```
or something --- seeing this error would make me go "too large for what?"
```rust
if read == -1 {
    return Err(Error::last_os_error());
}
let read: usize = read.try_into().expect("read is positive");
```
nit, take it or leave it: we could, perhaps, avoid the panic here if we instead wrote:
```diff
 if read == -1 {
     return Err(Error::last_os_error());
 }
-let read: usize = read.try_into().expect("read is positive");
+let read = match usize::try_from(read) {
+    Ok(read) => read,
+    Err(_) => return Err(Error::last_os_error()),
+};
```
this way, we don't branch on the value of `read` twice, which... is maybe nice. it does mean we don't panic if `preadv` decides to return -420 or some shit, but... if we care about that, we could `assert_eq!(read, -1)` in the error path, i guess?
UPDATE: oh, i also said a similarish thing last week (https://github.com/oxidecomputer/propolis/pull/985/files#r2612098032); time is a flat circle.
I like that suggestion, with a caveat: this is basically `let read = usize::try_from(read).map_err(|_| Error::last_os_error())?;`, isn't it?
Actually... thinking about this a bit more: perhaps we should avoid direct use of libc to whatever extent we can?
nix has a wrapper for preadv that does more sensible error handling and returns a Result over usize, which is exactly what we want. Similarly, I saw @iliana recommend rustix the other day, which does the same.
On a meta note, I'm not sure which crate is best here, but perhaps company-wide we should at least express a soft preference for something slightly higher-level than libc?
```rust
// example), but `copy_nonoverlapping` has no compunctions about
// concurrent mutation.
```
perhaps worth noting explicitly that if a guest does decide it wants to screw up its own iovec for some reason, it would be pretty damn hard for LLVM to figure out a way to miscompile us in that case?
We could also drop down to something like the volatile copy intrinsic, which is sadly not exposed in the standard library (presumably via core) as intrinsics are unstable, but that would require either nightly or shenanigans. I proposed something in rust-lang/rfcs#2728 a long while back, but encountered a lot of resistance and ran out of steam to push that change forward.
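For illustration, a stable-Rust stand-in for that unexposed intrinsic can be hand-rolled with `read_volatile`/`write_volatile`. This is a sketch of the idea only, not a recommendation: byte-at-a-time volatile copies are much slower than `copy_nonoverlapping`, and `volatile_copy` is a made-up name:

```rust
use std::ptr;

// Sketch: copy `len` bytes with volatile loads and stores, so the compiler
// cannot assume the source is unchanging or elide/merge the accesses.
unsafe fn volatile_copy(src: *const u8, dst: *mut u8, len: usize) {
    for i in 0..len {
        ptr::write_volatile(dst.add(i), ptr::read_volatile(src.add(i)));
    }
}

fn main() {
    let src = [7u8; 8];
    let mut dst = [0u8; 8];
    // Safety: both regions are valid for 8 bytes and do not overlap.
    unsafe { volatile_copy(src.as_ptr(), dst.as_mut_ptr(), dst.len()) };
    assert_eq!(src, dst);
}
```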
```rust
total_capacity.checked_add(mapping.len).ok_or_else(|| {
    Error::new(
        ErrorKind::InvalidInput,
        "Total mapping too large",
```
similarly, let's maybe say something like
```diff
-        "Total mapping too large",
+        "Total length of all mappings would overflow a `usize`",
```
```rust
let mut total_capacity: usize = 0;
for mapping in self.as_ref().iter() {
```
It strikes me that this is basically a hand-written version of `Iterator::try_reduce`, which is sadly nightly-only still. However, its close relative, `try_fold`, is stable. Perhaps this might be phrased as:
```rust
let total_capacity = self.as_ref()
    .iter()
    .try_fold(0usize, |total, mapping| total.checked_add(mapping.len))
    .ok_or_else(|| {
        Error::new(
            ErrorKind::InvalidInput,
            "sum of mapping lengths overflows usize",
        )
    })?;
```
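For what it's worth, that shape runs as written. A self-contained version with a made-up stand-in `Mapping` type (the real code iterates Propolis' own mapping type), showing it behaves like the hand-written loop, including on overflow:

```rust
use std::io::{Error, ErrorKind};

// Stand-in type so the suggestion above runs on its own.
struct Mapping {
    len: usize,
}

fn total_capacity(mappings: &[Mapping]) -> Result<usize, Error> {
    mappings
        .iter()
        // checked_add returns Option, so try_fold short-circuits to None on
        // the first overflow, which ok_or_else turns into an error.
        .try_fold(0usize, |total, mapping| total.checked_add(mapping.len))
        .ok_or_else(|| {
            Error::new(ErrorKind::InvalidInput, "sum of mapping lengths overflows usize")
        })
}

fn main() {
    let ok = [Mapping { len: 4096 }, Mapping { len: 8192 }];
    assert_eq!(total_capacity(&ok).unwrap(), 12288);

    let too_big = [Mapping { len: usize::MAX }, Mapping { len: 1 }];
    assert!(total_capacity(&too_big).is_err());
}
```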
For good measure, I did some testing on a `propolis-standalone` with this locally. Of the 1% time in user code when doing large I/Os, a `profile-1003` only turned up three stacks here, all in the malloc/free for the buffers.