201 changes: 176 additions & 25 deletions lib/propolis/src/vmm/mem.rs
@@ -754,6 +754,28 @@ pub trait MappingExt {
fn pwritev(&self, fd: RawFd, offset: i64) -> Result<usize>;
}

// Gross hack alert: since the mappings below are memory regions backed by
// segvmm_ops, `zvol_{read,write}` and similar will end up contending on
// `svmd->svmd_lock`. Instead, as long as the I/O is small enough that we'll
// tolerate it, copy between guest memory and the Propolis heap. The segment
// backing Propolis' heap has an `as_page{,un}lock` impl that avoids the more
// expensive/contentious `as_fault()` fallback.
Comment on lines +757 to +762

Member:

wow, i hate this a lot, but your rationale makes sense. i'll let you have it. :)

//
// This is an optimization until stlouis#871 can get things sorted, at
// which point it should be strictly worse than directly using the
// requested mappings.
//
// Beyond this, either fall back to using iovecs directly (and trust the kernel
// to handle bizarre vecs) or error. This is an especially gross hack because we
Comment on lines +768 to +769

Member:

"bizarre" here just meaning "in excess of 16 megabytes"?

Member Author:

worse: imagine a list of iovec where the ptr/len pair is just repeated 10000 times. if we did range coalescing in Propolis (i should do an issue about that actually), an evil driver could submit a write using 1MB (256 pages) but repeat that several hundred times to produce GiB+ writes kinda easily

Member Author:

actually since IOV_MAX is just 1024 this is less of a problem than i'd thought (... so we really ought to set MDTS to 10 lol)

Member:

> worse: imagine a list of iovec where the ptr/len pair is just repeated 10000 times. if we did range coalescing in Propolis (i should do an issue about that actually), an evil driver could submit a write using 1MB (256 pages) but repeat that several hundred times to produce GiB+ writes kinda easily

oh, that's ghastly. ew. okay.
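
A back-of-the-envelope sketch of the amplification concern in this thread (the numbers are illustrative assumptions, not limits taken from this PR): even with IOV_MAX capping the list at 1024 entries, every entry can alias the same guest buffer, so the transfer the host must honor is the sum of the iov_len fields, not the amount of distinct guest memory involved.

fn main() {
    let iov_max = 1024usize; // typical IOV_MAX on illumos/Linux
    let guest_buf_len = 1024 * 1024; // one 1 MiB guest buffer (256 pages)

    // Each iovec entry points at the same 1 MiB buffer; the host still has to
    // move (and, on the bounce-buffer path, allocate) the sum of all lengths.
    let total_transfer = iov_max * guest_buf_len;
    assert_eq!(total_transfer, 1 << 30); // 1 GiB of host-side work
}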

// could/should just set MDTS, which more naturally communicates to the guest
// the largest I/O we're willing to handle (and gives an out for too-large I/Os)
//
// The amount of memory used for temporary buffers is given by the number of
// worker threads for all file-backed disks, times this threshold. It works out
// to up to 128 MiB (8 worker threads) of buffers per disk by default as of
// writing.
const MAPPING_IO_LIMIT_BYTES: usize = 16 * crate::common::MB;
Member Author (@iximeow), Dec 11, 2025:

with a Linux guest, dtrace says that this is a wildly large limit: when I have fio doing 16M I/Os, that turns into block_begin_write firing with 256kb of data to move (even though the controller tells the guest mdts=0 -> "no limit" on data transfer size). presumably Linux is breaking up the I/Os before making NVMe commands, but I don't plan on looking too closely.
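
As a point of reference for the MDTS discussion above, a hedged sketch of how an MDTS value maps onto a byte limit, assuming the controller reports CAP.MPSMIN as 4 KiB (this helper is illustrative only, not code from this PR):

// MDTS is a power-of-two multiplier of the controller's minimum memory page
// size; a value of 0 means "no limit on data transfer size".
const MPSMIN_BYTES: usize = 4096; // assumed CAP.MPSMIN of 4 KiB

fn mdts_to_byte_limit(mdts: u8) -> Option<usize> {
    if mdts == 0 {
        None
    } else {
        Some(MPSMIN_BYTES << mdts)
    }
}

#[test]
fn mdts_of_12_matches_the_16_mib_limit() {
    assert_eq!(mdts_to_byte_limit(12), Some(16 * 1024 * 1024));
}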


impl<'a, T: AsRef<[SubMapping<'a>]>> MappingExt for T {
fn preadv(&self, fd: RawFd, offset: i64) -> Result<usize> {
if !self
@@ -767,23 +789,91 @@ impl<'a, T: AsRef<[SubMapping<'a>]>> MappingExt for T {
));
}

let iov = self
.as_ref()
.iter()
.map(|mapping| iovec {
iov_base: mapping.ptr.as_ptr() as *mut libc::c_void,
iov_len: mapping.len,
})
.collect::<Vec<_>>();

let read = unsafe {
libc::preadv(fd, iov.as_ptr(), iov.len() as libc::c_int, offset)
};
if read == -1 {
return Err(Error::last_os_error());
let mut total_capacity: usize = 0;
for mapping in self.as_ref().iter() {
Contributor:
It strikes me that this is basically a hand-written version of Iterator::try_reduce, which is sadly nightly-only still. However, its close relative, try_fold, is stable. Perhaps this might be phrased as:

    let total_capacity = self.as_ref()
        .iter()
        .try_fold(0usize, |total, mapping| total.checked_add(mapping.len))
        .ok_or_else(|| {
            Error::new(
                ErrorKind::InvalidInput,
                "sum of mapping lengths overflows usize",
            )
        })?;

total_capacity =
total_capacity.checked_add(mapping.len).ok_or_else(|| {
Error::new(
ErrorKind::InvalidInput,
"Total mapping too large",
Member:

nit, take it or leave it: perhaps

Suggested change
- "Total mapping too large",
+ "Total length of all mappings would overflow a `usize`",

or something --- seeing this error would make me go "too large for what?"

)
})?;
}

Ok(read as usize)
// Gross hack: see the comment on `MAPPING_IO_LIMIT_BYTES`.
if total_capacity <= MAPPING_IO_LIMIT_BYTES {
// If we're motivated to avoid the zero-fill via
// `Layout::from_size_align` + `GlobalAlloc::alloc`, we should
// probably avoid this gross hack entirely (see comment on
// MAPPING_IO_LIMIT_BYTES).
Comment on lines +805 to +808

Member:

fair!
let mut buf = vec![0; total_capacity];

let iov = [iovec {
iov_base: buf.as_mut_ptr() as *mut libc::c_void,
iov_len: buf.len(),
}];

let read = unsafe {
libc::preadv(fd, iov.as_ptr(), iov.len() as libc::c_int, offset)
};
if read == -1 {
return Err(Error::last_os_error());
}
let read: usize = read.try_into().expect("read is positive");
Comment on lines +819 to +822

Member:

nit, take it or leave it: we could, perhaps, avoid the panic here if we instead wrote:

Suggested change
- if read == -1 {
-     return Err(Error::last_os_error());
- }
- let read: usize = read.try_into().expect("read is positive");
+ let read = match usize::try_from(read) {
+     Ok(read) => read,
+     Err(_) => return Err(Error::last_os_error()),
+ };

this way, we don't branch on the value of read twice, which...is maybe nice. it does mean we don't panic if preadv decides to return -420 or some shit, but...if we care about that, we could assert_eq!(read, -1) in the error path, i guess?

Member:

UPDATE: oh i also said a similarish thing last week (https://github.com/oxidecomputer/propolis/pull/985/files#r2612098032); time is a flat circle.

Contributor:

I like that suggestion, with a caveat: this is basically `let read = usize::try_from(read).map_err(|_| Error::last_os_error())?;`, isn't it?

Contributor:

Actually...thinking about this a bit more.... Perhaps we avoid direct use of libc to whatever extent that we can?

nix has a wrapper for preadv that does more sensible error handling and returns a Result over usize, which is exactly what we want. Similarly, I saw @iliana recommend rustix the other day, which does similarly.

On a meta note, I'm not sure which crate is best here, but perhaps, company-wide, we should at least present a soft preference for something that's slightly higher-level than libc?
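
A rough sketch of what the nix route could look like, under the assumption that the crate's uio feature is enabled and that its current signature is preadv(fd: impl AsFd, iov: &mut [IoSliceMut], offset: off_t) -> nix::Result<usize>; the helper name and error conversion below are illustrative, not part of this PR:

use std::io::IoSliceMut;
use std::os::fd::BorrowedFd;

// Hypothetical helper: read into a single bounce buffer via nix instead of
// calling libc::preadv directly. nix returns Result<usize, Errno>, so the
// -1/errno dance and the usize conversion both go away.
fn preadv_into_buf(
    fd: BorrowedFd<'_>,
    buf: &mut [u8],
    offset: libc::off_t,
) -> std::io::Result<usize> {
    let mut iov = [IoSliceMut::new(buf)];
    nix::sys::uio::preadv(fd, &mut iov, offset)
        .map_err(|errno| std::io::Error::from_raw_os_error(errno as i32))
}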


// copy `read` bytes back into the iovecs and return
let mut remaining = &mut buf[..read];
for mapping in self.as_ref().iter() {
let to_copy = std::cmp::min(remaining.len(), mapping.len);

// Safety: guest physical memory is READ|WRITE, and won't become
// unmapped as long as we hold a Mapping, which `self` implies.
//
// The guest may be concurrently modifying this memory, and we may
// be concurrently reading or writing this memory (if it is
// submitted for a read or write on another thread, for
// example), but `copy_nonoverlapping` has no compunctions about
// concurrent mutation.
Comment on lines +835 to +836

Member:

perhaps worth noting explicitly that if a guest does decide it wants to screw up its own iovec for some reason, it would be pretty damn hard for LLVM to figure out a way to miscompile us in that case?

Contributor:

We could also drop down to something like the volatile copy intrinsic, which is sadly not exposed in the standard library (presumably via core) as intrinsics are unstable, but that would require either nightly or shenanigans. I proposed something in rust-lang/rfcs#2728 a long while back, but encountered a lot of resistance and ran out of steam to push that change forward.
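
On stable Rust today, the closest stand-in is a byte-wise loop over read_volatile/write_volatile; a hedged sketch of that idea (slow, and not code from this PR), just to make the "memory may change underneath us" assumption explicit:

use std::ptr;

/// Copy `len` bytes from `src` to `dst` one volatile access at a time.
///
/// Safety: the caller must ensure `src` is valid for `len` reads, `dst` is
/// valid for `len` writes, and that the two regions do not overlap.
unsafe fn volatile_copy_nonoverlapping(src: *const u8, dst: *mut u8, len: usize) {
    for i in 0..len {
        // Volatile accesses tell the compiler not to assume the bytes are
        // stable across reads, which is the property we want for guest memory.
        unsafe {
            let byte = ptr::read_volatile(src.add(i));
            ptr::write_volatile(dst.add(i), byte);
        }
    }
}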

unsafe {
copy_nonoverlapping::<u8>(
remaining.as_ptr(),
mapping.ptr.as_ptr(),
to_copy, // only the bytes actually read; handles a short read
);
}

remaining = remaining.split_at_mut(to_copy).1;

if remaining.len() == 0 {
// Either we're at the last iov and we're finished copying
// back into the guest, or `preadv` did a short read.
break;
}
}

// We should never read more than the guest mappings could hold.
assert_eq!(remaining.len(), 0);

Ok(read)
} else {
let iov = self
.as_ref()
.iter()
.map(|mapping| iovec {
iov_base: mapping.ptr.as_ptr() as *mut libc::c_void,
iov_len: mapping.len,
})
.collect::<Vec<_>>();

let read = unsafe {
libc::preadv(fd, iov.as_ptr(), iov.len() as libc::c_int, offset)
};
if read == -1 {
return Err(Error::last_os_error());
}
let read: usize = read.try_into().expect("read is positive");
Comment on lines +868 to +874

Member:

i thought we had someplace a thingy for doing a libc call and checking errno if the return value is -1, but after looking closer, it looks like this is just for ioctl; i wonder if it's worth having something like that for other libc functions like preadv (since we've got a bunch of similar things).

i'm imagining a thingy like

fn syscall_result(x: i32) -> Result<usize, io::Error> {
    usize::try_from(x).map_err(|_| io::Error::last_os_error())
}

not a blocker, naturally!
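
(A small variation on that sketch, adjusted for the type preadv/pwritev actually return: ssize_t, which the libc crate defines as isize. The helper below is illustrative only, not something this PR adds.)

// Assumes errno is still meaningful whenever the raw return value is negative.
fn syscall_result(ret: isize) -> std::io::Result<usize> {
    usize::try_from(ret).map_err(|_| std::io::Error::last_os_error())
}

// e.g.: let read = syscall_result(unsafe {
//     libc::preadv(fd, iov.as_ptr(), iov.len() as libc::c_int, offset)
// })?;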

Ok(read)
}
}

fn pwritev(&self, fd: RawFd, offset: i64) -> Result<usize> {
@@ -798,18 +888,79 @@ impl<'a, T: AsRef<[SubMapping<'a>]>> MappingExt for T {
));
}

let iov = self
.as_ref()
.iter()
.map(|mapping| iovec {
iov_base: mapping.ptr.as_ptr() as *mut libc::c_void,
iov_len: mapping.len,
})
.collect::<Vec<_>>();
let mut total_capacity: usize = 0;
for mapping in self.as_ref().iter() {
total_capacity =
total_capacity.checked_add(mapping.len).ok_or_else(|| {
Error::new(
ErrorKind::InvalidInput,
"Total mapping too large",
Member:

similarly, let's maybe say something like

Suggested change
- "Total mapping too large",
+ "Total length of all mappings would overflow a `usize`",

)
})?;
}

let written = unsafe {
libc::pwritev(fd, iov.as_ptr(), iov.len() as libc::c_int, offset)
// Gross hack: see the comment on `MAPPING_IO_LIMIT_BYTES`.
let written = if total_capacity <= MAPPING_IO_LIMIT_BYTES {
// If we're motivated to avoid the zero-fill via
// `Layout::from_size_align` + `GlobalAlloc::alloc`, we should
// probably avoid this gross hack entirely (see comment on
// MAPPING_IO_LIMIT_BYTES).
let mut buf = vec![0; total_capacity];

let mut remaining = buf.as_mut_slice();
for mapping in self.as_ref().iter() {
// Safety: guest physical memory is READ|WRITE, and won't become
// unmapped as long as we hold a Mapping, which `self` implies.
//
// The guest may be concurrently modifying this memory, and we may
// be concurrently reading or writing this memory (if it is
// submitted for a read or write on another thread, for
// example), but `copy_nonoverlapping` has no compunctions about
// concurrent mutation.
unsafe {
copy_nonoverlapping::<u8>(
mapping.ptr.as_ptr() as *const u8,
remaining.as_mut_ptr(),
mapping.len,
);
}

remaining = remaining.split_at_mut(mapping.len).1;
}

let iovs = [iovec {
iov_base: buf.as_mut_ptr() as *mut libc::c_void,
iov_len: buf.len(),
}];

unsafe {
libc::pwritev(
fd,
iovs.as_ptr(),
iovs.len() as libc::c_int,
offset,
)
}
} else {
let iovs = self
.as_ref()
.iter()
.map(|mapping| iovec {
iov_base: mapping.ptr.as_ptr() as *mut libc::c_void,
iov_len: mapping.len,
})
.collect::<Vec<_>>();

unsafe {
libc::pwritev(
fd,
iovs.as_ptr(),
iovs.len() as libc::c_int,
offset,
)
}
};

if written == -1 {
return Err(Error::last_os_error());
}