Skip to content

NVMe devices don't tolerate migration #981

@iximeow

Description

@iximeow

there are at least two bugs here, one much more obvious than the other.. I was trying to tickle a different but I suspect exists, but have been stymied by these for the time being.

I got here via ./propolis-standalone -s ubuntu.22.04.5.toml, ctrl+c, and ./propolis-standalone -r <snapshot_tar> with a file-backed disk and a mem-async disk, but the disk types shouldn't really matter (both are wedged)

issue one is that NVMe devices now have an is_enabled to avoid locking NvmeCtrl to read ctrl.cc on hot paths. that's great, except when we import a device we don't peel out is_enabled again, so the imported controller has cc declaring it's enabled, but is functionally disabled. attempting to submit I/O (when run via propolis-standalone) then gets you nvme reg r/w failure, error: the submission queue specified (4) is invalid, register: IOQueueDoorBells, offset: 4128 in ring_doorbell.

adding a

if ctrl.ctrl.cc.enabled() {
    self.is_enabled.store(true, Ordering::Releae);
}

right after

ctrl.import(input, self)?;

gets you past that and onto the second issue, which I've not debugged at all: I/Os seem to just not get acked (maybe not serviced?) on a migrated controller. it may be relevant that I've also got a local patch for const MAX_NUM_QUEUES: usize = 5; but I've not tested. we'll see how cc43ca0 fares!

edit/ps: #966 didn't trip over this because we use state.ctrl.cc.enabled() on the admin SQ/CQ doorbells, but self.is_enabled on the I/O doorbells, and the rudimentary driver there doesn't actually do I/O. so.. probably worth making it do some I/Os..

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething that isn't working.migrationIssues related to live migration.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions