
Conversation

@davepacheco
Collaborator

This PR adds a more first-class file archival mechanism inside the debug collector within sled agent. The reason I did this is that in the past when I wanted to modify the files that Sled Agent collects, I found it tricky to do because:

  • it's hard to add support for collecting new kinds of files without touching the code related to collecting other files
  • that code combines both decision-making and execution logic, which makes it tricky to change
  • the tests for it are good end-to-end tests but very coarse

Altogether, even a pretty small change essentially required testing in a real deployment, which is a much slower dev workflow than it needs to be (and it would be easy to break this code without breaking CI).

This is coming up because I'm planning to implement RFD 613 Debug Dropbox shortly.


After this PR, there's a new file_archiver module:

  • There's a list of rules (in the rules submodule) that describe what files to collect. I hope it's easy to add new things to this.
  • Archival is now implemented with the plan-execute pattern. Nearly all of the behavior is in the planner so that it can be exhaustively tested without actually constructing directory trees. (There's still an end-to-end smoke test that verifies what happens with real files on disk.) A rough sketch of this shape follows the list.
  • There's a pretty comprehensive test suite based on a list of paths found on real systems (from dogfood). It includes checks that the test dataset covers all the different rules, so that if we extend the rules but forget to update the test data, the tests will fail.
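To make the shape of this more concrete, here's a rough Rust sketch of the rules-plus-planner idea. Everything here is hypothetical (none of these type or function names come from the actual file_archiver module); it's only meant to illustrate why a pure planner is easy to test: it maps candidate paths to planned steps without touching the filesystem.

```rust
// Hypothetical sketch only: illustrative names, not the real file_archiver API.
use std::path::{Path, PathBuf};

/// A rule describing one kind of file to archive.
struct ArchiveRule {
    /// Human-readable label, e.g. "rotated log files".
    label: &'static str,
    /// Returns true if this rule applies to the given source path.
    matches: fn(&str) -> bool,
}

/// One planned archival action, produced by the planner and (in a real
/// implementation) carried out by a separate executor.
struct ArchiveStep {
    source: PathBuf,
    destination: PathBuf,
}

/// Pure planning step: decides what to archive based only on the candidate
/// paths, so it can be tested exhaustively without building directory trees.
fn plan(rules: &[ArchiveRule], candidates: &[&str], debug_dir: &Path) -> Vec<ArchiveStep> {
    let mut steps = Vec::new();
    for path in candidates {
        if rules.iter().any(|rule| (rule.matches)(path)) {
            let source = PathBuf::from(*path);
            let destination =
                debug_dir.join(source.file_name().expect("candidate has a file name"));
            steps.push(ArchiveStep { source, destination });
        }
    }
    steps
}

/// Example rule: match rotated log files by suffix.
fn is_rotated_log(path: &str) -> bool {
    path.ends_with(".log.0")
}

fn main() {
    let rules = [ArchiveRule { label: "rotated log files", matches: is_rotated_log }];
    let steps = plan(
        &rules,
        &["/var/svc/log/oxide-sled-agent:default.log.0", "/tmp/unrelated"],
        Path::new("/pool/ext/debug"),
    );
    // Only the rotated log file matches the one rule defined above.
    println!("rule {:?} produced {} step(s)", rules[0].label, steps.len());
    for step in &steps {
        println!("would copy {} -> {}", step.source.display(), step.destination.display());
    }
}
```

A unit test can then call plan() with a canned list of paths and assert directly on the returned steps, which is the property the exhaustive planner tests described above rely on.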

It's arguably overengineered at this point but I'm hopeful that this will make it a lot easier to augment the set of files that get archived in this way.

As a first step, I tried to preserve the existing behavior as much as possible. There are several oddities that we might want to fix in follow-up work:

  • There's no debouncing, so the system could try to archive a rotated log file, core file, or crash dump while it's still being written, losing the original file and leaving only a partial copy in the debug dataset.
  • I need to double-check this, but it looks like the existing implementation overwrites existing crash dumps and core files. That would be a bug, though not a huge deal for core files because their names are pretty distinctive (they include pids, execnames, and zonenames), but it seems likely to result in at most one crash dump ever being kept per sled, since they'll all be called vmdump.0.
  • Rotated log files include their original mtime as a Unix timestamp in the filename, but if a file with that name already exists, the timestamp is incremented until we find a name that isn't taken. This is a little weird because the stored value is close to the mtime but not actually the mtime. (One behavior change in this PR: we check at most 30 candidate filenames; after that, we give up.) A sketch of this behavior follows the list.
  • Live log files wind up being named something.mtime rather than something.log.mtime, the way rotated log files are. This isn't a huge deal, but it does break oxlog (oxlog does not find archived live log files #9271). I'm not sure what we should do here: we could use the same convention, but then we'd lose the distinction between live and rotated log files, and I'm not sure whether that matters.
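As an aside on the third bullet, here's a minimal sketch of that name-disambiguation behavior (hypothetical names; the real code differs): append the file's mtime as a Unix timestamp, bump the timestamp on collision, and give up after a bounded number of attempts.

```rust
// Minimal sketch of the timestamp-suffix disambiguation described above.
// Names are hypothetical; this is not the actual implementation.
use std::path::{Path, PathBuf};

/// Maximum number of candidate filenames to try before giving up
/// (this PR bounds the search at 30 attempts).
const MAX_ATTEMPTS: u64 = 30;

/// Pick a destination like "foo.log.<timestamp>" that doesn't already exist,
/// incrementing the timestamp on each collision.
fn choose_destination(debug_dir: &Path, base_name: &str, mtime_secs: u64) -> Option<PathBuf> {
    for offset in 0..MAX_ATTEMPTS {
        let candidate = debug_dir.join(format!("{base_name}.{}", mtime_secs + offset));
        if !candidate.exists() {
            return Some(candidate);
        }
    }
    // All candidate names were taken; the caller skips archiving this file.
    None
}

fn main() {
    let dest = choose_destination(Path::new("/pool/ext/debug"), "mg-ddm.log", 1_700_000_000);
    println!("{dest:?}");
}
```

This is why the suffix ends up close to, but not necessarily equal to, the file's actual mtime.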

@davepacheco marked this pull request as draft December 20, 2025 04:37
@davepacheco
Collaborator Author

Marking this draft because I still want to do some testing on a4x2 or a racklette.
