-
Notifications
You must be signed in to change notification settings - Fork 177
[real-time] cgroups v2 support implementation #5202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/real-time
Are you sure you want to change the base?
Conversation
| # the unified_cgroup_hierarchy forces cgroupsv1, which is required until pillar is ready to support v2 | ||
| cmdline: "rootdelay=3 linuxkit.unified_cgroup_hierarchy=0" | ||
| # force cgroup v2 | ||
| cmdline: "rootdelay=3 linuxkit.unified_cgroup_hierarchy=1 linuxkit.runc_console=1 linuxkit.runc_debug=1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you want to leave runc debugging and console enabled? Those are good for debugging, but it would be messy if left in permanently.
Also, it doesn't hurt to leave linuxkit.unified_cgroup_hierarchy=1, but you can remove it entirely, as this is the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it doesn't hurt to leave linuxkit.unified_cgroup_hierarchy=1, but you can remove it entirely, as this is the default.
I wanted to make it explicit
Did you want to leave runc debugging and console enabled? Those are good for debugging, but it would be messy if left in permanently.
no and the commit message says that :) . real-time is a temporary branch for RT development
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to make it explicit
No problem there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I just realized you are merging this into the real-time branch; I thought it was going into master.
That begs the question: why is this going into real-time and not into master? We need to get this done, why not do it into master? Then the work for RT is just for RT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized you are merging this into the real-time branch; I thought it was going into master.
same. @rucoder can you change the title of the PR to something like:
[real-time] cgroups v2 support implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized you are merging this into the real-time branch; I thought it was going into master.
same. @rucoder can you change the title of the PR to something like:
[real-time] cgroups v2 support implementation
I asked @OhmSpectator to create a label for this work "rt-wip"
pkg/pillar/build.yml
Outdated
| capabilities: | ||
| - all | ||
| pid: host | ||
| cgroupns: host |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, I don't think this is a valid setting here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deitch yes, it seems bind mount is enough but let's investigate further
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed offline, I think that there is no option for this currently in lkt pkg. It should belong in config, but there is nothing there. If this works, then that is great; if you need something added, then open an issue in lkt and I will take care of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed offline, I think that there is no option for this currently in lkt pkg. It should belong in
config, but there is nothing there. If this works, then that is great; if you need something added, then open an issue in lkt and I will take care of it.
@deitch I think bindNS can be it. you can specify how you bind namespaces under this block but I'm not sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deitch so bind mount is indeed enough and cgroupns: host has not effect , but i do not understand why it works. why container from child group can manipulate system groups? becasue it as capabilities: all so it is root with all possible permissions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. Could be that it causes it to do so. But if it works, it works; take the win. 😄
But be sure to remove the invalid cgroupNS from the config.
| [ -f /persist/agentdebug/watcher/sigusr1 ] && cp /persist/agentdebug/watcher/sigusr1 "$DIR/goroutin-leak-detector-stacks-dump" | ||
| ln -s /run "$DIR/root-run" | ||
| cp -r /sys/fs/cgroup/memory "$DIR/sys-fs-cgroup-memory" >/dev/null 2>&1 | ||
| # Collect cgroup v2 information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the version at the top of the collect-info.sh
| else | ||
| echo "ERROR: cgroup v2 not available - cgroup v1 is no longer supported" | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be easier to read the diffs if you put this check at the top and then avoid the else branch indentation for the actual code (since you exit).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eriknordmark agree. will fix
pkg/pillar/cgroup-v2-migration.md
Outdated
| @@ -0,0 +1,452 @@ | |||
| # cgroup v2 Migration Plan and Analysis | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this file will remain once the migration is done, but in any case, pkg/pillar/docs/ makes more sense as a location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eriknordmark ah, I committed this file by mistake. I'll update docs later and I'll remove this file
pkg/pillar/cgroup_v2_test.go
Outdated
| @@ -0,0 +1,190 @@ | |||
| // Copyright (c) 2024 Zededa, Inc. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect this to run as part of tests? Then it needs to be in a directory with golang code AFAIU
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eriknordmark no, not really. I'll remove it. it was a test during development
europaul
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some small comments
| "time" | ||
|
|
||
| "github.com/containerd/cgroups" | ||
| cgroupsv3 "github.com/containerd/cgroups/v3/cgroup2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to call it cgroup2 or cgroupsv2. I think v3 is really confusing here - I had to look it up to find out why it's called like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@europaul yes, makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this really is confusing, but it is not @rucoder's fault. containerd has v1 and v2 of their config file, both of which apply to containerd v1.x, and then there is v3 of the config, which applies only to containerd v2. And don't ask how cgroups v1 and v2 match to it. 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg/pillar/go.mod
Outdated
| require ( | ||
| github.com/anatol/smart.go v0.0.0-20241126061019-f03d79b340d2 | ||
| github.com/andrewd-zededa/go-libzfs v0.0.0-20240304231806-6a64e99da97d | ||
| github.com/containerd/cgroups v1.1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use github.com/containerd/cgroups/v3/cgroup2/stats here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@europaul sorry, I did not understand the question. this file is auto-generated and which commit you referring to? maybe it is updated later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@europaul ah, got it. I'll check. good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@europaul I updated to v2 stats
pkg/pillar/containerd/containerd.go
Outdated
|
|
||
| "google.golang.org/grpc/connectivity" | ||
|
|
||
| "github.com/armon/circbuf" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a suggestion, maybe we could you the standard library here instead https://pkg.go.dev/container/ring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a suggestion, maybe we could you the standard library here instead https://pkg.go.dev/container/ring
You suggesting a ring buffer adapter and I need a writer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it just like this?
func (r *ringWriter) Write(bs []byte) error {
r.ring.Value = bs
r.ring = r.ring.Next()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@christoph-zededa no, it is not. this will be a list of byte arrays and we canot guaranty anymore the size of final data structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@europaul @christoph-zededa so this is very small dependency, I'm keeping it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, bytes is wrong, it should be runes, especially since people use emojis in log messages ;-)
OhmSpectator
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change is huge. It should come with an extensive test plan to check for regressions in any component.
I also don't understand why we need v2 and a specific RT branch in general. We tested RT before without any extra changes and it was ok.
| event := cgroups.MemoryPressureEvent(cgroups.MediumPressure, | ||
| cgroups.HierarchyMode) | ||
| efd, err := controller.RegisterMemoryEvent(event) | ||
| memoryEventsFile := "/sys/fs/cgroup/eve/memory.events" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new implementation is not 1-to-1 with the old logic... Should check it extra carefully.
| // @minGrowthMemAbs and certain fraction from the last | ||
| // reclaim, so shortly: | ||
| // limit = MAX(minGrowthMemAbs, reclaimed * minGrowthMemPerc / 100) | ||
| interval, minGrowthMemAbs, minGrowthMemPerc := getForcedGOGCParams() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to remove the corresponding properties from the global config and update the doc.
@OhmSpectator because I have extra debugging commits and i do not plan to "extensively test" now , beside memory-monitor is commented and not yet updated |
pkg/pillar/containerd/containerd.go
Outdated
| cmd.Stdout = os.Stdout | ||
| cmd.Stderr = os.Stderr | ||
|
|
||
| // Create ring buffers to capture last 2KB of stdout and stderr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why, if the output is written anyways to stdout/stderr?
logrus.Errorf just writes to stderr again, doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@christoph-zededa maybe because pillar is killed on following Fatal() call? maybe there is other way but I couldn't get error message from containerd otherwise . maybe I could just force flush()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@europaul @christoph-zededa no fix is needed here. I reverted all changes except for PID printing, that was me not parsing logs correctly
I second part of this. I think we should go to containerd v2 and runc v1.3.0, and we should go to cgroups v2. I also think we should not defer because it is a big change; we do that a bit too much (not blaming anyone, I am as much responsible as the next person; it just is reality). I do think we should have an RT branch if there are RT things that are needed and unique, even just for testing and evaluation. But these changes (other than RT-specific) should go into the |
I do not disagree. I will post a PR to master when all components are converted to cgroup v2. For now memory monitor is not so we cannot make that PR but for RT memory-monitor is not an essential part as of now |
ba50898 to
53f7e07
Compare
1c8bfa2 to
cbffa3f
Compare
|
/rerun red |
|
as @OhmSpectator pointed out offline we need to check how consistent our memory settings are. Pillar may use kernel command line arguments to update memory limits, however the only source of truth is the value in config when we update memory limits and a file under /sys/fs/cgroups when we read a current limit. |
eriknordmark
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Run tests
|
@eriknordmark tests failed because this branch uses old kernel on CI. I'm almost done with RT kernel and will integrate it soon |
6333032 to
1cea2fc
Compare
614d30a to
e0f2f2a
Compare
| # Guarantee /persist is writable before containerd starts; fallback to tmpfs if not mounted. | ||
| mkdir -p /mnt/rootfs/persist | ||
| if ! mountpoint -q /mnt/rootfs/persist; then | ||
| echo "init-initrd: /persist not mounted by mount_disk.sh; mounting tmpfs fallback" | ||
| mount -t tmpfs -o mode=0755,size=256M tmpfs /mnt/rootfs/persist | ||
| fi | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this as part of this PR?
| # 1) SEED CPUSET ON THE PARENT (root of the future tree) FIRST | ||
| eff_cpus="$(cat /mnt/rootfs/sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null || echo "")" | ||
| eff_mems="$(cat /mnt/rootfs/sys/fs/cgroup/cpuset.mems.effective 2>/dev/null || echo "")" | ||
| [ -n "$eff_cpus" ] && echo "$eff_cpus" > /mnt/rootfs/sys/fs/cgroup/cpuset.cpus 2>/dev/null || true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be better to log some error in /dev/kmsg instead of this pattern?
e0f2f2a to
8039137
Compare
ENV must have '=' in assignment and lists must be "'ed - packages affected pkg/dom0-ztools pkg/xen-tools Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
…pressure
- Switch from cgroups v1 to v2:
- domainmgr: use containerd/cgroups v3 (cgroup2) and set CPU mask via v2 manager.
- watcher: replace eventfd/v1 logic with v2 polling:
* parse memory.events (oom/oom_kill) and memory.pressure (PSI)
* trigger Go GC under OOM/high pressure using existing thresholds.
- containerd client + hypervisor: adopt v2 metrics (cgroup2/stats),
adjust memory fields and convert CPU usage usec→nsec.
- types/paths: move to memory.max/current/high/events/pressure under
/sys/fs/cgroup/eve[/services/pillar]; kmem usage maps to current in v2.
Notes:
- /sys/fs/cgroup/eve tree is created/enabled by startup script (controllers set).
- PSI is guaranteed (our kernel has CONFIG_PSI=y).
- containerd >= 1.7 works; fleet is moving to 2.1.4 in a later commit.
- External v1 consumers will be migrated in follow-ups.
Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- vendor containerd/cgroups/v3 which itroduces cgroup v2 support Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
EVE runs outdated 1.1.x runc and 1.17 containerd. New versions have following benefits - better support for cgroups v2 - cgropup hardening - fixes for several CVEs - better support for new containerd plugins which may be required for RT Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
For cgroups v2 the process cannot jump to a sibling cgroup so we have to run pillar with access to host cgroups. To archive this we mount host cgroups - upgrade user.toml to v3 format - fix missing directory in agentlog Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
The script print information about cgroups v2 in in nice readable form Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- set linuxkit.unified_cgroup_hierarchy=1 it is not required because this is the default value but let's make it explicit Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- this commit must not be merged to eve/master Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- this commit must not be merged to eve/master Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- Require cgroup v2 (`/sys/fs/cgroup/cgroup.controllers` must exist). - Enable controllers at root (`+cpu +cpuset +memory +io +pids`). - Create unified hierarchy under /sys/fs/cgroup/eve: * containerd * services/ (with sub-cgroups for sshd, pillar, newlogd, etc.) - Parse kernel cmdline for dom0/eve/containerd cgroup memory and CPU limits (dom0_mem, eve_mem, ctrd_mem, *_max_vcpus). - Apply limits using v2 files: memory.max, memory.high, cpuset.cpus, cpuset.mems. - Default to 800M memory / 1 CPU if no kernel args are provided. - Ensure service subgroups inherit proper CPU/mem from parent. Notes: - Kernel must have CONFIG_PSI=y for memory.pressure monitoring later. - Replaces legacy v1 usage of memory.limit_in_bytes et al. - Cgroup v1 is no longer supported (script exits if v2 is missing). Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- 6.12 kernel seems to be bigger and doesn't fit into 270Mb Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
8039137 to
4feab1f
Compare
Description
NOTE: this PR is made to real-time branch!
this PR also contains some changes that won't be merged to master branch, they are just handy for debugging
PR dependencies
This PR depends on kernel PR lf-edge/eve-kernel#215
How to test and validate this PR
Changelog notes
PR Backports
Checklist
And the last but not least:
check them.
Please, check the boxes above after submitting the PR in interactive mode.