
Conversation

@rucoder (Contributor) commented Sep 2, 2025

Description

NOTE: this PR targets the real-time branch!

This PR also contains some changes that won't be merged to the master branch; they are just handy for debugging.

  1. Task list
  • integrate new runc and containerd
  • update containerd config
  • update "user" containerd
  • update startup scripts
  • test eve onboarding
  • test app deployment
  • add a script to the debug container to audit cgroup v2

PR dependencies

This PR depends on kernel PR lf-edge/eve-kernel#215

How to test and validate this PR

Changelog notes

PR Backports

- 14.5-stable: No, as the feature is not available there.
- 13.4-stable: No, as the feature is not available there.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

And last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

# unified_cgroup_hierarchy=0 forces cgroups v1, which is required until pillar is ready to support v2
cmdline: "rootdelay=3 linuxkit.unified_cgroup_hierarchy=0"
# force cgroup v2
cmdline: "rootdelay=3 linuxkit.unified_cgroup_hierarchy=1 linuxkit.runc_console=1 linuxkit.runc_debug=1"
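For reference, which hierarchy a given cmdline selects can be checked mechanically. A minimal sketch, assuming linuxkit's default of the unified hierarchy when the flag is absent (the `cgroup_mode` helper itself is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: given a kernel cmdline string, report which cgroup
# hierarchy it requests. linuxkit defaults to the unified (v2) hierarchy
# when the flag is absent, and =0 forces the legacy v1 layout.
cgroup_mode() {
    case "$1" in
        *linuxkit.unified_cgroup_hierarchy=0*) echo "v1" ;;
        *) echo "v2" ;;
    esac
}

cgroup_mode "rootdelay=3 linuxkit.unified_cgroup_hierarchy=0"
cgroup_mode "rootdelay=3 linuxkit.unified_cgroup_hierarchy=1"
```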
Contributor

Did you want to leave runc debugging and console enabled? Those are good for debugging, but it would be messy if left in permanently.

Also, it doesn't hurt to leave linuxkit.unified_cgroup_hierarchy=1, but you can remove it entirely, as this is the default.

Contributor Author @rucoder, Sep 3, 2025

Also, it doesn't hurt to leave linuxkit.unified_cgroup_hierarchy=1, but you can remove it entirely, as this is the default.

I wanted to make it explicit

Did you want to leave runc debugging and console enabled? Those are good for debugging, but it would be messy if left in permanently.

No, and the commit message says that :). real-time is a temporary branch for RT development.

Contributor

I wanted to make it explicit

No problem there.

Contributor

Ah, I just realized you are merging this into the real-time branch; I thought it was going into master.

That raises the question: why is this going into real-time and not into master? We need to get this done, so why not do it in master? Then the work for RT is just for RT.

Contributor

I just realized you are merging this into the real-time branch; I thought it was going into master.

same. @rucoder can you change the title of the PR to something like:
[real-time] cgroups v2 support implementation

Contributor Author

I just realized you are merging this into the real-time branch; I thought it was going into master.

same. @rucoder can you change the title of the PR to something like: [real-time] cgroups v2 support implementation

I asked @OhmSpectator to create a label for this work "rt-wip"

capabilities:
- all
pid: host
cgroupns: host
Contributor

As discussed offline, I don't think this is a valid setting here.

Contributor Author

@deitch yes, it seems a bind mount is enough, but let's investigate further

Contributor

As we discussed offline, I think that there is no option for this currently in lkt pkg. It should belong in config, but there is nothing there. If this works, then that is great; if you need something added, then open an issue in lkt and I will take care of it.

Contributor Author

As we discussed offline, I think that there is no option for this currently in lkt pkg. It should belong in config, but there is nothing there. If this works, then that is great; if you need something added, then open an issue in lkt and I will take care of it.

@deitch I think bindNS can be it; you can specify how to bind namespaces under this block, but I'm not sure

Contributor Author @rucoder, Sep 3, 2025

@deitch so the bind mount is indeed enough and cgroupns: host has no effect, but I do not understand why it works. Why can a container from a child cgroup manipulate system cgroups? Because it has capabilities: all, so it is root with all possible permissions?

Contributor

Maybe. Could be that it causes it to do so. But if it works, it works; take the win. 😄

But be sure to remove the invalid cgroupNS from the config.

[ -f /persist/agentdebug/watcher/sigusr1 ] && cp /persist/agentdebug/watcher/sigusr1 "$DIR/goroutin-leak-detector-stacks-dump"
ln -s /run "$DIR/root-run"
cp -r /sys/fs/cgroup/memory "$DIR/sys-fs-cgroup-memory" >/dev/null 2>&1
# Collect cgroup v2 information
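A possible shape for that v2 collection step, sketched with the source and destination parameterized so it can run outside EVE (the `collect_cgroup2` name is hypothetical; with v2 there is a single unified tree rather than per-controller mounts like /sys/fs/cgroup/memory):

```shell
#!/bin/sh
# Hypothetical sketch: copy the whole unified cgroup tree in one pass.
# src would normally be /sys/fs/cgroup; both are parameters for testing.
collect_cgroup2() {
    src="$1"; dst="$2"
    mkdir -p "$dst"
    cp -r "$src" "$dst/sys-fs-cgroup" 2>/dev/null || true
}
```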
Contributor

Please update the version at the top of collect-info.sh

Comment on lines 161 to 164
else
echo "ERROR: cgroup v2 not available - cgroup v1 is no longer supported"
exit 1
fi
Contributor

Would be easier to read the diffs if you put this check at the top and then avoid the else branch indentation for the actual code (since you exit).
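A sketch of the suggested guard-clause shape, with the cgroup root parameterized only so the helper can be exercised off-device (on EVE it would be /sys/fs/cgroup):

```shell
#!/bin/sh
# Hypothetical helper: fail fast at the top if cgroup v2 is missing, so
# the main path that follows needs no else branch or extra indentation.
require_cgroup2() {
    if [ ! -f "$1/cgroup.controllers" ]; then
        echo "ERROR: cgroup v2 not available - cgroup v1 is no longer supported" >&2
        return 1
    fi
}
# ... main logic continues here, unindented ...
```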

Contributor Author

@eriknordmark agree. will fix

@@ -0,0 +1,452 @@
# cgroup v2 Migration Plan and Analysis
Contributor

I don't know if this file will remain once the migration is done, but in any case, pkg/pillar/docs/ makes more sense as a location.

Contributor Author

@eriknordmark ah, I committed this file by mistake. I'll update docs later and I'll remove this file

@@ -0,0 +1,190 @@
// Copyright (c) 2024 Zededa, Inc.
Contributor

Do we expect this to run as part of tests? Then it needs to be in a directory with golang code AFAIU

Contributor Author

@eriknordmark no, not really. I'll remove it. it was a test during development

Contributor @europaul left a comment

just some small comments

"time"

"github.com/containerd/cgroups"
cgroupsv3 "github.com/containerd/cgroups/v3/cgroup2"
Contributor

I'd prefer to call it cgroup2 or cgroupsv2. I think v3 is really confusing here - I had to look it up to find out why it's called like that

Contributor Author

@europaul yes, makes sense

Contributor

Yeah, this really is confusing, but it is not @rucoder's fault. containerd has v1 and v2 of their config file, both of which apply to containerd v1.x, and then there is v3 of the config, which applies only to containerd v2. And don't ask how cgroups v1 and v2 match to it. 😆

Contributor Author

@deitch @europaul fixed. I was going to rename it but forgot.

require (
github.com/anatol/smart.go v0.0.0-20241126061019-f03d79b340d2
github.com/andrewd-zededa/go-libzfs v0.0.0-20240304231806-6a64e99da97d
github.com/containerd/cgroups v1.1.0
Contributor

should we use github.com/containerd/cgroups/v3/cgroup2/stats here?

Contributor Author

@europaul sorry, I did not understand the question. This file is auto-generated; which commit are you referring to? Maybe it is updated later?

Contributor Author

@europaul ah, got it. I'll check. good catch!

Contributor Author

@europaul I updated to v2 stats


"google.golang.org/grpc/connectivity"

"github.com/armon/circbuf"
Contributor

just a suggestion, maybe we could use the standard library here instead https://pkg.go.dev/container/ring

Contributor Author

just a suggestion, maybe we could use the standard library here instead https://pkg.go.dev/container/ring

You are suggesting a ring buffer adapter, and I need a writer.

Contributor

isn't it just like this?

func (r *ringWriter) Write(bs []byte) (int, error) {
    r.ring.Value = bs
    r.ring = r.ring.Next()
    return len(bs), nil
}

Contributor Author

@christoph-zededa no, it is not. This would be a list of byte slices, and we cannot guarantee the size of the final data structure anymore.

Contributor Author

@europaul @christoph-zededa this is a very small dependency, so I'm keeping it

Contributor

yes, bytes is wrong, it should be runes, especially since people use emojis in log messages ;-)

Member @OhmSpectator left a comment

The change is huge. It should come with an extensive test plan to check for regressions in any component.
I also don't understand why we need v2 and a specific RT branch in general. We tested RT before without any extra changes and it was ok.

event := cgroups.MemoryPressureEvent(cgroups.MediumPressure,
cgroups.HierarchyMode)
efd, err := controller.RegisterMemoryEvent(event)
memoryEventsFile := "/sys/fs/cgroup/eve/memory.events"
Member

The new implementation is not 1-to-1 with the old logic... Should check it extra carefully.

// @minGrowthMemAbs and certain fraction from the last
// reclaim, so shortly:
// limit = MAX(minGrowthMemAbs, reclaimed * minGrowthMemPerc / 100)
interval, minGrowthMemAbs, minGrowthMemPerc := getForcedGOGCParams()
Member

We have to remove the corresponding properties from the global config and update the doc.

@rucoder (Contributor Author) commented Sep 3, 2025

The change is huge. It should come with an extensive test plan to check for regressions in any component. I also don't understand why we need v2 and a specific RT branch in general. We tested RT before without any extra changes and it was ok.

@OhmSpectator because I have extra debugging commits and I do not plan to "extensively test" now; besides, memory-monitor is commented out and not yet updated

cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr

// Create ring buffers to capture last 2KB of stdout and stderr
Contributor

But why, if the output is written anyways to stdout/stderr?
logrus.Errorf just writes to stderr again, doesn't it?

Contributor Author @rucoder, Sep 3, 2025

@christoph-zededa maybe because pillar is killed on the following Fatal() call? Maybe there is another way, but I couldn't get the error message from containerd otherwise. Maybe I could just force a flush().

Contributor Author

@europaul @christoph-zededa no fix is needed here. I reverted all changes except for PID printing, that was me not parsing logs correctly

@deitch (Contributor) commented Sep 3, 2025

I also don't understand why we need v2 and a specific RT branch in general. We tested RT before without any extra changes and it was ok.

I second part of this. I think we should go to containerd v2 and runc v1.3.0, and we should go to cgroups v2. I also think we should not defer because it is a big change; we do that a bit too much (not blaming anyone, I am as much responsible as the next person; it just is reality).

I do think we should have an RT branch if there are RT things that are needed and unique, even just for testing and evaluation. But these changes (other than RT-specific) should go into the master branch, and should go as soon as possible. Especially after @rucoder did all of the work to figure this out.

@rucoder (Contributor Author) commented Sep 3, 2025

I also don't understand why we need v2 and a specific RT branch in general. We tested RT before without any extra changes and it was ok.

I second part of this. I think we should go to containerd v2 and runc v1.3.0, and we should go to cgroups v2. I also think we should not defer because it is a big change; we do that a bit too much (not blaming anyone, I am as much responsible as the next person; it just is reality).

I do think we should have an RT branch if there are RT things that are needed and unique, even just for testing and evaluation. But these changes (other than RT-specific) should go into the master branch, and should go as soon as possible. Especially after @rucoder did all of the work to figure this out.

I do not disagree. I will post a PR to master when all components are converted to cgroup v2. For now memory-monitor is not, so we cannot make that PR; but for RT, memory-monitor is not an essential part as of now.

@rucoder (Contributor Author) commented Sep 8, 2025

/rerun red

@rucoder (Contributor Author) commented Sep 11, 2025

as @OhmSpectator pointed out offline, we need to check how consistent our memory settings are. Pillar may use kernel command line arguments to update memory limits; however, the only source of truth is the value in config when we update memory limits, and a file under /sys/fs/cgroup when we read the current limit.
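To make the two sources concrete: one is a token on the kernel cmdline, the other is a file under /sys/fs/cgroup. A hypothetical helper for the cmdline side (token names like eve_mem follow the startup script; the function itself is illustrative):

```shell
#!/bin/sh
# Hypothetical helper: extract a "key=value" token (e.g. eve_mem=800M)
# from a kernel cmdline string. $1 is the key, $2 the cmdline.
cmdline_mem() {
    for tok in $2; do
        case "$tok" in
            "$1"=*) echo "${tok#"$1"=}" ;;
        esac
    done
}
```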

Contributor @eriknordmark left a comment

Run tests

@rucoder (Contributor Author) commented Sep 17, 2025

@eriknordmark tests failed because this branch uses an old kernel on CI. I'm almost done with the RT kernel and will integrate it soon.

@rucoder rucoder changed the base branch from real-time to feature/real-time September 18, 2025 11:46
@rucoder rucoder force-pushed the rucoder/cgroup-v2 branch 3 times, most recently from 614d30a to e0f2f2a Compare September 19, 2025 10:36
Comment on lines +149 to +155
# Guarantee /persist is writable before containerd starts; fallback to tmpfs if not mounted.
mkdir -p /mnt/rootfs/persist
if ! mountpoint -q /mnt/rootfs/persist; then
echo "init-initrd: /persist not mounted by mount_disk.sh; mounting tmpfs fallback"
mount -t tmpfs -o mode=0755,size=256M tmpfs /mnt/rootfs/persist
fi

Contributor

Why do we need this as part of this PR?

# 1) SEED CPUSET ON THE PARENT (root of the future tree) FIRST
eff_cpus="$(cat /mnt/rootfs/sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null || echo "")"
eff_mems="$(cat /mnt/rootfs/sys/fs/cgroup/cpuset.mems.effective 2>/dev/null || echo "")"
[ -n "$eff_cpus" ] && echo "$eff_cpus" > /mnt/rootfs/sys/fs/cgroup/cpuset.cpus 2>/dev/null || true
Member

wouldn't it be better to log some error in /dev/kmsg instead of this pattern?
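A sketch of what that could look like, with the log target parameterized only for off-device testing (the real script would pass /dev/kmsg; the `klog` helper is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: report failures to the kernel log instead of
# silently discarding them with `|| true`. $1 is the log device
# (normally /dev/kmsg); falls back to stderr if it is not writable.
klog() {
    dev="$1"; shift
    printf '<4>init-initrd: %s\n' "$*" >> "$dev" 2>/dev/null \
        || printf 'init-initrd: %s\n' "$*" >&2
}
```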

ENV must have '=' in assignments, and lists must be quoted with `"`

- packages affected
pkg/dom0-ztools
pkg/xen-tools

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
…pressure

- Switch from cgroups v1 to v2:
  - domainmgr: use containerd/cgroups v3 (cgroup2) and set CPU mask via v2 manager.
  - watcher: replace eventfd/v1 logic with v2 polling:
    * parse memory.events (oom/oom_kill) and memory.pressure (PSI)
    * trigger Go GC under OOM/high pressure using existing thresholds.
  - containerd client + hypervisor: adopt v2 metrics (cgroup2/stats),
    adjust memory fields and convert CPU usage usec→nsec.
  - types/paths: move to memory.max/current/high/events/pressure under
    /sys/fs/cgroup/eve[/services/pillar]; kmem usage maps to current in v2.

Notes:
- /sys/fs/cgroup/eve tree is created/enabled by startup script (controllers set).
- PSI is guaranteed (our kernel has CONFIG_PSI=y).
- containerd >= 1.7 works; fleet is moving to 2.1.4 in a later commit.
- External v1 consumers will be migrated in follow-ups.

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- vendor containerd/cgroups/v3, which introduces cgroup v2 support

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
EVE runs an outdated 1.1.x runc and 1.17 containerd. The new versions have the following benefits:

- better support for cgroups v2
- cgroup hardening
- fixes for several CVEs
- better support for new containerd plugins, which may be required for RT

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
For cgroups v2 a process cannot jump to a sibling cgroup, so we
have to run pillar with access to the host cgroups. To achieve this we
mount the host cgroups.

- upgrade user.toml to v3 format
- fix missing directory in agentlog

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
The script prints information about cgroups v2 in a nice, readable form

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- set linuxkit.unified_cgroup_hierarchy=1
It is not required because this is the default value, but let's make it explicit.

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- this commit must not be merged to eve/master

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- this commit must not be merged to eve/master

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
- Require cgroup v2 (`/sys/fs/cgroup/cgroup.controllers` must exist).
- Enable controllers at root (`+cpu +cpuset +memory +io +pids`).
- Create unified hierarchy under /sys/fs/cgroup/eve:
  * containerd
  * services/ (with sub-cgroups for sshd, pillar, newlogd, etc.)
- Parse kernel cmdline for dom0/eve/containerd cgroup memory and CPU limits
  (dom0_mem, eve_mem, ctrd_mem, *_max_vcpus).
- Apply limits using v2 files: memory.max, memory.high, cpuset.cpus, cpuset.mems.
- Default to 800M memory / 1 CPU if no kernel args are provided.
- Ensure service subgroups inherit proper CPU/mem from parent.

Notes:
- Kernel must have CONFIG_PSI=y for memory.pressure monitoring later.
- Replaces legacy v1 usage of memory.limit_in_bytes et al.
- Cgroup v1 is no longer supported (script exits if v2 is missing).

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
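A minimal sketch of the limit-application step described above, with the cgroup directory parameterized so it can run off-device (on EVE it would live under /sys/fs/cgroup/eve; the 800M default is from the commit message; the helper name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: apply a v2 memory limit by writing memory.max.
# $1 is the cgroup directory, $2 the limit (defaults to 800M, matching
# the fallback described in the commit message).
apply_mem_limit() {
    cg="$1"
    mem="${2:-800M}"
    echo "$mem" > "$cg/memory.max"
}
```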
- the 6.12 kernel seems to be bigger and doesn't fit into 270 MB

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
7 participants