OCPBUGS-61792: Collect coredumps on all nodes during CI runs #69344

rfredette · 2025-09-16T21:13:47Z

During the investigation for OCPBUGS-61224, it was found that haproxy in the router containers occasionally segfaulted. However, the coredumps from those segfaults weren't saved, so a reason for the segfault couldn't be accurately determined. This change configures all nodes to save all coredumps, and any coredumps that were saved are copied to the artifacts directory at the end of the CI run. The router pod runs in the host namespace, so any crashes should be captured this way.

While this is intended to help debug router issues, it should also help in cases where crashes occur in any privileged pod.
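
For readers unfamiliar with the collection half of this change, the copy step can be pictured roughly as below. This is only an illustrative Python sketch, not the script added in this PR: the function name `collect_coredumps` is hypothetical, and the directories it is called with are assumptions left to the caller.

```python
import shutil
from pathlib import Path


def collect_coredumps(src_dir: str, dest_dir: str) -> list[str]:
    """Copy every regular file in src_dir into dest_dir and return the copied names.

    Hypothetical helper for illustration only; it is not the step script from this PR.
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    if not src.is_dir():
        # No dump directory: nothing crashed, or dumps are stored somewhere else.
        return []
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.is_file():
            # copy2 keeps timestamps, which helps line a dump up with test logs.
            shutil.copy2(entry, dest / entry.name)
            copied.append(entry.name)
    return copied
```

A caller might pass `/var/lib/systemd/coredump` (the default systemd-coredump storage directory) as the source and a `coredumps` subdirectory of the job's artifacts directory as the destination. Both of those paths are assumptions here, not details confirmed by this PR.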

openshift-ci-robot · 2025-09-16T21:13:53Z

@rfredette: This pull request references Jira Issue OCPBUGS-61792, which is invalid:

expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2025-09-16T21:25:54Z

@rfredette, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not load configuration from base revision of release repo: could not checkout worktree: '[git checkout f9263d7e2acab4f62a49a35b6609de844b78117f]' failed with out:  and error exec: Stdout already set

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

candita · 2025-09-17T01:08:37Z

/jira refresh

openshift-ci-robot · 2025-09-17T01:08:46Z

@candita: This pull request references Jira Issue OCPBUGS-61792, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

candita · 2025-09-17T01:11:58Z

@rfredette do you need to split this into 2 PRs? I see an error in ci-operator-config that indicates it doesn't yet recognize the coredump-service you created within this PR:

time="2025-09-16T21:16:03Z" level=error error="failed to validate configuration openshift/release/openshift-release-master__ci-4.20-upgrade-from-stable-4.19.yaml: Failed resolve MultiStageTestConfiguration: test/e2e-azure-ovn-upgrade: workflow/openshift-upgrade-azure-ovn: invalid step reference: coredump-service"
time="2025-09-16T21:16:03Z" level=fatal msg="error validating configuration files"

Or is it this: https://github.com/openshift/release/pull/69344/files#r2353972823 ?

ci-operator/step-registry/coredump/coredump-ref.yaml
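
The validation error above says that a workflow names a step the registry does not define. The kind of check involved can be sketched roughly as follows. This is an illustrative Python sketch and not ci-operator's actual validation code: `find_unknown_steps` and the shape of its inputs are hypothetical.

```python
def find_unknown_steps(phases, registry):
    """Return one message per step name that the registry does not define.

    phases maps a phase name ("pre", "test", "post") to a list of step names;
    registry is the set of step names known to a (hypothetical) step registry.
    """
    problems = []
    for phase, steps in phases.items():
        for name in steps:
            if name not in registry:
                # Mirror the wording of the error quoted above for readability.
                problems.append(f"{phase}: invalid step reference: {name}")
    return problems
```

Under this sketch, a workflow that references `coredump-service` while the registry only defines a step under a different name would be flagged, which is consistent with either a missing ref file or a name mismatch between the ref and the workflow.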

lihongan · 2025-09-17T02:05:59Z

/jira refresh

openshift-ci-robot · 2025-09-17T02:06:03Z

@lihongan: This pull request references Jira Issue OCPBUGS-61792, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

ShudiLi · 2025-09-17T07:58:16Z

/retest-required

candita · 2025-09-17T15:10:20Z

/assign @alebedev87
/assign @grzpiotrowski

openshift-ci-robot · 2025-09-17T17:22:55Z

@rfredette, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not load configuration from base revision of release repo: could not checkout worktree: '[git checkout bf6b520905c17895694d01917e932d028e3e3144]' failed with out:  and error exec: Stdout already set

rfredette · 2025-09-17T20:35:48Z

/pj-rehearse list

openshift-ci-robot · 2025-09-17T20:35:51Z

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

lihongan · 2025-09-18T01:19:45Z

/pj-rehearse

openshift-ci-robot · 2025-09-18T01:19:48Z

@lihongan: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

ShudiLi · 2025-09-18T06:10:32Z

/retest-required

rfredette · 2025-09-19T19:55:53Z

/retest

openshift-ci-robot · 2025-10-15T19:45:38Z

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

rfredette · 2025-10-16T15:29:46Z

/pj-rehearse

openshift-ci-robot · 2025-10-16T15:29:49Z

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

rfredette · 2025-10-16T20:33:51Z

/pj-rehearse periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade

openshift-ci-robot · 2025-10-16T20:33:54Z

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci-robot · 2025-10-21T14:49:07Z

[REHEARSALNOTIFIER]
@rfredette: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name	Repo	Type	Reason
pull-ci-openshift-router-master-e2e-agnostic	openshift/router	presubmit	Ci-operator config changed
pull-ci-openshift-router-master-e2e-metal-ipi-ovn-dualstack	openshift/router	presubmit	Ci-operator config changed
pull-ci-openshift-router-master-e2e-metal-ipi-ovn-ipv6	openshift/router	presubmit	Ci-operator config changed
pull-ci-openshift-router-master-e2e-metal-ipi-ovn-router	openshift/router	presubmit	Ci-operator config changed
pull-ci-openshift-router-master-e2e-upgrade	openshift/router	presubmit	Ci-operator config changed
pull-ci-openshift-router-master-perfscale-aws-fips-ingress-perf	openshift/router	presubmit	Ci-operator config changed
pull-ci-openshift-router-master-perfscale-aws-ingress-perf	openshift/router	presubmit	Ci-operator config changed
periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade	N/A	periodic	Ci-operator config changed

rfredette · 2025-10-21T15:17:50Z

/pj-rehearse

openshift-ci-robot · 2025-10-21T15:17:52Z

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

rfredette · 2025-10-24T14:52:21Z

/pj-rehearse periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade

openshift-ci-robot · 2025-10-24T14:52:26Z

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci · 2025-10-24T19:11:37Z

@rfredette: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/rehearse/kubevirt-ui/kubevirt-plugin/main/kubevirt-e2e-aws	`3a97402`	link	unknown	`/pj-rehearse pull-ci-kubevirt-ui-kubevirt-plugin-main-kubevirt-e2e-aws`
ci/rehearse/openshift/vmware-vsphere-csi-driver/master/okd-scos-e2e-aws-ovn	`3a97402`	link	unknown	`/pj-rehearse pull-ci-openshift-vmware-vsphere-csi-driver-master-okd-scos-e2e-aws-ovn`
ci/rehearse/openshift/vmware-vsphere-csi-driver/release-4.19/okd-scos-e2e-aws-ovn	`3a97402`	link	unknown	`/pj-rehearse pull-ci-openshift-vmware-vsphere-csi-driver-release-4.19-okd-scos-e2e-aws-ovn`
ci/rehearse/openshift/oc/main/e2e-agent-compact-ipv4	`38a4b5a`	link	unknown	`/pj-rehearse pull-ci-openshift-oc-main-e2e-agent-compact-ipv4`
ci/rehearse/openshift/oc/release-4.22/e2e-agent-compact-ipv4	`38a4b5a`	link	unknown	`/pj-rehearse pull-ci-openshift-oc-release-4.22-e2e-agent-compact-ipv4`
ci/rehearse/periodic-ci-3scale-qe-3scale-deploy-main-3scale-amp-ocp4.13-lp-interop-3scale-amp-interop-aws	`17f16a6`	link	unknown	`/pj-rehearse periodic-ci-3scale-qe-3scale-deploy-main-3scale-amp-ocp4.13-lp-interop-3scale-amp-interop-aws`
ci/rehearse/openshift/router/master/e2e-metal-ipi-ovn-ipv6	`a05a1fa`	link	unknown	`/pj-rehearse pull-ci-openshift-router-master-e2e-metal-ipi-ovn-ipv6`
ci/rehearse/openshift/router/master/e2e-metal-ipi-ovn-router	`a05a1fa`	link	unknown	`/pj-rehearse pull-ci-openshift-router-master-e2e-metal-ipi-ovn-router`
ci/rehearse/openshift/router/master/e2e-metal-ipi-ovn-dualstack	`a05a1fa`	link	unknown	`/pj-rehearse pull-ci-openshift-router-master-e2e-metal-ipi-ovn-dualstack`
ci/rehearse/periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade	`a05a1fa`	link	unknown	`/pj-rehearse periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

rfredette · 2025-10-27T15:26:07Z

Several rehearse jobs timed out, but in all cases, the enable-node-coredumps step I added is taking somewhere around 10-15 seconds. I don't think the timeouts are related to this change.

/pj-rehearse ack

openshift-ci-robot · 2025-10-27T15:26:10Z

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

rfredette · 2025-10-27T18:33:52Z

@jluhrsen please take another look!

I haven't been able to verify that this works in CI since we don't currently have a job that's segfaulting. That said, I've manually installed the generated machine configs, caused a segfault, and run the must-gather commands on a clusterbot cluster, so it should work.

I'd like to merge this and test using my router PR, openshift/router#677, if possible. I need to merge one of the 2 PRs in order to test this properly in CI. The router one can negatively affect other PRs' CI runs with segfault-related failures, but even if there is some reason the core dump collection doesn't work as intended, the pj-rehearse runs indicate that it'll be an innocuous change.

alebedev87 · 2025-10-27T20:52:41Z

ci-operator/config/openshift/router/openshift-router-master.yaml

Do we have to specify the post chain from the given workflow? Isn't gather-core-dump alone, added to post, enough? Same question for the enable-node-coredumps pre step.

rfredette · 2025-10-28

I did initially try just putting enable-node-coredumps in pre and gather-core-dump in post, but the tests failed in about 5 minutes with an error about missing some credentials. As I understand it, if you don't specify pre or post steps, there's a default based on the platform you choose, but specifying anything overrides that default.

alebedev87 · 2025-10-28

Ack. It would be hard for me to assess whether the additional pre and post chains are correct for every job, as I would have to go through all of them. So I trust Ryan on this one. LGTM
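
The override behavior described in this thread can be pictured with a small Python sketch. It assumes, as the reply above puts it, that a phase set on the test replaces the workflow's default list for that phase rather than being appended to it. `resolve_phases` is a hypothetical name, and this is not ci-operator's real resolution code.

```python
def resolve_phases(workflow, test):
    """Return the effective pre/test/post step lists for one test.

    workflow and test each map a phase name to a list of step names.
    Assumption: a phase present on the test replaces the workflow's list
    for that phase wholesale; nothing is merged or appended.
    """
    resolved = {}
    for phase in ("pre", "test", "post"):
        if phase in test:
            resolved[phase] = list(test[phase])
        else:
            resolved[phase] = list(workflow.get(phase, []))
    return resolved
```

Under that assumption, a test that sets `pre` to only `enable-node-coredumps` would lose whatever install and credential steps the workflow's own `pre` contained, which would be consistent with the quick credentials failure mentioned above. That is why the full chains are restated alongside the new steps.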

alebedev87 · 2025-10-28T16:57:50Z

/approve

I'll let Jame give the LGTM.

neisw · 2025-11-18T00:10:51Z

/lgtm

Based upon #69344 (comment) and @jluhrsen being out on PTO

openshift-ci · 2025-11-18T00:11:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87, neisw, rfredette

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ci-operator/config/openshift/release/OWNERS~~ [neisw]
~~ci-operator/config/openshift/router/OWNERS~~ [alebedev87,rfredette]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-11-18T00:30:24Z

@rfredette: Jira Issue OCPBUGS-61792: All pull requests linked via external trackers have merged:

openshift/release#69344

Jira Issue OCPBUGS-61792 has been moved to the MODIFIED state.

openshift-ci-robot added the jira/valid-reference and jira/invalid-bug labels Sep 16, 2025

openshift-ci bot requested review from smg247 and stbenjam September 16, 2025 21:14

openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Sep 17, 2025

openshift-ci bot requested a review from lihongan September 17, 2025 01:08

candita reviewed Sep 17, 2025

ci-operator/step-registry/coredump/coredump-ref.yaml (outdated)

openshift-ci bot requested a review from ShudiLi September 17, 2025 02:06

openshift-ci bot assigned alebedev87 and grzpiotrowski Sep 17, 2025

rfredette force-pushed the collect-coredumps branch 2 times, most recently from c55febc to 0eab17e Compare September 17, 2025 17:08

rfredette force-pushed the collect-coredumps branch from 0eab17e to 3a97402 Compare September 17, 2025 17:24

rfredette force-pushed the collect-coredumps branch from 906be21 to 60da054 Compare October 16, 2025 15:19

rfredette force-pushed the collect-coredumps branch from 60da054 to 0f09c41 Compare October 17, 2025 18:49

Collect coredumps on all nodes

a05a1fa

Configure all nodes to save coredumps, and collect any coredumps that were saved during the gather-core-dump step.

rfredette force-pushed the collect-coredumps branch from 0f09c41 to a05a1fa Compare October 21, 2025 14:44

openshift-ci-robot added the rehearsals-ack label Oct 27, 2025

alebedev87 reviewed Oct 27, 2025

openshift-ci bot assigned neisw Nov 18, 2025

openshift-ci bot added the lgtm label Nov 18, 2025

openshift-ci bot added the approved label Nov 18, 2025

openshift-merge-bot bot merged commit 98eff7d into openshift:master Nov 18, 2025
17 of 21 checks passed

namansharma18899 pushed a commit to namansharma18899/release that referenced this pull request Nov 24, 2025

Collect coredumps on all nodes (openshift#69344)

4e9102b

Configure all nodes to save coredumps, and collect any coredumps that were saved during the gather-core-dump step.

dfrazzette pushed a commit to dfrazzette/release that referenced this pull request Dec 9, 2025

Collect coredumps on all nodes (openshift#69344)

a161dde

Configure all nodes to save coredumps, and collect any coredumps that were saved during the gather-core-dump step.
