@rfredette
Contributor

During the investigation for OCPBUGS-61224, it was found that haproxy in the router containers occasionally segfaulted. However, the coredumps from those segfaults weren't saved, so a reason for the segfault couldn't be accurately determined. This change configures all nodes to save all coredumps, and any coredumps that were saved are copied to the artifacts directory at the end of the CI run. The router pod runs in the host namespace, so any crashes should be captured this way.

While this is intended to help debug router issues, it should also help in cases where crashes occur in any privileged pod.
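As a rough illustration of the collection half of this change, the sketch below copies whatever dump files exist in one directory into a `coredumps` folder under an artifacts directory. It is not the step script from this PR (the thread mentions must-gather commands for the real collection), and the function name, the `coredumps` subfolder, and the example file name are all hypothetical.

```python
import shutil
from pathlib import Path


def collect_coredumps(dump_dir: Path, artifact_dir: Path) -> list[Path]:
    """Copy every regular file in dump_dir into artifact_dir/coredumps.

    Hypothetical helper: the directory layout is an assumption made for
    illustration, not the layout used by the CI step in this PR.
    """
    copied: list[Path] = []
    if not dump_dir.is_dir():
        # No dump directory means nothing crashed (or nothing was saved).
        return copied
    dest = artifact_dir / "coredumps"
    dest.mkdir(parents=True, exist_ok=True)
    for dump in sorted(dump_dir.iterdir()):
        if dump.is_file():
            # copy2 preserves timestamps, which helps correlate a dump with logs.
            copied.append(Path(shutil.copy2(dump, dest / dump.name)))
    return copied
```

Returning an empty list when the dump directory is missing keeps the common case (no crash during the run) from failing the collection step.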

@openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels Sep 16, 2025
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-61792, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from smg247 and stbenjam September 16, 2025 21:14
@openshift-ci-robot
Contributor

@rfredette, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not load configuration from base revision of release repo: could not checkout worktree: '[git checkout f9263d7e2acab4f62a49a35b6609de844b78117f]' failed with out:  and error exec: Stdout already set
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@candita
Contributor

candita commented Sep 17, 2025

/jira refresh

@openshift-ci-robot added the jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) label and removed the jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) label Sep 17, 2025
@openshift-ci-robot
Contributor

@candita: This pull request references Jira Issue OCPBUGS-61792, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

@openshift-ci openshift-ci bot requested a review from lihongan September 17, 2025 01:08
@candita
Contributor

candita commented Sep 17, 2025

@rfredette do you need to split this into 2 PRs? I see an error in ci-operator-config that indicates it doesn't yet recognize the coredump-service you created within this PR:

time="2025-09-16T21:16:03Z" level=error error="failed to validate configuration openshift/release/openshift-release-master__ci-4.20-upgrade-from-stable-4.19.yaml: Failed resolve MultiStageTestConfiguration: test/e2e-azure-ovn-upgrade: workflow/openshift-upgrade-azure-ovn: invalid step reference: coredump-service"
time="2025-09-16T21:16:03Z" level=fatal msg="error validating configuration files"

Or is it this: https://github.com/openshift/release/pull/69344/files#r2353972823 ?

@lihongan
Contributor

/jira refresh

@openshift-ci-robot
Contributor

@lihongan: This pull request references Jira Issue OCPBUGS-61792, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

@openshift-ci openshift-ci bot requested a review from ShudiLi September 17, 2025 02:06
@ShudiLi
Member

ShudiLi commented Sep 17, 2025

/retest-required

@candita
Contributor

candita commented Sep 17, 2025

/assign @alebedev87
/assign @grzpiotrowski

@rfredette rfredette force-pushed the collect-coredumps branch 2 times, most recently from c55febc to 0eab17e Compare September 17, 2025 17:08
@openshift-ci-robot
Contributor

@rfredette, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not load configuration from base revision of release repo: could not checkout worktree: '[git checkout bf6b520905c17895694d01917e932d028e3e3144]' failed with out:  and error exec: Stdout already set

@rfredette
Contributor Author

/pj-rehearse list

@openshift-ci-robot
Contributor

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@lihongan
Contributor

/pj-rehearse

@openshift-ci-robot
Contributor

@lihongan: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ShudiLi
Member

ShudiLi commented Sep 18, 2025

/retest-required

@rfredette
Contributor Author

/retest

@openshift-ci-robot
Contributor

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@rfredette
Contributor Author

/pj-rehearse

@openshift-ci-robot
Contributor

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@rfredette
Contributor Author

/pj-rehearse periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade

@openshift-ci-robot
Contributor

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Configure all nodes to save coredumps, and collect any coredumps that
were saved during the gather-core-dump step.
@openshift-ci-robot
Contributor

[REHEARSALNOTIFIER]
@rfredette: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-router-master-e2e-agnostic openshift/router presubmit Ci-operator config changed
pull-ci-openshift-router-master-e2e-metal-ipi-ovn-dualstack openshift/router presubmit Ci-operator config changed
pull-ci-openshift-router-master-e2e-metal-ipi-ovn-ipv6 openshift/router presubmit Ci-operator config changed
pull-ci-openshift-router-master-e2e-metal-ipi-ovn-router openshift/router presubmit Ci-operator config changed
pull-ci-openshift-router-master-e2e-upgrade openshift/router presubmit Ci-operator config changed
pull-ci-openshift-router-master-perfscale-aws-fips-ingress-perf openshift/router presubmit Ci-operator config changed
pull-ci-openshift-router-master-perfscale-aws-ingress-perf openshift/router presubmit Ci-operator config changed
periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade N/A periodic Ci-operator config changed

@rfredette
Contributor Author

/pj-rehearse

@openshift-ci-robot
Contributor

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@rfredette
Contributor Author

/pj-rehearse periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade

@openshift-ci-robot
Contributor

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci
Contributor

openshift-ci bot commented Oct 24, 2025

@rfredette: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/rehearse/kubevirt-ui/kubevirt-plugin/main/kubevirt-e2e-aws 3a97402 link unknown /pj-rehearse pull-ci-kubevirt-ui-kubevirt-plugin-main-kubevirt-e2e-aws
ci/rehearse/openshift/vmware-vsphere-csi-driver/master/okd-scos-e2e-aws-ovn 3a97402 link unknown /pj-rehearse pull-ci-openshift-vmware-vsphere-csi-driver-master-okd-scos-e2e-aws-ovn
ci/rehearse/openshift/vmware-vsphere-csi-driver/release-4.19/okd-scos-e2e-aws-ovn 3a97402 link unknown /pj-rehearse pull-ci-openshift-vmware-vsphere-csi-driver-release-4.19-okd-scos-e2e-aws-ovn
ci/rehearse/openshift/oc/main/e2e-agent-compact-ipv4 38a4b5a link unknown /pj-rehearse pull-ci-openshift-oc-main-e2e-agent-compact-ipv4
ci/rehearse/openshift/oc/release-4.22/e2e-agent-compact-ipv4 38a4b5a link unknown /pj-rehearse pull-ci-openshift-oc-release-4.22-e2e-agent-compact-ipv4
ci/rehearse/periodic-ci-3scale-qe-3scale-deploy-main-3scale-amp-ocp4.13-lp-interop-3scale-amp-interop-aws 17f16a6 link unknown /pj-rehearse periodic-ci-3scale-qe-3scale-deploy-main-3scale-amp-ocp4.13-lp-interop-3scale-amp-interop-aws
ci/rehearse/openshift/router/master/e2e-metal-ipi-ovn-ipv6 a05a1fa link unknown /pj-rehearse pull-ci-openshift-router-master-e2e-metal-ipi-ovn-ipv6
ci/rehearse/openshift/router/master/e2e-metal-ipi-ovn-router a05a1fa link unknown /pj-rehearse pull-ci-openshift-router-master-e2e-metal-ipi-ovn-router
ci/rehearse/openshift/router/master/e2e-metal-ipi-ovn-dualstack a05a1fa link unknown /pj-rehearse pull-ci-openshift-router-master-e2e-metal-ipi-ovn-dualstack
ci/rehearse/periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade a05a1fa link unknown /pj-rehearse periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rfredette
Contributor Author

Several rehearse jobs timed out, but in all cases, the enable-node-coredumps step I added is taking somewhere around 10-15 seconds. I don't think the timeouts are related to this change.

/pj-rehearse ack

@openshift-ci-robot
Contributor

@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot added the rehearsals-ack (Signifies that rehearsal jobs have been acknowledged) label Oct 27, 2025
@rfredette
Contributor Author

@jluhrsen please take another look!

I haven't been able to verify that this works in CI since we don't currently have a job that's segfaulting. That said, I've manually installed the generated machine configs, caused a segfault, and run the must-gather commands on a clusterbot cluster, so it should work.

I'd like to merge this and test using my router PR, openshift/router#677, if possible. I need to merge one of the 2 PRs in order to test this properly in CI. The router one can negatively affect other PRs' CI runs with segfault-related failures, but even if there is some reason the core dump collection doesn't work as intended, the pj-rehearse runs indicate that it'll be an innocuous change.

Contributor

Do we have to specify the post chain from the given workflow? Is gather-core-dump alone, added to post, not enough? Same question for the enable-node-coredumps pre step.

Contributor Author

I did initially try just putting enable-node-coredumps in pre and gather-core-dump in post, but the tests failed in about 5 minutes with an error about missing some credentials. As I understand it, if you don't specify pre or post steps, there's a default based on the platform you choose, but specifying anything overrides that default.
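The override behavior described above can be modeled with a small sketch. This is only a toy model of the semantics as understood in this thread, not ci-operator code; the chain names `hypothetical-platform-pre` and `hypothetical-platform-post` are placeholders, and the position of the coredump steps relative to the chains is an assumption.

```python
from typing import Optional

# Hypothetical workflow defaults; real chain names depend on the workflow.
WORKFLOW_DEFAULTS = {
    "pre": ["hypothetical-platform-pre"],
    "post": ["hypothetical-platform-post"],
}


def resolve_phase(phase: str, test_steps: Optional[list[str]]) -> list[str]:
    """Return the steps that run for one phase of a test.

    None means the test did not set the phase, so the workflow default applies.
    Any explicit list replaces the default entirely; nothing is merged.
    """
    if test_steps is None:
        return list(WORKFLOW_DEFAULTS[phase])
    return list(test_steps)


# Listing only the new step drops the platform setup, which would explain
# an early failure such as missing credentials.
naive_pre = resolve_phase("pre", ["enable-node-coredumps"])

# Restating the workflow's chain alongside the new step keeps both.
# The ordering shown here is an assumption.
full_pre = resolve_phase("pre", WORKFLOW_DEFAULTS["pre"] + ["enable-node-coredumps"])
full_post = resolve_phase("post", ["gather-core-dump"] + WORKFLOW_DEFAULTS["post"])
```

Under this model, `naive_pre` contains only `enable-node-coredumps`, which is why the chains have to be repeated in each job that adds the new steps.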

Contributor

Ack. It would be hard for me to assess whether the additional pre and post chains are correct for every job, as I would have to go through all of them. So I trust Ryan on this one. LGTM

@alebedev87
Contributor

/approve

I'll let Jame give the LGTM.

@neisw
Contributor

neisw commented Nov 18, 2025

/lgtm

Based upon #69344 (comment) and @jluhrsen being out on PTO

@openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged.) label Nov 18, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87, neisw, rfredette

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label Nov 18, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 98eff7d into openshift:master Nov 18, 2025
17 of 21 checks passed
@openshift-ci-robot
Contributor

@rfredette: Jira Issue OCPBUGS-61792: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-61792 has been moved to the MODIFIED state.

namansharma18899 pushed a commit to namansharma18899/release that referenced this pull request Nov 24, 2025
Configure all nodes to save coredumps, and collect any coredumps that
were saved during the gather-core-dump step.
dfrazzette pushed a commit to dfrazzette/release that referenced this pull request Dec 9, 2025
Configure all nodes to save coredumps, and collect any coredumps that
were saved during the gather-core-dump step.
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • qe-approved: Signifies that QE has signed off on this PR
  • rehearsals-ack: Signifies that rehearsal jobs have been acknowledged

9 participants