OCPBUGS-61792: Collect coredumps on all nodes during CI runs #69344
@rfredette: This pull request references Jira Issue OCPBUGS-61792, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@rfredette, Interacting with pj-rehearse
Comment: Once you are satisfied with the results of the rehearsals, comment: |
|
/jira refresh |
|
@candita: This pull request references Jira Issue OCPBUGS-61792, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: |
|
@rfredette do you need to split this into 2 PRs? I see an error in ci-operator-config that indicates it doesn't yet recognize the coredump-service you created within this PR:
Or is it this: https://github.com/openshift/release/pull/69344/files#r2353972823 ? |
|
/jira refresh |
|
@lihongan: This pull request references Jira Issue OCPBUGS-61792, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: |
|
/retest-required |
|
/assign @alebedev87 |
Force-pushed from c55febc to 0eab17e
Force-pushed from 0eab17e to 3a97402
|
/pj-rehearse list |
|
@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse |
|
@lihongan: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/retest-required |
|
/retest |
|
@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
Force-pushed from 906be21 to 60da054
|
/pj-rehearse |
|
@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade |
|
@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
Force-pushed from 60da054 to 0f09c41
Configure all nodes to save coredumps, and collect any coredumps that were saved during the gather-core-dump step.
Force-pushed from 0f09c41 to a05a1fa
|
/pj-rehearse |
|
@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-azure-ovn-upgrade |
|
@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@rfredette: The following tests failed, say
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Several rehearse jobs timed out, but in all cases, the /pj-rehearse ack |
|
@rfredette: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@jluhrsen please take another look! I haven't been able to verify that this works in CI since we don't currently have a job that's segfaulting. That said, I've manually installed the generated machine configs, caused a segfault, and run the must-gather commands on a clusterbot cluster, so it should work. I'd like to merge this and test using my router PR, openshift/router#677, if possible. I need to merge one of the 2 PRs in order to test this properly in CI. The router one can negatively affect other PRs' CI runs with segfault-related failures, but even if there is some reason the core dump collection doesn't work as intended, the pj-rehearse runs indicate that it'll be an innocuous change. |
Do we have to specify the post chain from the given workflow? Is adding gather-core-dump alone to post not enough? Same question for the enable-node-coredumps pre step.
I did initially try just putting enable-node-coredumps in pre and gather-core-dump in post, but the tests failed in about 5 minutes with an error about missing some credentials. As I understand it, if you don't specify pre or post steps, there's a default based on the platform you choose, but specifying anything overrides that default.
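For illustration, the override behavior described above might look like the following in a ci-operator test entry. This is only a sketch: the test name, cluster profile, workflow, and the ipi-aws-pre / ipi-aws-post chain names are assumptions chosen for the example, and whether enable-node-coredumps and gather-core-dump are referenced as `ref` or `chain` is also assumed rather than taken from this PR.

```yaml
# Hypothetical ci-operator test entry; every name except the two
# coredump steps is an assumption made for illustration.
tests:
- as: e2e-aws-example              # hypothetical test name
  steps:
    cluster_profile: aws           # assumed platform
    workflow: openshift-e2e-aws    # assumed workflow
    # Specifying `pre` replaces the workflow's own pre steps, so the
    # platform's install chain has to be listed again explicitly.
    pre:
    - chain: ipi-aws-pre           # assumed install chain for this workflow
    - ref: enable-node-coredumps   # step discussed here; `ref` is an assumption
    # Specifying `post` likewise replaces the workflow's post steps.
    post:
    - chain: gather-core-dump      # step discussed here; `chain` is an assumption
    - chain: ipi-aws-post          # assumed teardown chain for this workflow
```

Listing only the two coredump steps would drop the install and teardown chains, which is consistent with the early credential failure described above.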
Ack. It would be hard for me to assess whether the additional pre and post chains are correct for every job, as I would have to go through all of them. So I trust Ryan on this one. LGTM
|
/approve I'll let Jame give the LGTM. |
|
/lgtm Based upon #69344 (comment) and @jluhrsen being out on PTO |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alebedev87, neisw, rfredette
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Merged 98eff7d into openshift:master
|
@rfredette: Jira Issue OCPBUGS-61792: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-61792 has been moved to the MODIFIED state. |
Configure all nodes to save coredumps, and collect any coredumps that were saved during the gather-core-dump step.
During the investigation for OCPBUGS-61224, it was found that haproxy in the router containers occasionally segfaulted. However, the coredumps from those segfaults weren't saved, so the cause of the segfault couldn't be determined. This change configures all nodes to save all coredumps, and any coredumps that were saved are copied to the artifacts directory at the end of the CI run. The router pod runs in the host namespace, so any crashes should be captured this way.
While this is intended to help debug router issues, it should also help in cases where crashes occur in any privileged pod.