From e63b16510bb52107afd5f8b37427fae945c14daa Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Fri, 19 Sep 2025 12:12:59 -0400 Subject: [PATCH] Docs updates based on helpdesk --- content/en/docs/_index.md | 94 +++- .../en/docs/architecture/branch-protection.md | 28 +- content/en/docs/architecture/ci-operator.md | 80 ++- content/en/docs/architecture/step-registry.md | 32 +- content/en/docs/getting-started/_index.md | 74 +++ content/en/docs/getting-started/concepts.md | 100 ++++ content/en/docs/getting-started/examples.md | 522 ++++++++++++++---- content/en/docs/getting-started/glossary.md | 201 +++++++ .../en/docs/getting-started/helpdesk-faq.md | 194 ++++++- .../docs/getting-started/quick-reference.md | 379 +++++++++++++ .../en/docs/getting-started/simple-example.md | 176 ++++++ .../getting-started/writing-first-test.md | 216 ++++++++ content/en/docs/how-tos/_index.md | 93 +++- .../how-tos/onboarding-a-new-component.md | 57 +- content/en/docs/troubleshooting/_index.md | 124 +++++ .../en/docs/troubleshooting/access-issues.md | 381 +++++++++++++ .../docs/troubleshooting/cluster-problems.md | 402 ++++++++++++++ .../troubleshooting/configuration-issues.md | 340 ++++++++++++ .../troubleshooting/debugging-failed-jobs.md | 289 ++++++++++ .../troubleshooting/job-execution-issues.md | 362 ++++++++++++ 20 files changed, 3984 insertions(+), 160 deletions(-) create mode 100644 content/en/docs/getting-started/concepts.md create mode 100644 content/en/docs/getting-started/glossary.md create mode 100644 content/en/docs/getting-started/quick-reference.md create mode 100644 content/en/docs/getting-started/simple-example.md create mode 100644 content/en/docs/getting-started/writing-first-test.md create mode 100644 content/en/docs/troubleshooting/_index.md create mode 100644 content/en/docs/troubleshooting/access-issues.md create mode 100644 content/en/docs/troubleshooting/cluster-problems.md create mode 100644 content/en/docs/troubleshooting/configuration-issues.md create mode 
100644 content/en/docs/troubleshooting/debugging-failed-jobs.md create mode 100644 content/en/docs/troubleshooting/job-execution-issues.md diff --git a/content/en/docs/_index.md b/content/en/docs/_index.md index 4a93c404..0822d68f 100644 --- a/content/en/docs/_index.md +++ b/content/en/docs/_index.md @@ -3,44 +3,98 @@ title: "Home" no_list: true --- +# OpenShift CI Documentation + +Welcome to the OpenShift CI documentation! This site helps you understand and use our CI/CD platform effectively. + +## 🎯 Quick Start Navigation + {{% card %}} -### Onboarding +### New to OpenShift CI? + +Start your journey here: +1. **[Core Concepts](/docs/getting-started/concepts)** - Understand the basics +2. **[Writing Your First Test](/docs/getting-started/writing-first-test)** - Hands-on tutorial +3. **[Examples](/docs/getting-started/examples)** - Common patterns and use cases -* [Onboarding a New Component for Testing and Merge Automation](/docs/how-tos/onboarding-a-new-component) -* [Testing Operators Built With The Operator SDK and Deployed Through OLM](/docs/how-tos/testing-operator-sdk-operators) -* [Contributing CI Configuration to the openshift/release Repository](/docs/how-tos/contributing-openshift-release) +**Popular tasks:** +- [Onboarding a New Component](/docs/how-tos/onboarding-a-new-component) +- [Creating a Test Pipeline](/docs/how-tos/creating-a-pipeline) +- [Using the Step Registry](https://steps.ci.openshift.org/) {{% /card %}} {{% card %}} -### Writing New Jobs +### Having Issues?
-* [Add a Job to TestGrid](/docs/how-tos/add-jobs-to-testgrid) -* [Adding and Changing Step Registry Content](/docs/how-tos/adding-changing-step-registry-content/) -* [Migrating CI Jobs from Templates to Multi-stage Tests](/docs/how-tos/migrating-template-jobs-to-multistage) -* [Set Up Slack Alerts for Periodic Job Results](/docs/how-tos/notification) -* [Using External Images in CI](/docs/how-tos/external-images) +Our comprehensive troubleshooting guide covers: +- **[Job Won't Start](/docs/troubleshooting/job-execution-issues)** - Triggering and scheduling problems +- **[Test Failures](/docs/troubleshooting/debugging-failed-jobs)** - Debug failed tests +- **[Cluster Issues](/docs/troubleshooting/cluster-problems)** - Installation and access problems +- **[Configuration Errors](/docs/troubleshooting/configuration-issues)** - YAML and setup issues + +**[View Complete Troubleshooting Guide →](/docs/troubleshooting/)** {{% /card %}} +{{% card %}} +### Common How-To's + +**Testing & Development** +- [Add a Job to TestGrid](/docs/how-tos/add-jobs-to-testgrid) +- [Set Up Slack Notifications](/docs/how-tos/notification) +- [Using External Images in CI](/docs/how-tos/external-images) +- [Multi-Architecture Testing](/docs/how-tos/multi-architecture) + +**Configuration & Setup** +- [Adding Step Registry Content](/docs/how-tos/adding-changing-step-registry-content/) +- [Adding a New Secret to CI](/docs/how-tos/adding-a-new-secret-to-ci) +- [Migrating to Multi-stage Tests](/docs/how-tos/migrating-template-jobs-to-multistage) + +**[Browse All How-To Guides →](/docs/how-tos/)** +{{% /card %}} {{% card %}} -### Working with ART +### Architecture & Deep Dives -* [Private Repositories and Fixing Embargoed CVEs](/docs/architecture/private-repositories) -* [OCP Builder Images](/docs/architecture/images) +**Core Components** +- [Multi-Stage Tests and Step Registry](/docs/architecture/step-registry) +- [CI Operator](/docs/architecture/ci-operator) - The test orchestration engine +- 
[Job Timeouts and Interruptions](/docs/architecture/timeouts) + +**OpenShift Integration** +- [Private Repositories and CVEs](/docs/architecture/private-repositories) +- [OCP Builder Images](/docs/architecture/images) +- [Branch Protection](/docs/architecture/branch-protection) {{% /card %}} {{% card %}} -### Architecture and References +### Release & Quality + +**Release Management** +- [Centralized Release Branching](/docs/architecture/branching) +- [Extending Release Gates](/docs/architecture/release-gating) +- [Component Readiness](/docs/release-oversight/component-readiness) -* [Multi-Stage Tests and the Test Step Registry](/docs/architecture/step-registry) -* [Handling Job Timeouts and Interruptions](/docs/architecture/timeouts) -* [CI Operator](/docs/architecture/ci-operator) +**Working with ART** +- [The Technical Release Team](/docs/release-oversight/the-technical-release-team) +- [Payload Testing](/docs/release-oversight/payload-testing) +- [Improving CI Signal](/docs/release-oversight/improving-ci-signal) {{% /card %}} {{% card %}} -### Branching and Release -* [Centralized Release Branching and Config Management](/docs/architecture/branching) -* [Extending OpenShift Release Gates](/docs/architecture/release-gating) +### Get Help + +**Self-Service Resources** +- 📚 [FAQ](/docs/getting-started/helpdesk-faq) - Frequently asked questions +- 🔍 [Search CI Logs](https://search.ci.openshift.org/) - Find error messages +- 📊 [Sippy](https://sippy.dptools.openshift.org/) - Job health dashboard +- 🔗 [Useful Links](/docs/getting-started/useful-links) - Tools and services + +**Contact Us** +- 💬 Slack: `#forum-ocp-testplatform` +- 🎫 [File a Jira Ticket](https://issues.redhat.com/projects/DPTP) +- 📒 Announcements: `#announce-testplatform` + +**Emergency?** Use the "Report CI Outage" workflow in Slack. {{% /card %}}
diff --git a/content/en/docs/architecture/branch-protection.md b/content/en/docs/architecture/branch-protection.md index c29bfaf2..e6e68fb0 100644 --- a/content/en/docs/architecture/branch-protection.md +++ b/content/en/docs/architecture/branch-protection.md @@ -4,8 +4,14 @@ description: An overview of the integration between OpenShift CI and GitHub bran --- ## What Is Branch Protection? -Branch protection is a repository setting that enforces certain rules when trying to merge to a protected branch. See -the [GitHub documentation](https://docs.github.com/en/free-pro-team@latest/github/administering-a-repository/about-protected-branches) +Branch protection is a GitHub feature that helps maintain code quality by preventing direct pushes to important branches (like `main` or `master`). Instead, changes must go through pull requests and pass required checks. + +Think of it as a safety gate - you can't accidentally break the main branch because: +- All changes must be reviewed +- Required tests must pass +- The code must be up-to-date with the target branch + +See the [GitHub documentation](https://docs.github.com/en/free-pro-team@latest/github/administering-a-repository/about-protected-branches) for more details. {{< alert title="Note" color="info" >}} @@ -34,15 +40,27 @@ blocking](https://docs.prow.k8s.io/docs/jobs#requiring-jobs-for-auto-merge-throu ## How Do We Set up Branch Protection? +Good news: OpenShift CI handles branch protection automatically for you! + +### Automatic Setup + Branch Protection requires no manual set-up for repositories with presubmit jobs. When a mandatory presubmit job is added or removed from a repository, automation will ensure that the GitHub Branch Protection settings are updated as well. -Branch protection is configured by a job that runs periodically every six hours. You can see -[here](https://prow.ci.openshift.org/?job=periodic-branch-protector) when it last ran. The implication of this is that +### How It Works + +1. 
You add a required test to your CI configuration +2. A periodic job (runs every 6 hours) detects the change +3. Branch protection rules are automatically updated in GitHub +4. Your required tests are now enforced before merging + +You can see [here](https://prow.ci.openshift.org/?job=periodic-branch-protector) when the updater last ran. The implication of this is that when you add/remove a mandatory job, it may take up to six hours for this change to show up in GitHub. -The `openshift-merge-robot` that configures the branch protection needs admin permissions. +### Prerequisites + +The `openshift-merge-robot` that configures the branch protection needs admin permissions on your repository. See the [onboarding guide]({{< ref "../how-tos/onboarding-a-new-component" >}}) for details. ## Is It Possible to Disable the Branch Protection for My Repository or Require Jobs That Are Not Managed by Prow? diff --git a/content/en/docs/architecture/ci-operator.md b/content/en/docs/architecture/ci-operator.md index 34e983b3..db15259f 100644 --- a/content/en/docs/architecture/ci-operator.md +++ b/content/en/docs/architecture/ci-operator.md @@ -10,6 +10,20 @@ installed. `ci-operator` hides the complexity of assembling an ephemeral OpenShi of end-to-end test suites to focus on the content of their tests and not the infrastructure required for cluster setup and installation. +## Key Concepts for New Users + +Before diving into the details, let's clarify some important terms: + +- **Release Payload**: A bundle of container images that together form a complete OpenShift installation. Think of it as all the pieces needed to install and run OpenShift. +- **ImageStream**: An OpenShift resource that tracks different versions of container images. It's like a catalog of available image versions. +- **ImageStreamTag**: A specific version within an ImageStream, referenced as `stream:tag` (e.g., `pipeline:src`). +- **Namespace**: An isolated workspace in OpenShift where your CI job runs. 
Each job gets its own namespace to prevent interference. +- **Build**: An OpenShift process that creates container images from source code. + +For more definitions, see the [Glossary]({{< ref "../getting-started/glossary" >}}). + +## How CI Operator Works + `ci-operator` allows for components that make up an OpenShift release to be tested together by allowing each component repository to test with the latest published versions of all other components. An integration stream of container `images` is maintained with the latest tested versions of every component. A test for any one component snapshots that stream, replaces any `images` that are @@ -36,10 +50,16 @@ to fulfill their intent. When `ci-operator` runs tests to verify proposed changes in a pull request to a component repository, it must first build the output artifacts from the repository. In order to generate these builds, `ci-operator` needs to know the inputs from which they -will be created. A number of inputs can be configured; the following example provides both: +will be created. -* `base_images`: provides a mapping of named `ImageStreamTags` which will be available for use in container image builds -* `build_root`: defines the `ImageStreamTag` in which dependencies exist for building executables and non-image artifacts +Think of inputs as the "ingredients" your CI job needs: +- What container images contain your build tools? +- What base images should your application be built on top of? 
+ +A number of inputs can be configured; the following example provides both: + +* `base_images`: provides a mapping of named images that will be available for use in container image builds (like base operating system images or images with specific tools) +* `build_root`: defines the image that contains all the compilers, tools, and dependencies needed to build your project (like a Go compiler for Go projects) `ci-operator` configuration: @@ -60,8 +80,19 @@ build_root: # declares that the release:golang-1.13 image has the build-time dep tag: "golang-1.13" {{< / highlight >}} +### Understanding the Configuration + +In the example above: +- The `base` image (`ocp/4.5:base`) is a minimal operating system image that other images can be built on top of +- The `cli` image (`ocp/4.5:cli`) contains the OpenShift command-line tools +- The `build_root` image (`openshift/release:golang-1.13`) contains the Go compiler and related tools + +These images are referenced using the format `namespace/name:tag`. The CI system will fetch these specific versions and make them available to your build process. + +### How Image References Work + As `ci-operator` is an OpenShift-native tool, all image references take the form of an `ImageStreamTag` on the build farm cluster, -not just a valid pull-spec for an image. `ci-operator` will import these `ImageStreamTags` into the `Namespace` created for the test +not just a regular container image URL (pull-spec). `ci-operator` will import these `ImageStreamTags` into the `Namespace` created for the test workflow; snapshotting the current state of inputs to allow for reproducible builds. If an image that is required for building is not yet present on the cluster, either: @@ -109,16 +140,22 @@ build_root_image: ## Building Artifacts +Once `ci-operator` has the build environment ready, it needs to actually build your project. This happens in stages, with each stage creating a new container image that builds on the previous one. 
+ +### The Build Pipeline + Starting `FROM` the image described as the `build_root`, `ci-operator` will clone the repository under test and compile artifacts, committing them as image layers that may be referenced in derivative builds. The commands which are run to compile artifacts are configured with `binary_build_commands` and are run in the root of the cloned repository. A separate set of commands, `test_binary_build_commands`, can be configured for building artifacts to support test -execution. The following `ImageStreamTags` are created in the test's `Namespace` +execution. -* `pipeline:root`: imports or builds the `build_root` image -* `pipeline:src`: clones the code under test `FROM pipeline:root` -* `pipeline:bin`: runs commands in the cloned repository to build artifacts `FROM pipeline:src` -* `pipeline:test-bin`: runs a separate set of commands in the cloned repository to build test artifacts `FROM pipeline:src` +Here's the sequence of images that get created (called the "pipeline"): + +* `pipeline:root`: imports or builds the `build_root` image (your build environment) +* `pipeline:src`: clones your repository's code into the `root` image +* `pipeline:bin`: runs your build commands (like `make build`) to create compiled artifacts +* `pipeline:test-bin`: optionally runs different commands to build test-specific artifacts `ci-operator` configuration: @@ -127,8 +164,17 @@ binary_build_commands: "go build ./cmd/..." # these commands are run to test_binary_build_commands: "go test -c -o mytests" # these commands are run to build "pipeline:test-bin" {{< / highlight >}} +### Understanding the Build Flow + +Here's what happens when these commands run: + +1. **Setup**: The `pipeline:root` image (your Go development environment) is prepared +2. **Clone**: Your repository is cloned into this environment, creating `pipeline:src` +3. **Build**: The `binary_build_commands` run (e.g., `go build ./cmd/...`), creating `pipeline:bin` with your compiled binaries +4. 
**Test Build** (optional): The `test_binary_build_commands` run, creating `pipeline:test-bin` with test executables + The content created with these OpenShift `Builds` is addressable in the `ci-operator` configuration simply with the tag. For -instance, the `pipeline:bin` image can be referenced as `bin` when the content in that image is needed in derivative `Builds`. +instance, the `pipeline:bin` image can be referenced as just `bin` when the content in that image is needed in derivative `Builds`. ### Using the Build Cache @@ -157,9 +203,17 @@ was built off of the same build root image as would otherwise be imported. That Once container `images` exist with output artifacts for a repository, additional output container `images` may be built that make use of those artifacts. Commonly, the desired output container image will contain only the executables for a -component and not any of the build-time dependencies. Furthermore, most teams will need to publish their output -container `images` through the automated release pipeline, which requires that the `images` are built in Red Hat's -production image build system, OSBS. In order to create an output container image without build-time dependencies in a +component and not any of the build-time dependencies. + +### Why Multi-Stage Dockerfiles? + +Your compiled application doesn't need all the build tools (compilers, build scripts, etc.) to run - it just needs the final binary. Multi-stage Dockerfiles let you: +1. Build your application in an image with all the development tools +2. Copy just the compiled binary to a minimal runtime image +3. Ship a smaller, more secure image + +Furthermore, most teams will need to publish their output container `images` through the automated release pipeline, which requires that the `images` are built in Red Hat's +production image build system, OSBS (OpenShift Build Service). 
In order to create an output container image without build-time dependencies in a manner which is compatible with OSBS, the simplest approach is a multi-stage `Dockerfile` build. The standard pattern for a multi-stage `Dockerfile` is to run a compilation in a builder image and copy the resulting diff --git a/content/en/docs/architecture/step-registry.md b/content/en/docs/architecture/step-registry.md index af4fe69a..af280305 100644 --- a/content/en/docs/architecture/step-registry.md +++ b/content/en/docs/architecture/step-registry.md @@ -8,19 +8,41 @@ These individual steps can be put into a shared registry that other tests can ac upgrade as multiple test workflows can share steps and don’t have to each be updated individually to fix bugs or add new features. It also reduces the chances of a mistake when copying a feature from one test workflow to another. +## Why Multi-Stage Tests? + +Think of multi-stage tests like building with LEGO blocks: +- **Traditional approach**: Write one big script that does everything (install cluster, run tests, collect logs, cleanup) +- **Multi-stage approach**: Use pre-built, tested components that you can mix and match + +Benefits: +- **Reusability**: Someone already wrote the "install AWS cluster" step - just use it! +- **Maintainability**: When AWS installation needs an update, fix it once, and all tests benefit +- **Flexibility**: Mix different cluster types, test suites, and configurations easily + The current step registry is available for browsing [here](https://steps.ci.openshift.org/). +## Building Blocks + To understand how the multistage tests and registry work, we must first talk about the three components of the test registry and how to use those components to create a test: -* [Step](#step): A step is the lowest level component in the test step registry. It describes an individual test step. -* [Chain](#chain): A chain is a registry component that specifies multiple steps to be run. 
Any item of the chain can be either a step or another chain. -* [Workflow](#workflow): A workflow is the highest level component of the step registry. It contains three chains: pre, test, post. +* [Step](#step): A step is the lowest level component in the test step registry. It describes an individual test step (like "install cluster" or "run e2e tests"). +* [Chain](#chain): A chain is a registry component that specifies multiple steps to be run in sequence. Any item of the chain can be either a step or another chain. +* [Workflow](#workflow): A workflow is the highest level component of the step registry. It defines a complete test scenario with three phases: pre (setup), test (main tests), and post (cleanup). ## Step -A step is the lowest level component in the test registry. A step defines a base container image, the filename of the shell script to run inside the -container, the resource requests and limits for the container, and documentation for the step. Example of a step: +A step is the lowest level component in the test registry. Think of it as a single action in your test process - like "install a cluster" or "run conformance tests" or "collect logs". + +### What Makes Up a Step? + +A step defines: +- **Container image**: Where the step runs (what tools are available) +- **Script**: What commands to execute +- **Resources**: How much CPU/memory the step needs +- **Documentation**: What the step does and how to use it + +Here's an example of a step configuration: {{< highlight yaml >}} ref: diff --git a/content/en/docs/getting-started/_index.md b/content/en/docs/getting-started/_index.md index c1e54554..13cc97a8 100644 --- a/content/en/docs/getting-started/_index.md +++ b/content/en/docs/getting-started/_index.md @@ -2,4 +2,78 @@ title: "Getting Started" linkTitle: "Getting started" weight: 1 +description: > + Start here to learn about OpenShift CI and how to use it effectively --- + +# Getting Started with OpenShift CI + +Welcome to OpenShift CI! 
This section will guide you through understanding and using our CI platform. + +## Learning Path + +We recommend following this path to get started: + +### 1. [Core Concepts]({{< ref "concepts" >}}) +Start here to understand the fundamental components and terminology of OpenShift CI. This page covers: +- What is Prow and ci-operator? +- How do jobs and steps work? +- What are cluster profiles and the step registry? + +### 2. [Writing Your First Test]({{< ref "writing-first-test" >}}) +A hands-on tutorial that walks you through creating your first CI test, from simple container tests to complex multi-stage workflows. + +### 3. [Simple CI Example]({{< ref "simple-example" >}}) +A complete, annotated example of CI configuration for a Go project with detailed explanations. + +### 4. [Examples]({{< ref "examples" >}}) +Practical examples of common CI configurations: +- Running e2e tests on different cloud providers +- Building and testing container images +- Using shared test components + +### 5. [Useful Links]({{< ref "useful-links" >}}) +Quick reference to important resources: +- CI cluster access and dashboards +- Slack channels and support +- Common tools and services + +### 6. [FAQ]({{< ref "helpdesk-faq" >}}) +Frequently asked questions from the community, automatically updated from our Slack discussions. + +### 7. [Glossary]({{< ref "glossary" >}}) +Definitions of common terms and concepts used throughout OpenShift CI documentation. 
+ +## Quick Start Checklist + +If you're completely new to OpenShift CI: + +- [ ] Read the [Core Concepts]({{< ref "concepts" >}}) page +- [ ] Join `#forum-ocp-testplatform` on Slack +- [ ] Follow the [Writing Your First Test]({{< ref "writing-first-test" >}}) tutorial +- [ ] Explore the [Step Registry](https://steps.ci.openshift.org/) +- [ ] Review existing [CI configurations](https://github.com/openshift/release/tree/master/ci-operator/config) for examples + +## Common Next Steps + +After getting started, you might want to: + +- [Onboard a new component]({{< ref "../how-tos/onboarding-a-new-component" >}}) to OpenShift CI +- [Add your jobs to TestGrid]({{< ref "../how-tos/add-jobs-to-testgrid" >}}) for monitoring +- [Configure notifications]({{< ref "../how-tos/notification" >}}) for test results +- [Set up release gating]({{< ref "../architecture/release-gating" >}}) for your component + +## Getting Help + +When you need assistance: + +1. **Check the documentation** - This site and the linked resources +2. **Search Slack history** - Many questions have been answered before +3. **Ask in Slack** - Use `#forum-ocp-testplatform` for general questions +4. **File a Jira ticket** - For bugs or feature requests (use the Slack workflows) + +## Contributing + +Found an issue with the documentation? Want to add an example? We welcome contributions! +- [Documentation source](https://github.com/openshift/ci-docs) +- [CI configurations](https://github.com/openshift/release) diff --git a/content/en/docs/getting-started/concepts.md b/content/en/docs/getting-started/concepts.md new file mode 100644 index 00000000..b3d2a5dc --- /dev/null +++ b/content/en/docs/getting-started/concepts.md @@ -0,0 +1,100 @@ +--- +title: "Core Concepts" +description: Understanding the fundamental concepts of OpenShift CI +weight: 1 +keywords: ["concepts", "prow", "ci-operator", "basics", "introduction", "getting started", "fundamentals"] +--- + +# Core Concepts + +Welcome to OpenShift CI! 
Before diving into examples and tutorials, it's important to understand the key concepts that make up our CI platform. + +## Overview + +OpenShift CI is a Kubernetes-native CI/CD system that tests OpenShift components and other projects. It's built on top of [Prow](https://docs.prow.k8s.io/), a Kubernetes-based CI/CD system developed by the Kubernetes community. + +## Key Components + +### Prow +Prow is the foundation of OpenShift CI. It handles: +- **GitHub integration**: Responds to pull requests, issues, and merges +- **Job scheduling**: Decides when and where to run CI jobs +- **Result reporting**: Comments on PRs with test results + +Common Prow interactions you'll see: +- `/retest` - Retriggers failed tests +- `/test ` - Runs a specific test job +- `/hold` - Prevents automatic merging + +### ci-operator +`ci-operator` is the OpenShift-specific component that knows how to: +- Build container images from your source code +- Create ephemeral OpenShift clusters for testing +- Run your tests in a consistent environment +- Clean up resources after tests complete + +Think of it as the "brain" that understands OpenShift's build and test requirements. 
+ +### Jobs and Steps + +**Jobs** are the individual CI tasks that run on your code: +- **Presubmit jobs**: Run on pull requests before merging +- **Postsubmit jobs**: Run after code is merged +- **Periodic jobs**: Run on a schedule + +**Steps** are the building blocks of jobs: +- Each step runs in its own container +- Steps can share data through a shared directory +- Steps are reusable across different jobs + +### The Step Registry + +The [Step Registry](https://steps.ci.openshift.org/) is a library of reusable test components: +- **Steps**: Individual test actions (e.g., "install cluster", "run e2e tests") +- **Chains**: Sequences of steps that are commonly used together +- **Workflows**: Complete test scenarios combining pre, test, and post chains + +### Cluster Profiles + +Cluster profiles define where and how to create test clusters: +- `aws`: Creates clusters on Amazon Web Services +- `gcp`: Creates clusters on Google Cloud Platform +- `azure`: Creates clusters on Microsoft Azure +- And many more... + +Each profile includes the necessary credentials and configuration for that platform. + +## How It All Works Together + +Here's a simplified flow of what happens when you push code to a pull request: + +1. **GitHub** notifies **Prow** about the new commits +2. **Prow** looks up which jobs should run for your repository +3. For each job, **Prow** creates a Pod that runs **ci-operator** +4. **ci-operator** reads your job configuration and: + - Builds any necessary container images + - Sets up test infrastructure (like OpenShift clusters) + - Runs your test steps in order + - Collects logs and artifacts +5. 
**Prow** reports the results back to your pull request + +## Common Terms You'll Encounter + +- **Release payload**: A set of container images that make up an OpenShift release +- **Promotion**: Publishing your container images for use by other components +- **Rehearsal**: Testing CI configuration changes before they're applied +- **Lease**: Cloud quota reservation for running tests +- **Must-gather**: Diagnostic data collection from failed tests + +## What's Next? + +Now that you understand the basic concepts: +- Check out [Examples]({{< ref "examples" >}}) for common CI configurations +- Learn about [Writing Your First Test]({{< ref "writing-first-test" >}}) +- Explore the [Step Registry](https://steps.ci.openshift.org/) for reusable components + +## Need Help? + +- Join `#forum-ocp-testplatform` on Slack for questions +- Check the [FAQ]({{< ref "helpdesk-faq" >}}) for common issues +- Review the [Troubleshooting Guide]({{< ref "../troubleshooting/_index.md" >}}) for debugging help \ No newline at end of file diff --git a/content/en/docs/getting-started/examples.md b/content/en/docs/getting-started/examples.md index b492ab8b..cc47232d 100644 --- a/content/en/docs/getting-started/examples.md +++ b/content/en/docs/getting-started/examples.md @@ -4,160 +4,474 @@ description: Examples of common tasks in CI configuration. weight: 2 --- -# How do I add a job that runs the OpenShift end-to-end conformance suite on AWS? +# CI Configuration Examples -Use the [`openshift-e2e-aws`](https://steps.ci.openshift.org/workflow/openshift-e2e-aws) workflow and set -`cluster_profile` to `aws`. +This page provides practical examples for common CI configuration scenarios. Each example includes the full configuration and explanation. 
-{{< highlight yaml >}} -- as: e2e-steps - steps: - cluster_profile: aws - workflow: openshift-e2e-aws -{{< / highlight >}} +## Table of Contents +- [Basic Testing](#basic-testing) +- [End-to-End Testing](#end-to-end-testing) +- [Building and Testing Images](#building-and-testing-images) +- [Cross-Repository Testing](#cross-repository-testing) +- [Advanced Patterns](#advanced-patterns) -# How do I write a simple "Execute this command in a container" test? +## Basic Testing -Use a container test. Container tests are set up to always contain the source code, either by -explicitly cloning it if they use an image that is in `base_images` or implicitly if they reference -a `pipeline` image like `src`. +### Simple Unit Test +Run unit tests without needing a cluster: + +{{< highlight yaml >}} +tests: +- as: unit + commands: | + echo "Running unit tests..." + go test -race -cover -v ./pkg/... + container: + from: src +{{< /highlight >}} + +### Linting and Code Quality +Use external tools for code quality checks: {{< highlight yaml >}} base_images: golangci-lint: namespace: ci name: golangci-lint - tag: v1.37.1 + tag: v1.54.2 + tests: - as: lint - commands: golangci-lint run ./... + commands: | + echo "Running linters..." + golangci-lint run --timeout=10m container: from: golangci-lint - clone: true # Defaults to "true", set to "false" if you do not want your source code to be present. + clone: true # Clone source code into the linter image + +- as: verify-deps + commands: | + echo "Verifying dependencies..." + go mod tidy + go mod vendor + git diff --exit-code + container: + from: src {{< /highlight >}} -# How do I use an image from another repo in my repo’s tests? +### Splitting Tests Across Steps +Break a large suite into separate steps, each with its own resource requests (the steps of one test run sequentially; use separate `tests` entries for suites that must run concurrently): + +{{< highlight yaml >}} +tests: +- as: unit-by-package + steps: + test: + - as: test-pkg-api + commands: go test -v ./pkg/api/... 
+ from: src + resources: + requests: + cpu: 2 + memory: 4Gi + - as: test-pkg-controller + commands: go test -v ./pkg/controller/... + from: src + resources: + requests: + cpu: 2 + memory: 4Gi + - as: test-pkg-util + commands: go test -v ./pkg/util/... + from: src + resources: + requests: + cpu: 1 + memory: 2Gi +{{< /highlight >}} -In order to use an image from one repository in the tests of another, it is necessary to first publish the image from -the producer repository and import it in the consumer repository. Generally, a central `ImageStream` is used for -continuous integration; a repository opts into using an integration stream with the `releases.integration` field in the -`ci-operator` configuration and opts into publishing to the stream with the `promotion` field. +## End-to-End Testing -## Publishing an Image For Reuse +### Basic E2E on AWS +{{< highlight yaml >}} +tests: +- as: e2e-aws + steps: + cluster_profile: aws + workflow: openshift-e2e-aws +{{< /highlight >}} -When configuring `ci-operator` for a repository, the `promotion` stanza declares which container `images` are published and -defines the integration `ImageStream` where they will be available. By default, all container `images` declared in the -`images` block of a `ci-operator` configuration are published when a `promotion` stanza is present to define the integration -`ImageStream`. Promotion can be furthermore configured to include other `images`, as well, although promotion should be avoided unless there is an expectation of external consumption. For example, do not publish images with [the `io.openshift.release.operator` label](../../how-tos/onboarding-a-new-component/#product-builds-and-becoming-part-of-an-openshift-release) unless they should be included in OpenShift release images. 
+### E2E with Custom Test Suite +Run your own e2e tests on a cluster: -In the following `ci-operator` configuration, the following `images` are promoted for reuse by other repositories to the `ocp/4.4` integration `ImageStream`: +{{< highlight yaml >}} +tests: +- as: e2e-custom + steps: + cluster_profile: aws + test: + - as: run-my-tests + commands: | + echo "Running custom e2e tests..." + export KUBECONFIG=${KUBECONFIG} + + # Wait for cluster to be ready + oc wait --for=condition=Ready nodes --all --timeout=300s + + # Run your test suite + make test-e2e + from: src + resources: + requests: + cpu: 100m + memory: 200Mi + workflow: ipi-aws # Handles cluster creation/destruction +{{< /highlight >}} -* the `pipeline:src` tag, published as `ocp/4.4:repo-scripts` containing the latest version of the repository to allow for executing helper scripts. -* the `pipeline:test-bin` tag, published as `ocp/4.4:repo-tests` containing built test binaries to allow for running the repository's tests -* the `stable:component` tag, published as `ocp/4.4:component` containing the component itself to allow for deployments and installations in end-to-end scenarios +### E2E with Operator Deployment +Deploy and test an operator: -`ci-operator` configuration: {{< highlight yaml >}} -test_binary_build_commands: go test -race -c -o e2e-tests # will create the test-bin tag -promotion: - to: - - additional_images: - repo-scripts: src # promotes "src" as "repo-scripts" - repo-tests: test-bin # promotes "test-bin" as "repo-tests" - namespace: ocp - name: 4.4 -images: -- from: ubi8 - to: component # promotes "component" by default - context_dir: images/component -{{< / highlight >}} +tests: +- as: e2e-operator + steps: + cluster_profile: aws + test: + - as: install-operator + commands: | + # Create namespace + oc create namespace my-operator + + # Install CRDs + oc apply -f deploy/crds/ + + # Deploy operator + oc apply -f deploy/ -n my-operator + + # Wait for deployment + oc wait 
--for=condition=Available \ + deployment/my-operator \ + -n my-operator \ + --timeout=300s + from: src + resources: + requests: + cpu: 100m + memory: 200Mi + - as: test-operator + commands: | + # Create test resources + oc apply -f test/e2e/resources/ + + # Run operator e2e tests + go test -v ./test/e2e/... \ + -kubeconfig=${KUBECONFIG} \ + -namespace=my-operator-test + from: src + resources: + requests: + cpu: 1 + memory: 2Gi + workflow: ipi-aws +{{< /highlight >}} -## Consuming an Image +### Upgrade Testing +Test upgrades between OpenShift versions: -Once a repository is publishing an image for reuse by others, downstream users can configure `ci-operator` to use that -image in tests by including it as a base_image or as part of the `releases`. In general, `images` will be available -as part of the `releases` and explicitly including them as a base_image will only be necessary if the promoting -repository is exposing them to a non-standard `ImageStream`. Regardless of which workflow is used to consume the image, -the resulting tag will be available under the stable `ImageStream`. 
The following `ci-operator` configuration imports a -number of `images`: +{{< highlight yaml >}} +releases: + initial: + release: + channel: stable + version: "4.13" + latest: + release: + channel: stable + version: "4.14" -* the `stable:custom-scripts` tag, published as `myregistry.com/project/custom-scripts:latest` -* the `stable:component` and `:repo-{scripts|tests}` tags, by virtue of them being published under `ocp/4.4` and brought in with the `releases` +tests: +- as: e2e-upgrade + steps: + cluster_profile: aws + workflow: openshift-upgrade-aws +{{< /highlight >}} + +## Building and Testing Images -`ci-operator` configuration: +### Build Multiple Images {{< highlight yaml >}} base_images: - custom-scripts: - namespace: project - name: custom-scripts + base: + namespace: ocp + name: "4.14" + tag: base + +images: +- from: base + to: controller + dockerfile_path: build/controller.Dockerfile +- from: base + to: webhook + dockerfile_path: build/webhook.Dockerfile +- from: base + to: cli + dockerfile_path: build/cli.Dockerfile + +tests: +- as: verify-images + commands: | + # Test controller image + podman run --rm ${IMAGE_FORMAT}/controller:latest --version + + # Test webhook image + podman run --rm ${IMAGE_FORMAT}/webhook:latest --help + + # Test CLI image + podman run --rm ${IMAGE_FORMAT}/cli:latest version + container: + from: src +{{< /highlight >}} + +### Multi-Stage Build Testing +Test images built with multi-stage Dockerfiles: + +{{< highlight yaml >}} +binary_build_commands: make build + +images: +- dockerfile_literal: | + FROM registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.20-openshift-4.14 AS builder + WORKDIR /go/src/github.com/org/repo + COPY . . 
+ RUN make build + + FROM registry.ci.openshift.org/ocp/4.14:base + COPY --from=builder /go/src/github.com/org/repo/bin/app /usr/bin/ + ENTRYPOINT ["/usr/bin/app"] + from: base + inputs: + bin: + as: + - registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.20-openshift-4.14 + to: my-app + +promotion: + to: + - namespace: my-namespace + name: my-app tag: latest -releases: - latest: - integration: - namespace: ocp - name: 4.4 -{{< / highlight >}} +{{< /highlight >}} -Once the image has been configured to be an input for the repository's tests in the `ci-operator` configuration, either -explicitly as a `base_image` or implicitly as part of the `releases`, it can be used in tests in one of two ways. A -registry step can be written to execute the shared tests in any `ci-operator` configuration, or a literal test step can be -added just to one repository's configuration to run the shared tests. Two examples follow which add an execution of -shared end-to-end tests using these two approaches. Both examples assume that we have the ipi workflow available to use. +## Cross-Repository Testing + +### Using Images from Another Repository +Test with images published by another repository: + +{{< highlight yaml >}} +base_images: + another-component: + namespace: ocp + name: "4.14" + tag: another-component -### Adding a Reusable Test Step +tests: +- as: integration-test + commands: | + # Start the other component + podman run -d --name other ${ANOTHER_COMPONENT_IMAGE} + + # Run integration tests against it + export OTHER_COMPONENT_URL=http://localhost:8080 + make test-integration + + # Cleanup + podman stop other + container: + from: src + dependencies: + - name: another-component + env: ANOTHER_COMPONENT_IMAGE +{{< /highlight >}} -Full directions for adding a new reusable test step can be found in the overview for [new registry -content](/docs/how-tos/adding-changing-step-registry-content/#adding-content). An example of the process is provided -here. 
First, `make` directory for the test step in the registry: `ci-operator/step-registry/org/repo/e2e`. +### Testing Multiple Repositories Together +Using a shared workflow to test multiple components: -Then, declare a reusable step: `ci-operator/step-registry/org/repo/e2e/org-repo-e2e-ref.yaml` {{< highlight yaml >}} +# In repo A +tests: +- as: cross-repo-test + steps: + cluster_profile: aws + test: + - ref: deploy-component-a + - ref: deploy-component-b # From shared registry + - ref: run-integration-tests + workflow: openshift-e2e-aws + +# In the step registry ref: - as: org-repo-e2e - from: repo-tests - commands: org-repo-e2e-commands.sh + as: deploy-component-b + from: component-b-image + commands: | + oc apply -f https://raw.githubusercontent.com/org/component-b/main/deploy/ + oc wait --for=condition=Available deployment/component-b resources: requests: - cpu: 1000m - memory: 100Mi - documentation: |- - Runs the end-to-end suite published by org/repo. -{{< / highlight >}} + cpu: 100m + memory: 200Mi +{{< /highlight >}} -Finally, populate a command file for the step: `ci-operator/step-registry/org/repo/e2e/org-repo-e2e-commands.sh` -{{< highlight bash >}} -#!/bin/bash -e2e-tests # as built by go test -c -{{< / highlight >}} +## Advanced Patterns -Now the test step is ready for use by any repository. 
To `make` use of it, update `ci-operator` configuration for a separate -repository under `ci-operator/config/org/other/org-other-master.yaml`: +### Conditional Testing Based on Changes +Only run expensive tests when relevant files change: {{< highlight yaml >}} -- as: org-repo-e2e +tests: +# Always run unit tests +- as: unit + commands: make test-unit + container: + from: src + +# Only run integration tests if source code changes +- as: integration + run_if_changed: "^(pkg|cmd)/" + commands: make test-integration + container: + from: src + +# Only run e2e if APIs or controllers change +- as: e2e-aws + optional: true # Don't block merge + run_if_changed: "^(api|controllers|deploy)/" steps: cluster_profile: aws - workflow: ipi + workflow: openshift-e2e-aws +{{< /highlight >}} + +### Testing with External Services +Using secrets to test with external services: + +{{< highlight yaml >}} +tests: +- as: external-integration + steps: test: - - ref: org-repo-e2e -{{< / highlight >}} + - as: test-with-database + credentials: + - namespace: test-credentials + name: database-credentials + mount_path: /var/run/secrets/database + commands: | + # Read credentials + export DB_HOST=$(cat /var/run/secrets/database/host) + export DB_USER=$(cat /var/run/secrets/database/user) + export DB_PASS=$(cat /var/run/secrets/database/password) + + # Run tests against external database + make test-database-integration + from: src + resources: + requests: + cpu: 500m + memory: 1Gi +{{< /highlight >}} + +### Platform-Specific Testing +Run tests on multiple cloud providers: + +{{< highlight yaml >}} +tests: +- as: e2e-aws + steps: + cluster_profile: aws + env: + FEATURE_SET: TechPreviewNoUpgrade + workflow: openshift-e2e-aws + +- as: e2e-gcp + steps: + cluster_profile: gcp + env: + FEATURE_SET: TechPreviewNoUpgrade + workflow: openshift-e2e-gcp -### Adding a Literal Test Step +- as: e2e-azure + steps: + cluster_profile: azure4 + env: + FEATURE_SET: TechPreviewNoUpgrade + workflow: 
openshift-e2e-azure +{{< /highlight >}} + +### Periodic Regression Testing +Set up nightly tests with notifications: -`ci-operator` configuration: {{< highlight yaml >}} -- as: repo-e2e +tests: +- as: nightly-regression + cron: "0 2 * * *" # 2 AM UTC daily steps: cluster_profile: aws - workflow: ipi test: - - as: e2e - from: repo-tests - commands: |- - #!/bin/bash - e2e-tests # as built by go test -c + - as: deploy-latest + commands: | + # Deploy latest development version + oc apply -f https://raw.githubusercontent.com/org/repo/main/deploy/dev/ + from: src + - as: run-full-suite + commands: | + # Run comprehensive test suite + make test-regression + from: src + timeout: 3h0m0s + post: + - as: send-results + commands: | + # Process and send results (if using a notification system) + if [ -f ${ARTIFACT_DIR}/junit.xml ]; then + python3 scripts/send-test-report.py \ + --junit ${ARTIFACT_DIR}/junit.xml \ + --webhook ${SLACK_WEBHOOK} + fi + from: src + workflow: ipi-aws +{{< /highlight >}} + +### Using Cluster Pools for Faster Tests +Instead of provisioning a new cluster: + +{{< highlight yaml >}} +tests: +- as: e2e-cluster-pool + cluster_claim: + architecture: amd64 + cloud: aws + owner: openshift-ci + product: ocp + timeout: 1h0m0s + version: "4.14" + steps: + test: + - as: test + commands: | + # Cluster is already provisioned + oc get nodes + make test-e2e + from: src resources: requests: - cpu: 1000m - memory: 2Gi -{{< / highlight >}} + cpu: 2 + memory: 8Gi + workflow: generic-claim +{{< /highlight >}} + +## Next Steps + +- Review the [Step Registry](https://steps.ci.openshift.org/) for reusable components +- Check [Architecture documentation](/docs/architecture/) for deeper understanding +- See [How-To guides](/docs/how-tos/) for specific tasks +- Use the [Troubleshooting guide](/docs/troubleshooting/) when things go wrong + +Remember: Start simple, test locally when possible, and gradually add complexity as needed! 
diff --git a/content/en/docs/getting-started/glossary.md b/content/en/docs/getting-started/glossary.md new file mode 100644 index 00000000..ca6642d1 --- /dev/null +++ b/content/en/docs/getting-started/glossary.md @@ -0,0 +1,201 @@ +--- +title: "Glossary" +description: Common terms and concepts used in OpenShift CI +weight: 5 +keywords: ["glossary", "terms", "definitions", "vocabulary", "concepts", "terminology"] +--- + +# OpenShift CI Glossary + +This glossary defines common terms you'll encounter when working with OpenShift CI. Terms are organized alphabetically. + +## A + +### Artifact +Files or logs produced by a test job that are saved for later inspection. These might include test results, screenshots, cluster logs, or any other debugging information. + +### Approval (`/approve`) +A Prow command used in pull requests to indicate that the code changes have been reviewed and approved. Requires appropriate permissions defined in OWNERS files. + +## B + +### Base Image +A container image used as a starting point for building other images. In CI, these are often language-specific images (like `golang:1.19`) that contain build tools. + +### Branch Protection +GitHub settings that prevent direct pushes to important branches and enforce that certain CI checks pass before merging. OpenShift CI automatically manages these settings based on your job configuration. + +### Build Farm +The OpenShift clusters where CI jobs run. These clusters provide the compute resources and infrastructure needed to execute tests. + +### Build Root +The container image that contains all the tools and dependencies needed to compile your project. This is where commands like `make build` or `go build` are executed. + +## C + +### Chain +In the step registry, a sequence of steps that run in order. Chains can be reused across different workflows, making it easy to share common sequences like "install a cluster" or "run conformance tests". 
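+
+For illustration, a chain in the step registry might be declared as in the sketch below. The chain name and the two referenced steps (`org-repo-setup`, `org-repo-e2e`) are hypothetical placeholders, not entries known to exist in the registry:
+
+```yaml
+chain:
+  as: org-repo-e2e-chain      # hypothetical chain name
+  steps:
+  - ref: org-repo-setup       # hypothetical step, runs first
+  - ref: org-repo-e2e         # hypothetical step, runs second
+  documentation: |-
+    Sets up the environment, then runs the org/repo end-to-end suite.
+```
+
+A workflow or another chain could then reuse this sequence by name instead of repeating the individual steps.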
+ +### CI Operator (`ci-operator`) +The core component of OpenShift CI that understands how to build, test, and promote OpenShift components. It handles tasks like building images, creating test clusters, and running tests. + +### Cincinnati +Red Hat's update service that manages OpenShift release channels and determines which versions users can upgrade to. CI can test with releases from Cincinnati. + +### Cluster Profile +A predefined configuration that specifies how to provision test clusters on different cloud platforms (AWS, GCP, Azure, etc.). Each profile includes the necessary credentials and settings. + +### Cluster Pool +A set of pre-installed OpenShift clusters that tests can claim instead of installing a new cluster each time. This significantly speeds up test execution. + +## D + +### Dependency +In multi-stage tests, a reference to an image or artifact that a step needs. Dependencies ensure that required resources are available before a step runs. + +## E + +### Ephemeral Cluster +A temporary OpenShift cluster created just for running tests. These clusters are automatically destroyed after the tests complete. + +### Ephemeral Release +A custom OpenShift release payload created during testing that includes the images built from your pull request. This allows testing changes before they're merged. + +## G + +### GitHub App +Automated GitHub integrations used by OpenShift CI. Two main apps are used: "OpenShift CI" for running tests and "OpenShift Merge Bot" for merging approved PRs. + +## I + +### ImageStream +An OpenShift/Kubernetes resource that tracks multiple versions (tags) of related container images. Think of it as a named collection of image versions. + +### ImageStreamTag +A specific version of an image within an ImageStream. Written as `stream:tag` (e.g., `pipeline:src`). This is how images are referenced in OpenShift. 
+ +### Integration Stream +A continuously updated set of images representing the latest tested versions of all OpenShift components. Used to ensure components work together. + +## J + +### Job +A CI task that runs in response to events like opening a PR, merging code, or on a schedule. Jobs execute the tests and builds you've configured. + +## L + +### Lease +A reservation for cloud resources (like AWS quota) that ensures your test has the necessary infrastructure to create clusters. + +### Linting +Automated code style and quality checks. Common linters include `golangci-lint` for Go code and `shellcheck` for shell scripts. + +## M + +### Multi-stage Test +A test composed of multiple steps that can share data and run in sequence. This is the preferred way to write complex tests in OpenShift CI. + +### Must-gather +A tool that collects diagnostic information from OpenShift clusters. CI automatically runs this when tests fail to help with debugging. + +## N + +### Namespace +An OpenShift/Kubernetes concept for isolating resources. Each CI job runs in its own namespace to prevent interference between tests. + +## O + +### OSBS +OpenShift Build Service - Red Hat's production system for building container images that are shipped to customers. + +### OWNERS File +A file in your repository that defines who can approve changes to different parts of the code. Used by Prow for automatic review assignment and approval permissions. + +## P + +### Periodic Job +A CI job that runs on a schedule (like a cron job) rather than being triggered by code changes. Useful for nightly tests or regular health checks. + +### Pipeline +In CI context, this typically refers to the series of images created during a build: `pipeline:root` → `pipeline:src` → `pipeline:bin` → your output images. + +### Postsubmit Job +A CI job that runs after code is merged. Often used for building and publishing official images or updating documentation. 
+ +### Presubmit Job +A CI job that runs on pull requests before merging. These are your main quality gates that ensure code changes don't break things. + +### Promotion +The process of publishing tested container images to a registry where other components can use them. Images are only promoted after passing all tests. + +### Prow +The Kubernetes-native CI/CD system that OpenShift CI is built on. Prow handles GitHub integration, job scheduling, and reporting results. + +### Pull Secret +Credentials needed to pull container images from private registries. CI provides these automatically for common registries like `registry.redhat.io`. + +### Pull Specification (pull-spec) +The full address of a container image, including registry, namespace, name, and tag or digest. Example: `quay.io/openshift/origin-tests:4.13`. + +## R + +### RBAC +Role-Based Access Control - The permission system used in Kubernetes/OpenShift to control who can perform what actions. + +### Rehearsal +A test run of CI configuration changes to ensure they work correctly before being applied to all pull requests. + +### Release Payload +A bundle of container images that together form a complete OpenShift release. CI tests often install clusters using these payloads. + +### Retest (`/retest`) +A Prow command to re-run failed CI jobs on a pull request. Useful when failures are due to infrastructure issues rather than code problems. + +## S + +### Semantic Versioning (SemVer) +A version numbering scheme (MAJOR.MINOR.PATCH) used by OpenShift. Example: 4.13.0, where 4 is major, 13 is minor, and 0 is patch. + +### Shared Directory (`$SHARED_DIR`) +A filesystem location where test steps can write files that other steps in the same job can read. Used for passing data between steps. + +### Step +The smallest unit of a multi-stage test. Each step runs in its own container and performs a specific task. 
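+
+As a minimal sketch of how steps and `$SHARED_DIR` fit together, the test below defines two steps in which the first writes a file that the second reads. The test name, step names, and file name are hypothetical, and the `ipi-aws` workflow and `aws` cluster profile are assumed only because they appear in the examples elsewhere in this documentation:
+
+```yaml
+tests:
+- as: shared-dir-example              # hypothetical test name
+  steps:
+    cluster_profile: aws
+    workflow: ipi-aws                 # assumed workflow providing cluster setup and teardown
+    test:
+    - as: write-endpoint              # first step: store a value for later steps
+      commands: |
+        echo "https://example.invalid" > "${SHARED_DIR}/endpoint"
+      from: src
+      resources:
+        requests:
+          cpu: 100m
+          memory: 200Mi
+    - as: read-endpoint               # second step: read the value back
+      commands: |
+        cat "${SHARED_DIR}/endpoint"
+      from: src
+      resources:
+        requests:
+          cpu: 100m
+          memory: 200Mi
+```
+
+Because each step runs in its own container, files written anywhere other than `$SHARED_DIR` are not visible to the next step.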
+ +### Step Registry +A library of reusable test components (steps, chains, workflows) that can be combined to create new tests. Browse it at [steps.ci.openshift.org](https://steps.ci.openshift.org/). + +## T + +### Tag +1. In Git: A named reference to a specific commit +2. In container images: A version identifier for an image (like `latest` or `v1.2.3`) +3. In CI: Often refers to ImageStreamTags + +### Test Step +See "Step" above. + +### Tide +The Prow component that automatically merges approved pull requests when all tests pass. It handles the merge queue and ensures master/main branches stay green. + +## W + +### Workflow +In the step registry, a complete test definition consisting of three phases: pre (setup), test (main tests), and post (cleanup). Workflows are composed of chains and steps. + +## Common Abbreviations + +- **CI**: Continuous Integration +- **CD**: Continuous Delivery/Deployment +- **DPTP**: Developer Productivity and Test Platform (Red Hat team) +- **e2e**: End-to-end (tests that exercise the full system) +- **OCP**: OpenShift Container Platform +- **OKD**: The community distribution of Kubernetes that powers OpenShift +- **PR**: Pull Request +- **QE**: Quality Engineering + +## Getting More Help + +- Can't find a term? Ask in `#forum-ocp-testplatform` on Slack +- Want to add a term? [Submit a PR](https://github.com/openshift/ci-docs) to this glossary +- Need more context? 
Check the [Core Concepts]({{< ref "concepts" >}}) guide \ No newline at end of file diff --git a/content/en/docs/getting-started/helpdesk-faq.md b/content/en/docs/getting-started/helpdesk-faq.md index 0c8e24f4..aa839dca 100644 --- a/content/en/docs/getting-started/helpdesk-faq.md +++ b/content/en/docs/getting-started/helpdesk-faq.md @@ -1,10 +1,200 @@ --- title: "Helpdesk FAQ" description: FAQ from forum-ocp-testplatform on slack -weight: 1 +weight: 4 +keywords: ["faq", "questions", "help", "common issues", "frequently asked", "answers"] --- -## #forum-ocp-testplatform FAQ +# Frequently Asked Questions + +This page contains answers to the most common questions asked in `#forum-ocp-testplatform` on Slack. + +## General Questions + +### Q: How do I get started with OpenShift CI? +**A:** Start with our [Core Concepts]({{< ref "concepts" >}}) guide, then follow the [Writing Your First Test]({{< ref "writing-first-test" >}}) tutorial. For adding a new repository, see [Onboarding a New Component]({{< ref "../how-tos/onboarding-a-new-component" >}}). + +### Q: Where can I find my job logs? +**A:** Click on the job status in your GitHub PR. This will take you to the Prow UI where you can view logs and download artifacts. Logs are kept for 7 days for PR jobs and 30 days for periodic jobs. + +### Q: What's the difference between ci-operator and Prow? +**A:** Prow handles GitHub integration and job scheduling, while ci-operator knows how to build and test OpenShift components. Think of Prow as the conductor and ci-operator as the musician. See [Core Concepts]({{< ref "concepts" >}}) for details. + +## Job Configuration + +### Q: Why isn't my job triggering on PRs? +**A:** Common causes: +1. Job not generated - run `make jobs` in openshift/release +2. Wrong branch configuration - check file naming matches your branch +3. Job is optional - use `/test job-name` to trigger +See [Job Execution Issues]({{< ref "../troubleshooting/job-execution-issues" >}}) for more. 
+ +### Q: How do I make my job required/optional? +**A:** Set `optional: true` in your test configuration: +```yaml +tests: +- as: my-test + optional: true # Job won't block merge +``` + +### Q: How do I run tests only when certain files change? +**A:** Use `run_if_changed`: +```yaml +tests: +- as: frontend-tests + run_if_changed: "^frontend/" # Regex pattern +``` + +### Q: What's the default timeout for jobs? +**A:** 4 hours. You can override it: +```yaml +tests: +- as: long-test + timeout: 6h0m0s +``` + +## Cluster and Resources + +### Q: Why am I getting "no available quota" errors? +**A:** Cloud quota is exhausted. Solutions: +- Wait 1-2 hours for quota to be released +- Use a different cluster profile (aws-2 instead of aws) +- Use cluster pools instead of provisioning new clusters +See [Cluster Problems]({{< ref "../troubleshooting/cluster-problems" >}}). + +### Q: How do I test on a specific OpenShift version? +**A:** Configure the `releases` section: +```yaml +releases: + latest: + release: + channel: stable + version: "4.14" +``` + +### Q: What cluster profiles are available? +**A:** Common profiles include: +- `aws` - Amazon Web Services +- `gcp` - Google Cloud Platform +- `azure` - Microsoft Azure +- `vsphere` - VMware vSphere +See [full list](https://github.com/openshift/release/tree/master/ci-operator/step-registry/cluster-profiles). + +## Secrets and Access + +### Q: How do I add a secret to my CI job? +**A:** +1. Create secret in Vault at selfservice.vault.ci.openshift.org +2. Add sync metadata: + - `secretsync/target-namespace: "test-credentials"` + - `secretsync/target-name: "my-secret"` +3. Wait 30 minutes for sync +See [Adding a New Secret]({{< ref "../how-tos/adding-a-new-secret-to-ci" >}}). + +### Q: Why can't I see my job in the Prow UI? +**A:** Check if: +- You're logged in with SSO +- You're a member of the repository's GitHub organization +- The job isn't configured for private deck + +### Q: How do I access the cluster created by my test? 
+**A:** Use the `$KUBECONFIG` environment variable: +```bash +oc --kubeconfig=$KUBECONFIG get nodes +``` + +## Debugging + +### Q: My test is failing, how do I debug it? +**A:** +1. Check the build log in Prow UI +2. Look at artifacts (must-gather, test logs) +3. Add debug output to your test +4. Use `/retest` to retry +See [Debugging Failed Jobs]({{< ref "../troubleshooting/debugging-failed-jobs" >}}). + +### Q: How do I SSH into a running test pod? +**A:** You generally cannot SSH directly. Instead: +- Add a long sleep to your test +- Use `oc debug` while the pod is running +- Save debug information to `${ARTIFACT_DIR}` + +### Q: What is "rehearsal" and why is it failing? +**A:** Rehearsals test your CI configuration changes before they're merged. Failures usually mean: +- Breaking changes to existing jobs +- Invalid configuration +- Renaming jobs (breaks TestGrid) +Run `make rehearse` locally to test. + +## Common Errors + +### Q: "could not resolve release payload" +**A:** Your release configuration is incorrect. Check: +- Release name and version +- Integration namespace configuration +See [Configuration Issues]({{< ref "../troubleshooting/configuration-issues#release-resolution-errors" >}}). + +### Q: "unable to import image" +**A:** The base image doesn't exist or isn't accessible. Verify: +- Image name and tag are correct +- Image exists in the specified namespace +- You have permission to access it + +### Q: "Process did not finish before 4h0m0s timeout" +**A:** Your test exceeded the time limit. Either: +- Increase the timeout +- Optimize your test +- Split into smaller tests +See [timeout configuration]({{< ref "../troubleshooting/debugging-failed-jobs#timeout-failures" >}}). + +## Best Practices + +### Q: Should I use container tests or multi-stage tests? +**A:** +- **Container tests**: Simple commands, unit tests, linting +- **Multi-stage tests**: Need cluster, complex setup, e2e tests + +### Q: How often should periodic jobs run? 
+**A:** Depends on the purpose: +- Critical paths: Every 4-6 hours +- Standard e2e: Daily +- Expensive tests: Weekly +Consider cost and value of the signal. + +### Q: Should my job be blocking or informing? +**A:** Start with informing. Only make it blocking after: +- Proven stability (>95% pass rate) +- Critical functionality coverage +- Low false-positive rate +See [Release Gating]({{< ref "../architecture/release-gating" >}}). + +## Getting Help + +### Q: Where should I ask for help? +**A:** +1. Check this FAQ and [Troubleshooting Guide]({{< ref "../troubleshooting/_index.md" >}}) +2. Search Slack history in `#forum-ocp-testplatform` +3. Use "Ask a Question" workflow in Slack +4. File a Jira ticket for bugs/features + +### Q: How do I report a CI outage? +**A:** Use the "Report CI Outage" workflow in `#forum-ocp-testplatform` Slack channel. Include: +- Affected jobs/repos +- Error messages +- When it started +- Impact description + +### Q: Who maintains the CI system? +**A:** The Developer Productivity and Test Platform (DPTP) team. Contact via: +- Slack: `#forum-ocp-testplatform` +- Jira: [DPTP project](https://issues.redhat.com/projects/DPTP) + +--- + +## Live FAQ Table + +The table below shows recent questions from Slack. Click on a row for details: + {{< rawhtml >}} diff --git a/content/en/docs/getting-started/quick-reference.md b/content/en/docs/getting-started/quick-reference.md new file mode 100644 index 00000000..9b637498 --- /dev/null +++ b/content/en/docs/getting-started/quick-reference.md @@ -0,0 +1,379 @@ +--- +title: "Quick Reference" +description: Common commands, patterns, and snippets for OpenShift CI +weight: 5 +--- + +# Quick Reference + +This page provides quick access to commonly used commands, configurations, and patterns in OpenShift CI. 
+ +## Prow Commands + +### PR Management +```bash +# Trigger tests +/retest # Rerun all failed tests +/test # Run specific test +/test all # Run all tests +/retest-required # Rerun only failed required tests + +# PR control +/hold # Prevent auto-merge +/hold cancel # Remove hold +/lgtm # Approve (for reviewers) +/lgtm cancel # Remove approval +/approve # Approve (for approvers) +/approve cancel # Remove approval + +# Skip tests +/skip-test # Skip specific test +/override # Override failing required test +``` + +### Labels +```bash +# Common labels +/label needs-rebase +/label do-not-merge/work-in-progress +/remove-label needs-rebase + +# Jira integration +/jira refresh # Refresh Jira validation +``` + +## CI Configuration Snippets + +### Basic Container Test +```yaml +tests: +- as: unit-test + commands: make test + container: + from: src +``` + +### Multi-stage Test with Cluster +```yaml +tests: +- as: e2e-aws + steps: + cluster_profile: aws + workflow: openshift-e2e-aws +``` + +### Custom Multi-stage Test +```yaml +tests: +- as: my-e2e-test + steps: + cluster_profile: aws + pre: + - as: setup + commands: | + echo "Setting up test environment" + from: src + resources: + requests: + cpu: 100m + memory: 200Mi + test: + - as: test + commands: make e2e-test + from: src + resources: + requests: + cpu: 1000m + memory: 2Gi + post: + - as: cleanup + commands: | + echo "Cleaning up" + from: src + workflow: ipi-aws +``` + +### Conditional Test Execution +```yaml +tests: +- as: frontend-test + run_if_changed: "^(frontend|web)/" + commands: npm test + container: + from: node +``` + +### Optional Test +```yaml +tests: +- as: expensive-test + optional: true + commands: make slow-test + container: + from: src +``` + +### Periodic Test +```yaml +tests: +- as: nightly-e2e + cron: "0 0 * * *" # Daily at midnight UTC + steps: + cluster_profile: aws + workflow: openshift-e2e-aws +``` + +## Release Configuration + +### Integration Release +```yaml +releases: + latest: + 
integration: + namespace: ocp + name: "4.15" +``` + +### Stable Release +```yaml +releases: + latest: + release: + channel: stable + version: "4.14" +``` + +### Multiple Releases (for upgrade tests) +```yaml +releases: + initial: + release: + channel: stable + version: "4.13" + latest: + release: + channel: stable + version: "4.14" +``` + +## Image Configuration + +### Base Images +```yaml +base_images: + cli: + namespace: ocp + name: "4.15" + tag: cli + upi-installer: + namespace: ocp + name: "4.15" + tag: upi-installer +``` + +### Building Images +```yaml +images: +- from: base + to: my-app + dockerfile_path: Dockerfile +``` + +### Image Promotion +```yaml +promotion: + to: + - namespace: my-namespace + name: my-app + tag: latest +``` + +## Common Environment Variables + +### In Test Steps +```bash +${ARTIFACT_DIR} # Where to save test artifacts +${SHARED_DIR} # Share data between steps +${KUBECONFIG} # Cluster access (after install) +${CLUSTER_PROFILE_DIR} # Cloud credentials location +${RELEASE_IMAGE_LATEST} # Latest release image +${CLUSTER_NAME} # Name of test cluster +${NAMESPACE} # Test namespace +${JOB_NAME_SAFE} # Safe job name for resources +``` + +### For Multi-arch +```bash +${GOARCH} # Architecture (amd64, arm64, etc) +${IMAGE_FORMAT} # Registry format string +``` + +## Resource Management + +### Resource Requests +```yaml +resources: + requests: + cpu: 100m + memory: 200Mi + limits: + memory: 4Gi +``` + +### Timeout Configuration +```yaml +tests: +- as: long-test + timeout: 6h0m0s # Job timeout + steps: + test: + - as: test-step + timeout: 2h0m0s # Step timeout + grace_period: 30s +``` + +## Secrets and Credentials + +### Mount Secret in Step +```yaml +steps: + test: + - as: use-secret + credentials: + - namespace: test-credentials + name: my-secret + mount_path: /var/run/secrets/my-secret + commands: | + export TOKEN=$(cat /var/run/secrets/my-secret/token) +``` + +### Cluster Profile Credentials +```yaml +# AWS credentials at: 
+${CLUSTER_PROFILE_DIR}/.awscred + +# GCP credentials at: +${CLUSTER_PROFILE_DIR}/gce.json + +# Pull secret at: +${CLUSTER_PROFILE_DIR}/pull-secret +``` + +## Debugging Commands + +### In Test Steps +```bash +# Save debugging info +env | sort > ${ARTIFACT_DIR}/environment.txt +oc get pods --all-namespaces > ${ARTIFACT_DIR}/pods.txt +oc get nodes -o yaml > ${ARTIFACT_DIR}/nodes.yaml + +# Cluster info +oc version +oc get clusterversion +oc get clusteroperators + +# Must-gather +oc adm must-gather --dest-dir=${ARTIFACT_DIR}/must-gather +``` + +### Common Debugging Patterns +```bash +# Fail gracefully with debugging +command || { + echo "Command failed, gathering debug info..." + oc get pods -n my-namespace + oc logs deployment/my-app -n my-namespace + exit 1 +} + +# Wait for condition +oc wait --for=condition=Available deployment/my-app \ + --timeout=300s -n my-namespace +``` + +## Make Targets (in openshift/release) + +```bash +# Validate configuration +make validate-config + +# Generate jobs +make jobs + +# Update job configuration +make update + +# Run rehearsals +make rehearse + +# Specific config +make WHAT=my-repo CONFIG=my-config jobs +``` + +## Common Patterns + +### Wait for Operator +```bash +# Wait for operator deployment +oc wait --for=condition=Available \ + deployment/my-operator \ + -n my-operator-namespace \ + --timeout=10m + +# Wait for CRD +timeout 300s bash -c 'until oc get crd my-crds.example.com; do sleep 5; done' +``` + +### Create and Wait for Resource +```bash +# Apply configuration +oc apply -f config/ + +# Wait for rollout +oc rollout status deployment/my-app -n my-namespace + +# Verify pods are running +oc get pods -n my-namespace | grep Running +``` + +### Retry Pattern +```bash +# Retry command up to 5 times +for i in {1..5}; do + command && break || sleep 30 +done +``` + +### Check Resource Existence +```bash +# Check if namespace exists +if oc get namespace my-namespace 2>/dev/null; then + echo "Namespace exists" +else + oc create 
namespace my-namespace +fi +``` + +## Useful Links + +### Dashboards +- [Prow Status](https://prow.ci.openshift.org/) +- [Step Registry](https://steps.ci.openshift.org/) +- [TestGrid](https://testgrid.k8s.io/redhat) +- [Sippy](https://sippy.dptools.openshift.org/) + +### Documentation +- [This Site](/) +- [Prow Commands](https://prow.ci.openshift.org/command-help) +- [CI Search](https://search.ci.openshift.org/) + +### Repositories +- [openshift/release](https://github.com/openshift/release) - CI configs +- [openshift/ci-tools](https://github.com/openshift/ci-tools) - CI tooling + +## Need More? + +- Full command documentation: `/docs/architecture/` +- Troubleshooting: `/docs/troubleshooting/` +- Ask in Slack: `#forum-ocp-testplatform` \ No newline at end of file diff --git a/content/en/docs/getting-started/simple-example.md b/content/en/docs/getting-started/simple-example.md new file mode 100644 index 00000000..47255813 --- /dev/null +++ b/content/en/docs/getting-started/simple-example.md @@ -0,0 +1,176 @@ +--- +title: "Simple CI Example" +description: A concrete example of setting up basic CI for a Go project +weight: 3 +keywords: ["example", "tutorial", "go", "golang", "simple", "basic"] +--- + +# Simple CI Example: Testing a Go Project + +This example shows how to set up basic CI for a Go project. 
We'll create a configuration that: +- Runs unit tests on every pull request +- Builds your Go binary +- Runs linting checks + +## The Configuration File + +Here's a complete `ci-operator` configuration for a simple Go project: + +```yaml +# This file would go in: ci-operator/config/myorg/myrepo/myorg-myrepo-main.yaml + +# Define base images - these are the starting points for our builds +base_images: + os: + name: ubi-minimal # Red Hat Universal Base Image (minimal version) + namespace: ocp + tag: "8" + +# Define the build environment - where compilation happens +build_root: + image_stream_tag: + name: release + namespace: openshift + tag: golang-1.19 # Image with Go 1.19 compiler and tools + +# Define how to build the binary +binary_build_commands: | + go mod download # Download dependencies + go build ./cmd/... # Build all commands in cmd/ directory + +# Define container images to build +images: +- dockerfile_path: Dockerfile # Path to Dockerfile in your repo + to: myapp # Name of the resulting image + +# Define the tests to run +tests: +# Unit tests - run on every PR +- as: unit # Test name (shows in GitHub) + commands: | + go test ./... -race # Run all tests with race detection + container: + from: src # Run in the source image + +# Lint checks - run on every PR +- as: lint + commands: | + golangci-lint run ./... # Run linting + container: + from: src + +# Verify go mod is tidy +- as: verify + commands: | + go mod tidy + git diff --exit-code go.mod go.sum + container: + from: src +``` + +## What Each Section Does + +### Base Images +```yaml +base_images: + os: + name: ubi-minimal + namespace: ocp + tag: "8" +``` +This defines a base operating system image that your application will run on. Think of it as the foundation layer. + +### Build Root +```yaml +build_root: + image_stream_tag: + name: release + namespace: openshift + tag: golang-1.19 +``` +This specifies the image containing your build tools (Go compiler, etc.). 
The CI system will use this to compile your code. + +### Build Commands +```yaml +binary_build_commands: | + go mod download + go build ./cmd/... +``` +These commands run inside the build root to compile your application. The compiled binaries are saved for use in later stages. + +### Container Images +```yaml +images: +- dockerfile_path: Dockerfile + to: myapp +``` +This tells CI to build a container image using your Dockerfile. The resulting image will be tagged as `myapp`. + +### Tests +Each test runs in its own container and reports pass/fail status to your PR: + +- **unit**: Runs `go test` with race detection enabled +- **lint**: Runs `golangci-lint` to check code style +- **verify**: Ensures `go.mod` is properly maintained + +## Sample Dockerfile + +Your repository would include a `Dockerfile` like this: + +```dockerfile +# Multi-stage build +FROM registry.ci.openshift.org/openshift/release:golang-1.19 AS builder + +WORKDIR /go/src/github.com/myorg/myrepo +COPY . . +RUN go build -o myapp ./cmd/myapp + +# Final image +FROM registry.access.redhat.com/ubi8/ubi-minimal:latest +COPY --from=builder /go/src/github.com/myorg/myrepo/myapp /usr/bin/ +ENTRYPOINT ["/usr/bin/myapp"] +``` + +## How It Works in Practice + +1. **You push code** to a pull request +2. **Prow detects** the change and triggers ci-operator +3. **ci-operator**: + - Sets up a temporary namespace + - Imports the build root image (golang-1.19) + - Clones your code + - Runs your build commands + - Executes each test in parallel + - Builds the container image +4. 
**Results appear** as status checks on your PR + +## Common Customizations + +### Add Integration Tests +```yaml +- as: integration + commands: | + make test-integration + container: + from: src +``` + +### Test Multiple Go Versions +Create additional config files with different build roots: +- `myorg-myrepo-main.yaml` (Go 1.19) +- `myorg-myrepo-main__go1.20.yaml` (Go 1.20) + +### Add Security Scanning +```yaml +- as: security-scan + commands: | + go list -json -deps ./... | nancy sleuth + container: + from: src +``` + +## Next Steps + +- See [Multi-Stage Tests]({{< ref "../architecture/step-registry" >}}) for more complex scenarios +- Check [Examples]({{< ref "examples" >}}) for cloud-specific tests +- Browse the [Step Registry](https://steps.ci.openshift.org/) for reusable components \ No newline at end of file diff --git a/content/en/docs/getting-started/writing-first-test.md b/content/en/docs/getting-started/writing-first-test.md new file mode 100644 index 00000000..19a7031f --- /dev/null +++ b/content/en/docs/getting-started/writing-first-test.md @@ -0,0 +1,216 @@ +--- +title: "Writing Your First Test" +description: A step-by-step guide to creating your first CI test +weight: 3 +--- + +# Writing Your First Test + +This guide will walk you through creating your first test in OpenShift CI. We'll start simple and gradually add more features. + +## Prerequisites + +Before starting, make sure you have: +- A repository in the `openshift` GitHub organization (or one that's been onboarded to OpenShift CI) +- Basic understanding of YAML +- Familiarity with the [Core Concepts]({{< ref "concepts" >}}) + +## Step 1: Create a Simple Container Test + +Let's start with the simplest type of test - running a command in a container. + +Create a file named `.ci-operator.yaml` in your repository root: + +```yaml +tests: +- as: my-first-test # Name of your test + commands: | # Commands to run + echo "Hello from OpenShift CI!" + go test ./... 
# Example: run Go tests + container: + from: src # Use the source code image +``` + +This test will: +1. Clone your repository +2. Run the specified commands +3. Report success or failure + +## Step 2: Understanding the Test Environment + +When your test runs, it has access to several things: + +### Environment Variables +- `${SHARED_DIR}` - Share files between test steps +- `${ARTIFACT_DIR}` - Store test artifacts and logs + +### Example: Saving Test Results +```yaml +tests: +- as: test-with-artifacts + commands: | + # Run tests and save results + go test -v ./... | tee ${ARTIFACT_DIR}/test-results.txt + + # Save coverage report + go test -coverprofile=${ARTIFACT_DIR}/coverage.out ./... + container: + from: src +``` + +## Step 3: Using Base Images + +If your test needs specific tools, you can use different base images: + +```yaml +base_images: + golangci-lint: # Define a base image + namespace: ci + name: golangci-lint + tag: v1.54.2 + +tests: +- as: lint + commands: golangci-lint run ./... + container: + from: golangci-lint # Use the defined base image +``` + +## Step 4: Creating a Multi-Stage Test + +For more complex scenarios, use multi-stage tests with the step registry: + +```yaml +tests: +- as: e2e-aws + steps: + cluster_profile: aws # Use AWS credentials + workflow: openshift-e2e-aws # Use a pre-defined workflow +``` + +This will: +1. Create an OpenShift cluster on AWS +2. Run the default OpenShift e2e tests +3. 
Collect logs and tear down the cluster + +## Step 5: Customizing Multi-Stage Tests + +You can customize workflows by adding your own test steps: + +```yaml +tests: +- as: e2e-my-operator + steps: + cluster_profile: aws + test: + - as: deploy-operator + commands: | + # Deploy your operator + oc create -f deploy/ + + # Wait for rollout + oc wait --for=condition=Available deployment/my-operator + from: src + resources: + requests: + cpu: 100m + memory: 200Mi + - as: run-tests + commands: | + # Run operator tests + make test-e2e + from: src + resources: + requests: + cpu: 1000m + memory: 2Gi + workflow: ipi-aws # Install, test, and deprovision cluster +``` + +## Step 6: Adding Your Test to CI + +Once you've defined your test, you need to: + +1. **Create the CI configuration**: Place your config in the `openshift/release` repository at: + ``` + ci-operator/config///--.yaml + ``` + +2. **Generate the ProwJob**: Run the generation tool: + ```bash + make jobs + ``` + +3. **Submit a PR**: The Test Platform team will review your configuration + +## Common Patterns + +### Pattern 1: Building and Testing Images +```yaml +images: +- from: base + to: my-app + dockerfile_path: Dockerfile + +tests: +- as: verify-image + commands: | + # Test the built image + podman run --rm ${IMAGE_FORMAT}/my-app:latest --version + container: + from: src +``` + +### Pattern 2: Running Tests Against a Live Cluster +```yaml +tests: +- as: cluster-tests + steps: + cluster_profile: aws + test: + - ref: my-test-suite # Reference a step from the registry + workflow: openshift-e2e-aws +``` + +### Pattern 3: Conditional Test Execution +```yaml +tests: +- as: optional-test + optional: true # Don't block PR merges + commands: make slow-tests + container: + from: src +``` + +## Debugging Failed Tests + +When your test fails: + +1. **Check the Prow UI**: Click on the failed test in your PR +2. **Look at artifacts**: Download logs from `artifacts/` directory +3. 
**Review the build log**: Check for compilation or setup errors +4. **Use the artifact directory**: Save debug information where it is uploaded with the job's artifacts: + ```bash + echo "Debug info" > ${ARTIFACT_DIR}/debug.txt + ``` + +## Best Practices + +1. **Keep tests focused**: One test should validate one thing +2. **Use meaningful names**: `as: test-authentication` not `as: test1` +3. **Set appropriate timeouts**: Don't let tests run forever +4. **Clean up resources**: Always clean up in post steps +5. **Save useful artifacts**: Help future debugging + +## Next Steps + +- Explore the [Step Registry](https://steps.ci.openshift.org/) for reusable components +- Read about [Advanced Testing Patterns]({{< ref "../how-tos/creating-a-pipeline" >}}) +- Learn about [Adding Cluster Profiles]({{< ref "../how-tos/adding-a-cluster-profile" >}}) + +## Getting Help + +If you're stuck: +- Check the [Troubleshooting Guide]({{< ref "../troubleshooting/_index.md" >}}) +- Ask in `#forum-ocp-testplatform` on Slack +- Review [similar test configurations](https://github.com/openshift/release/tree/master/ci-operator/config) in other repositories \ No newline at end of file diff --git a/content/en/docs/how-tos/_index.md b/content/en/docs/how-tos/_index.md index 8a488b4d..aa44b9d4 100644 --- a/content/en/docs/how-tos/_index.md +++ b/content/en/docs/how-tos/_index.md @@ -3,5 +3,96 @@ title: "How To's" linkTitle: "How To's" weight: 2 description: > - This section contains How To's for various tasks + Step-by-step guides for common CI tasks, organized by category --- + +# How-To Guides + +This section contains practical guides for accomplishing specific tasks in OpenShift CI.
+ +## 🚀 Getting Started + +### Setting Up CI +- [Onboarding a New Component]({{< ref "onboarding-a-new-component" >}}) - Complete guide to adding your project to CI +- [Contributing CI Configuration]({{< ref "contributing-openshift-release" >}}) - How to submit CI configuration changes +- [Naming Your CI Jobs]({{< ref "naming-your-ci-jobs" >}}) - Best practices for job naming + +### Writing Tests +- [Creating a Test Pipeline]({{< ref "creating-a-pipeline" >}}) - Building test workflows +- [Using External Images]({{< ref "external-images" >}}) - Incorporating external container images +- [Testing with Nested Podman]({{< ref "nested-podman" >}}) - Running container tests within CI + +## 📊 Test Management + +### Test Configuration +- [Add Jobs to TestGrid]({{< ref "add-jobs-to-testgrid" >}}) - Monitor test results over time +- [Multi-Architecture Testing]({{< ref "multi-architecture" >}}) - Test on different architectures +- [Multi-PR Testing]({{< ref "multi-pr-presubmit-testing" >}}) - Test multiple PRs together + +### Test Execution +- [Interact with Running Jobs]({{< ref "interact-with-running-jobs" >}}) - Debug jobs in real-time +- [Override Failing CI Jobs]({{< ref "overriding-failing-ci-jobs" >}}) - Emergency procedures +- [Trigger Jobs via REST API]({{< ref "triggering-prowjobs-via-rest" >}}) - Programmatic job execution + +## 🔧 Advanced Configuration + +### Registry and Artifacts +- [Adding Step Registry Content]({{< ref "adding-changing-step-registry-content" >}}) - Create reusable test components +- [Migrating to Multi-Stage Tests]({{< ref "migrating-template-jobs-to-multistage" >}}) - Modernize legacy jobs +- [Managing Artifacts]({{< ref "artifacts" >}}) - Store and retrieve test outputs + +### Images and Promotion +- [Mirror Images to Quay]({{< ref "mirroring-to-quay" >}}) - External image publication +- [Use Registries in Build Farm]({{< ref "use-registries-in-build-farm" >}}) - Registry configuration + +## 🔒 Security and Access + +### Secrets
and Credentials +- [Add a New Secret to CI]({{< ref "adding-a-new-secret-to-ci" >}}) - Manage sensitive data +- [Add a Cluster Profile]({{< ref "adding-a-cluster-profile" >}}) - Configure cloud access +- [RBAC Configuration]({{< ref "rbac" >}}) - Set up permissions + +### Security Scanning +- [Add Security Scanning]({{< ref "add-security-scanning" >}}) - Integrate vulnerability scanning +- [Private Repository Access]({{< ref "add-team-access-to-private-deck" >}}) - Configure private deck access + +## 🎯 Specialized Testing + +### Operator Testing +- [Testing Operator SDK Operators]({{< ref "testing-operator-sdk-operators" >}}) - OLM-based operator testing + +### Cluster Management +- [Using Cluster Claims]({{< ref "cluster-claim" >}}) - Pre-provisioned cluster pools +- [Platform Capabilities]({{< ref "capabilities" >}}) - Feature gate testing + +## 📢 Monitoring and Notifications + +### Alerting +- [Set Up Notifications]({{< ref "notification" >}}) - Slack and email alerts +- [PR Reminder Bot]({{< ref "pr-reminder" >}}) - Automated PR notifications + +## 📚 Quick Reference + +### Common Tasks by Role + +**For Developers:** +1. Start with [Onboarding a New Component]({{< ref "onboarding-a-new-component" >}}) +2. Learn about [Creating a Test Pipeline]({{< ref "creating-a-pipeline" >}}) +3. Set up [Notifications]({{< ref "notification" >}}) for your jobs + +**For Test Engineers:** +1. Explore [Step Registry Content]({{< ref "adding-changing-step-registry-content" >}}) +2. Configure [Multi-Architecture Testing]({{< ref "multi-architecture" >}}) +3. Add jobs to [TestGrid]({{< ref "add-jobs-to-testgrid" >}}) + +**For Platform Engineers:** +1. Manage [Cluster Profiles]({{< ref "adding-a-cluster-profile" >}}) +2. Configure [RBAC]({{< ref "rbac" >}}) +3. Set up [Security Scanning]({{< ref "add-security-scanning" >}}) + +## Need Help? + +Can't find what you're looking for?
+- Check the [Troubleshooting Guide]({{< ref "../troubleshooting/_index.md" >}}) +- Review [Examples]({{< ref "../getting-started/examples" >}}) +- Ask in `#forum-ocp-testplatform` on Slack diff --git a/content/en/docs/how-tos/onboarding-a-new-component.md b/content/en/docs/how-tos/onboarding-a-new-component.md index 4b621825..2d696953 100644 --- a/content/en/docs/how-tos/onboarding-a-new-component.md +++ b/content/en/docs/how-tos/onboarding-a-new-component.md @@ -2,12 +2,35 @@ title: Onboarding a New Component for Testing and Merge Automation description: How to onboard a new component repository to the CI system for testing and merge automation. +aliases: + - /docs/onboarding/ + - /docs/getting-started-ci/ + - /docs/new-component/ --- ## Overview -This document overviews the workflow for onboarding new public component repositories to the Openshift CI. Private +This guide walks you through adding your repository to OpenShift CI so it can run automated tests and participate in the merge automation workflow. + +### Before You Start + +**New to OpenShift CI?** Make sure you've read: +- [Core Concepts]({{< ref "../getting-started/concepts" >}}) - Understanding Prow, ci-operator, and jobs +- [Glossary]({{< ref "../getting-started/glossary" >}}) - Definitions of terms used in this guide + +### What You'll Set Up + +1. **GitHub permissions** - Allow CI robots to interact with your repository +2. **Prow configuration** - Enable GitHub automation (like `/retest` commands) +3. **CI Operator configuration** - Define how to build and test your code +4. **Test jobs** - Specify what tests run and when + +### Repository Types + +This document covers onboarding new public component repositories to OpenShift CI. Private repositories are also supported with a few caveats. More information can be found [here](/docs/architecture/private-repositories). +### Planning to Add Images to OpenShift?
+ If you are thinking about adding new images to OpenShift release payloads, read [this section](#product-builds-and-becoming-part-of-an-openshift-release) first, to avoid doing work you might have to adjust later. ## Granting Robots Privileges and Installing the GitHub App @@ -33,10 +56,16 @@ Both of them are required for automations to work properly, if one is missing yo ## Prow Configuration -[Prow](https://docs.prow.k8s.io/docs/overview/) is the k8s-native upstream CI system, source -code hosted in the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow) repository. Prow interacts with -GitHub to provide the automation UX that developers use on their pull requests, as well as orchestrating test workloads -for those pull requests. +[Prow](https://docs.prow.k8s.io/docs/overview/) is the Kubernetes-native CI system that powers OpenShift CI. It's what enables commands like `/retest` and automatic merging of approved PRs. + +### What Prow Does + +- **Responds to GitHub events**: New PRs, commits, comments +- **Schedules test jobs**: Decides when and where to run tests +- **Reports results**: Updates PR status checks and posts comments +- **Handles merging**: Automatically merges PRs when all conditions are met + +Prow is developed by the Kubernetes community in the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow) repository. ### Bootstrapping Configuration for a new Repository @@ -111,13 +140,21 @@ require under a new `plugins.yaml["plugins"]["$org/$repo"]` key. 
### Describing Tests -Prow provides the following test trigger types: +Prow supports different types of tests that run at different times in your development workflow: -|Type Name|Trigger|Target|Purpose| +#### Test Types Explained + +|Type Name|When It Runs|What It Tests|Common Use Cases| |:---|:---|:---|:---| -|`presubmit`|Push to a PR|A single PR merged into the branch it is targeting|Testing commits within a PR before they are merged| -|`postsubmit`|Push/merge to a branch|User specified set of branches|Integration tests after a PR is merged| -|`periodic`|`cron`-like schedule|User-specified set of branches|Scheduled test runs| +|`presubmit`|On every PR update|Your PR changes merged with the target branch|Unit tests, linting, e2e tests - anything to validate the PR is good| +|`postsubmit`|After PR merges|The actual state of the branch after merge|Building and publishing images, updating docs, integration tests| +|`periodic`|On a schedule (like cron)|The current state of specified branches|Nightly builds, long-running tests, regular health checks| + +#### Example Scenarios + +- **Presubmit**: Run unit tests and linting on every PR to ensure code quality +- **Postsubmit**: Build and push container images after changes are merged +- **Periodic**: Run expensive integration tests nightly instead of on every PR + Configuration for your repository’s tests live in YAML files in the [openshift/release](https://github.com/openshift/release) repository. Jobs are stored in diff --git a/content/en/docs/troubleshooting/_index.md b/content/en/docs/troubleshooting/_index.md new file mode 100644 index 00000000..0877f26d --- /dev/null +++ b/content/en/docs/troubleshooting/_index.md @@ -0,0 +1,124 @@ +--- +title: "Troubleshooting Guide" +linkTitle: "Troubleshooting" +weight: 3 +description: > + Common issues and how to debug CI problems +--- + +# Troubleshooting OpenShift CI + +This guide helps you diagnose and fix common issues with OpenShift CI jobs.
Based on real questions from the community, it covers the most frequent problems and their solutions. + +## Quick Diagnosis + +Start here to identify your issue: + +### My Job Won't Start +- **Symptom**: Job stays pending or doesn't trigger +- **Go to**: [Job Execution Issues]({{< ref "job-execution-issues" >}}) + +### My Job Fails +- **Symptom**: Job runs but exits with error +- **Go to**: [Debugging Failed Jobs]({{< ref "debugging-failed-jobs" >}}) + +### Cluster Issues +- **Symptom**: Can't create cluster, cluster fails to install +- **Go to**: [Cluster Problems]({{< ref "cluster-problems" >}}) + +### Configuration Errors +- **Symptom**: Invalid YAML, rehearsal failures +- **Go to**: [Configuration Issues]({{< ref "configuration-issues" >}}) + +### Access and Permissions +- **Symptom**: Can't access resources, permission denied +- **Go to**: [Access Issues]({{< ref "access-issues" >}}) + +## Common Error Messages + +Quick solutions for frequently seen errors: + +| Error | Solution | +|-------|----------| +| `error creating cluster: no available quota` | Check [quota limits]({{< ref "cluster-problems#quota-issues" >}}) | +| `failed to resolve release` | See [release resolution]({{< ref "configuration-issues#release-errors" >}}) | +| `permission denied` | Review [RBAC setup]({{< ref "access-issues#rbac" >}}) | +| `timeout waiting for pod` | Check [timeout configuration]({{< ref "debugging-failed-jobs#timeouts" >}}) | +| `unable to find secret` | Verify [secret setup]({{< ref "access-issues#secrets" >}}) | + +## General Debugging Steps + +For any CI issue, follow these steps: + +### 1. Check the Prow UI +- Navigate to your PR on GitHub +- Click on the failed job status +- Look for error messages in the job log + +### 2. Review Artifacts +Most jobs save debugging information: +``` +artifacts/ +├── build-log.txt # Main job output +├── e2e/ # Test logs +├── junit/ # Test results +└── must-gather/ # Cluster diagnostic data +``` + +### 3.
Check Recent Changes +- Did your PR modify CI configuration? +- Were there recent changes to base images? +- Is this a new failure or recurring issue? + +### 4. Use Debug Commands +Add debugging to your test: +```bash +# Print environment +env | grep -E "(CLUSTER|KUBE|OPENSHIFT)" | sort + +# Check cluster status +oc get nodes +oc get clusterversion +oc get clusteroperators + +# Save debug info +oc adm must-gather --dest-dir=${ARTIFACT_DIR}/must-gather +``` + +## Getting Help + +If you can't resolve your issue: + +1. **Search First** + - Check this troubleshooting guide + - Search Slack history in `#forum-ocp-testplatform` + - Look for similar issues in [Jira](https://issues.redhat.com/projects/DPTP) + +2. **Gather Information** + - Job URL + - Error messages + - What you've already tried + - Relevant configuration snippets + +3. **Ask for Help** + - Use the "Ask a Question" workflow in `#forum-ocp-testplatform` + - Include all gathered information + - Be specific about what you're trying to achieve + +## Preventing Issues + +Best practices to avoid common problems: + +- **Test locally first**: Validate YAML and scripts before pushing +- **Start simple**: Begin with basic tests, add complexity gradually +- **Use existing patterns**: Copy from working examples +- **Monitor your jobs**: Set up alerts for failures +- **Keep configurations DRY**: Use the step registry for reusable components + +## Next Steps + +- [Job Execution Issues]({{< ref "job-execution-issues" >}}) - Jobs that won't start +- [Debugging Failed Jobs]({{< ref "debugging-failed-jobs" >}}) - Jobs that fail +- [Cluster Problems]({{< ref "cluster-problems" >}}) - Cluster creation/access issues +- [Configuration Issues]({{< ref "configuration-issues" >}}) - YAML and setup problems +- [Access Issues]({{< ref "access-issues" >}}) - Permissions and secrets \ No newline at end of file diff --git a/content/en/docs/troubleshooting/access-issues.md b/content/en/docs/troubleshooting/access-issues.md new file mode 
100644 index 00000000..8aa30825 --- /dev/null +++ b/content/en/docs/troubleshooting/access-issues.md @@ -0,0 +1,381 @@ +--- +title: "Access Issues" +description: Resolving permission, authentication, and secrets problems +weight: 5 +--- + +# Access Issues + +This guide helps you resolve access-related problems including permissions, authentication failures, and secrets issues. + +## Quick Diagnosis + +| Error | Type | Solution | +|-------|------|----------| +| "Permission denied" | RBAC | [RBAC Issues](#rbac) | +| "Unable to find secret" | Secrets | [Secret Problems](#secrets) | +| "Unauthorized" | Auth | [Authentication Issues](#authentication-issues) | +| "Forbidden" | Access | [Repository Access](#repository-access) | + +## RBAC Issues {#rbac} + +### Problem: Permission Denied in Cluster + +**Error**: +``` +Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:ci:default" cannot list resource "pods" +``` + +**Solutions**: + +1. **For test pods**: + ```yaml + # Tests run with limited permissions by default + # Use cluster-admin for admin access + tests: + - as: admin-test + steps: + cluster_profile: aws + test: + - as: needs-admin + cli: latest # Provides oc CLI with admin kubeconfig + commands: | + oc adm policy add-scc-to-user privileged -z default + ``` + +2. **For namespace access**: + ```bash + # Grant specific permissions + oc create rolebinding my-binding \ + --clusterrole=edit \ + --serviceaccount=ci:default \ + -n target-namespace + ``` + +### Problem: Cannot Access CI Namespace + +**Symptoms**: +- Cannot view jobs in Prow UI +- Permission denied when debugging + +**Solutions**: + +1. **Verify GitHub teams**: + - Ensure you're in the correct GitHub organization + - Check team membership for repository + +2. **SSO login issues**: + ```bash + # Re-authenticate + oc logout + oc login --web + ``` + +3.
**Request access**: + - File Jira ticket for namespace access + - Include GitHub username and repositories + +## Secret Problems {#secrets} + +### Problem: Secret Not Found + +**Error**: +``` +error: unable to find secret "my-secret" in namespace "test-credentials" +``` + +**Common causes and fixes**: + +1. **Secret not synced yet**: + - Secrets sync every 30 minutes + - Check Vault for secret presence + - Wait for sync or request manual sync + +2. **Wrong namespace**: + ```yaml + # Secrets must be in test-credentials + credentials: + - namespace: test-credentials # Required + name: my-secret + mount_path: /var/run/secrets/my-secret + ``` + +3. **Incorrect secret configuration**: + ```yaml + # In Vault, ensure these keys exist: + secretsync/target-namespace: "test-credentials" + secretsync/target-name: "my-secret" + ``` + +### Problem: Secret Content Issues + +**Symptoms**: +- Secret exists but content is wrong +- Authentication still fails with secret + +**Debugging**: + +1. **Verify secret content**: + ```bash + # In test step + echo "Secret contents:" + ls -la ${CLUSTER_PROFILE_DIR}/ + cat ${CLUSTER_PROFILE_DIR}/secret-key || echo "Key missing" + ``` + +2. **Check secret encoding**: + ```bash + # Secrets should not be double-encoded + # If seeing base64 in file, it's double-encoded + base64 -d ${CLUSTER_PROFILE_DIR}/pull-secret > decoded.json + ``` + +3. **Format issues**: + ```yaml + # AWS credentials format + [default] + aws_access_key_id=XXXX + aws_secret_access_key=YYYY + ``` + +### Problem: Cluster Profile Secrets + +**Error**: +``` +Failed to find credentials in cluster profile +``` + +**Solutions**: + +1. **Verify cluster profile exists**: + ```bash + # Check available profiles + ls ci-operator/step-registry/cluster-profiles/ + ``` + +2. 
**Check secret mounting**: + ```yaml + tests: + - as: cloud-test + steps: + cluster_profile: aws # Must match existing profile + env: + # Credentials available at + # ${CLUSTER_PROFILE_DIR}/credentials + ``` + +## Authentication Issues + +### Problem: Registry Authentication Failed + +**Error**: +``` +Failed to pull image: authentication required +``` + +**Solutions**: + +1. **Use CI pull credentials**: + ```yaml + tests: + - as: pull-private + steps: + test: + - as: use-private-image + credentials: + - namespace: test-credentials + name: ci-pull-credentials + mount_path: /var/run/secrets/ci-pull-credentials + commands: | + export REGISTRY_AUTH_FILE=/var/run/secrets/ci-pull-credentials/.dockerconfigjson + podman pull registry.example.com/private/image:tag + ``` + +2. **For Red Hat registries**: + ```yaml + # Built-in credentials + credentials: + - namespace: ci + name: ci-pull-credentials + mount_path: /var/run/secrets/redhat + ``` + +### Problem: Git Authentication + +**Symptoms**: +- Cannot clone private repositories +- SSH key authentication fails + +**Solutions**: + +1. **Use SSH keys from secrets**: + ```yaml + credentials: + - namespace: test-credentials + name: git-ssh-key + mount_path: /var/run/secrets/ssh + commands: | + # Configure SSH + mkdir -p ~/.ssh + cp /var/run/secrets/ssh/id_rsa ~/.ssh/ + chmod 600 ~/.ssh/id_rsa + + # Add host key + ssh-keyscan github.com >> ~/.ssh/known_hosts + + # Clone + git clone git@github.com:org/private-repo.git + ``` + +2. **Use token authentication**: + ```bash + # With personal access token + git clone https://${GITHUB_TOKEN}@github.com/org/private-repo.git + ``` + +## Repository Access + +### Problem: Cannot Access Private Repository + +**Error**: +``` +Repository not found or permission denied +``` + +**For openshift-priv repos**: + +1. **Check mirror configuration**: + ```yaml + # In .ci-operator.yaml + private: true + expose: true # If you want jobs visible + ``` + +2. 
**Verify sync status**: + - Private repos sync periodically + - Check if your repo is in sync allowlist + +### Problem: Deck UI Access + +**Symptoms**: +- Cannot see job results +- Artifacts not accessible + +**Solutions**: + +1. **For private deck**: + - Request access through Jira + - Must be in appropriate Rover group + +2. **Check job configuration**: + ```yaml + # Jobs may be configured for private deck + decoration_config: + gcs_configuration: + bucket: origin-ci-test-private + ``` + +## Cloud Provider Access + +### Problem: Cloud Credentials Invalid + +**Error**: +``` +AuthFailure: AWS was not able to validate the provided access credentials +``` + +**Debugging**: + +1. **Test credentials manually**: + ```bash + # In test step + export AWS_SHARED_CREDENTIALS_FILE=${CLUSTER_PROFILE_DIR}/.awscred + aws sts get-caller-identity + ``` + +2. **Check credential format**: + ```bash + # Should contain + cat ${CLUSTER_PROFILE_DIR}/.awscred + # [default] + # aws_access_key_id=XXXX + # aws_secret_access_key=YYYY + ``` + +3. **Verify region configuration**: + ```bash + export AWS_DEFAULT_REGION=us-east-1 + aws ec2 describe-regions + ``` + +## Prevention and Best Practices + +### 1. Test Access Early + +Before running expensive tests: +```yaml +tests: +- as: verify-access + steps: + cluster_profile: aws + test: + - as: check-creds + commands: | + # Verify AWS access + aws sts get-caller-identity + + # Verify secret mounting + ls -la ${CLUSTER_PROFILE_DIR}/ + from: cli +``` + +### 2. Document Secret Requirements + +In your repository README: +```markdown +## Required Secrets + +This CI configuration requires: +- `my-app-credentials`: API credentials for X +- `my-app-github-token`: GitHub access token + +Contact @team to request access. +``` + +### 3. 
Use Least Privilege + +Only request permissions you need: +```yaml +# Bad - requests everything +credentials: +- namespace: ci + name: admin-credentials + +# Good - specific secret +credentials: +- namespace: test-credentials + name: my-app-readonly-creds +``` + +## Getting Help + +When facing access issues: + +1. **Gather information**: + - Exact error message + - Secret/credential names + - Job configuration + +2. **Check documentation**: + - Verify secret setup steps + - Confirm naming conventions + +3. **Request assistance**: + - Use Slack workflows for access requests + - Include all debugging information + +## Next Steps + +- [Secret Management](/docs/how-tos/adding-a-new-secret-to-ci/) - Adding new secrets +- [Cluster Profiles](/docs/how-tos/adding-a-cluster-profile/) - Cloud access setup +- [RBAC Guide](/docs/how-tos/rbac/) - Permission management \ No newline at end of file diff --git a/content/en/docs/troubleshooting/cluster-problems.md b/content/en/docs/troubleshooting/cluster-problems.md new file mode 100644 index 00000000..9b56346d --- /dev/null +++ b/content/en/docs/troubleshooting/cluster-problems.md @@ -0,0 +1,402 @@ +--- +title: "Cluster Problems" +description: Troubleshooting cluster creation, access, and stability issues +weight: 3 +--- + +# Cluster Problems + +This guide helps you resolve issues related to OpenShift clusters in CI, including creation failures, access problems, and cluster instability. 
+ +## Quick Diagnosis + +| Symptom | Likely Cause | Go To | +|---------|--------------|-------| +| "No available quota" | Quota limits reached | [Quota Issues](#quota-issues) | +| "Failed to create cluster" | Installation failure | [Installation Failures](#installation-failures) | +| "Cannot connect to cluster" | Network/auth issues | [Access Problems](#access-problems) | +| "Cluster operators degraded" | Cluster unhealthy | [Cluster Health](#cluster-health-issues) | + +## Quota Issues + +### Problem: No Available Quota + +**Error message**: +``` +error creating cluster: failed to acquire lease: no available quota +``` + +**Causes**: +- Cloud account quota exhausted +- Too many concurrent jobs +- Leaked resources from failed jobs + +**Solutions**: + +1. **Check current quota usage**: + ```bash + # View quota consumption in Grafana + # https://grafana-route-ci-grafana.apps.ci.l2s4.p1.openshiftapps.com/ + ``` + +2. **Wait and retry**: + - Most quota is released within 1-2 hours + - Use `/retest` command after waiting + +3. **Reduce concurrent jobs**: + ```yaml + # Limit job concurrency + max_concurrency: 5 # Default is 10 + ``` + +4. **Report persistent issues**: + - Use "Report CI Outage" workflow in Slack + - Include job links and error messages + +### Problem: Specific Region Quota + +**Error**: Quota exhausted in specific region (e.g., us-east-1) + +**Solutions**: +```yaml +# Use a different region +tests: +- as: e2e-aws-west + steps: + cluster_profile: aws-2 # Uses us-west-2 + env: + AWS_REGION: us-west-2 + workflow: openshift-e2e-aws +``` + +## Installation Failures + +### Problem: Cluster Installation Timeout + +**Error**: +``` +level=error msg="Cluster operator X did not become available" +``` + +**Common causes**: +- Infrastructure issues +- Network problems +- Invalid configuration + +**Debugging steps**: + +1. **Check installation logs**: + ```bash + # In artifacts directory + installer/.openshift_install.log + installer/events.json + ``` + +2. 
**Review cluster operator status**: + ``` + # Look in artifacts/ + oc_cmds/oc_get_clusteroperators + oc_cmds/oc_get_nodes + ``` + +3. **Common operator issues**: + - **authentication**: Often cert-manager related + - **ingress**: Usually DNS or load balancer issues + - **monitoring**: Typically storage problems + - **machine-api**: Cloud provider API issues + +### Problem: Bootstrap Failure + +**Error**: +``` +Bootstrap failed to complete: timed out waiting for the condition +``` + +**Solutions**: + +1. **Check bootstrap logs**: + - Look for `bootstrap/journals/` in artifacts + - Review `bootkube.service` logs + +2. **Verify cloud resources**: + ```yaml + # Add gather steps for debugging + tests: + - as: e2e-debug + steps: + cluster_profile: aws + post: + - ref: ipi-aws-gather # Gathers cloud resources + workflow: openshift-e2e-aws + ``` + +### Problem: DNS Issues + +**Error**: +``` +error: waiting for API: Get "https://api.cluster.example.com:6443": dial tcp: lookup api.cluster.example.com: no such host +``` + +**Solutions**: + +1. **Verify DNS configuration**: + ```bash + # Check Route53 (AWS) or Cloud DNS (GCP) + nslookup api.${CLUSTER_NAME}.${BASE_DOMAIN} + ``` + +2. **Check base domain**: + ```yaml + # Ensure correct base domain + steps: + cluster_profile: aws + env: + BASE_DOMAIN: aws.ci.openshift.org # Must match profile + ``` + +## Access Problems + +### Problem: Cannot Connect to Cluster + +**Error**: +``` +Unable to connect to the server: dial tcp: i/o timeout +``` + +**Debugging**: + +1. **Verify kubeconfig**: + ```bash + # Check if KUBECONFIG is set + echo $KUBECONFIG + + # Test connection + oc whoami + oc get nodes + ``` + +2. **For claimed clusters**: + ```bash + # Kubeconfig location differs + export KUBECONFIG=${SHARED_DIR}/kubeconfig + ``` + +3. 
**Network connectivity**: + ```bash + # Test API endpoint + curl -k https://api.${CLUSTER_NAME}.${BASE_DOMAIN}:6443/healthz + ``` + +### Problem: Authentication Failed + +**Error**: +``` +error: You must be logged in to the server (Unauthorized) +``` + +**Solutions**: + +1. **Check credentials**: + ```bash + # For installer-provisioned clusters + export KUBECONFIG=${SHARED_DIR}/kubeconfig + + # For claimed clusters + export KUBECONFIG=${KUBECONFIG:-${SHARED_DIR}/kubeconfig} + ``` + +2. **Verify kubeadmin password**: + ```bash + # Password location + cat ${KUBEADMIN_PASSWORD_FILE} + + # Login as kubeadmin + oc login -u kubeadmin -p "$(cat "${KUBEADMIN_PASSWORD_FILE}")" + ``` + +## Cluster Health Issues + +### Problem: Degraded Cluster Operators + +**Identify issues**: +```bash +# Check operator status +oc get clusteroperators + +# Get details on degraded operator +oc describe clusteroperator <operator-name> + +# Check operator pods +oc get pods -n openshift-<operator-name>-operator +``` + +**Common fixes**: + +1. **Storage issues**: + ```bash + # Check PVCs + oc get pvc --all-namespaces | grep -v Bound + + # Check storage class + oc get storageclass + ``` + +2. **Network problems**: + ```bash + # Check network operator + oc get network.operator cluster -o yaml + + # Verify pod networking + oc get pods --all-namespaces | grep -v Running + ``` + +3. 
**Certificate issues**: + ```bash + # List certificate signing requests + oc get certificatesigningrequests + + # Approve pending CSRs + oc get csr -o name | xargs -I {} oc adm certificate approve {} + ``` + +### Problem: Node Issues + +**Symptoms**: +- Nodes NotReady +- Pod scheduling failures +- High resource usage + +**Debugging**: +```bash +# Check node status +oc get nodes +oc describe node <node-name> + +# Check node resources +oc adm top nodes + +# Review node logs +oc adm node-logs <node-name> + +# Check kubelet status +oc adm node-logs <node-name> -u kubelet +``` + +## Platform-Specific Issues + +### AWS + +**Common issues**: +- IAM permission problems +- VPC limit reached +- EBS volume attachment failures + +**Debug commands**: +```bash +# Check AWS resources in artifacts +installer/metadata.json +installer/terraform.tfstate + +# Verify IAM permissions +aws sts get-caller-identity +``` + +### GCP + +**Common issues**: +- Quota exceeded in project +- Service account permissions +- Network security policies + +**Debug commands**: +```bash +# Check GCP resources +gcloud compute instances list +gcloud compute networks list +``` + +### Azure + +**Common issues**: +- Resource group limits +- Virtual network conflicts +- Subscription quota + +**Debug commands**: +```bash +# Check Azure resources +az vm list --resource-group ${CLUSTER_NAME}-rg +az network vnet list --resource-group ${CLUSTER_NAME}-rg +``` + +## Prevention Best Practices + +### 1. Use Cluster Pools + +Instead of provisioning new clusters: +```yaml +tests: +- as: e2e-pool + cluster_claim: + architecture: amd64 + cloud: aws + owner: openshift-ci + product: ocp + timeout: 1h0m0s + version: "4.15" + steps: + test: + - ref: my-tests +``` + +### 2. Set Appropriate Timeouts + +Don't wait forever for broken clusters: +```yaml +tests: +- as: e2e-timeout + timeout: 2h0m0s # Overall job timeout + steps: + cluster_profile: aws + env: + CLUSTER_INSTALL_TIMEOUT: "60m" # Installation timeout +``` + +### 3. 
Add Cleanup Steps + +Ensure resources are released: +```yaml +tests: +- as: e2e-cleanup + steps: + cluster_profile: aws + post: + - ref: ipi-aws-post # Standard cleanup + - ref: my-cleanup # Additional cleanup + workflow: openshift-e2e-aws +``` + +## Getting Help + +When cluster issues persist: + +1. **Collect debugging data**: + - Installation logs + - Cluster operator status + - Must-gather output + +2. **Check for known issues**: + - Search Slack history + - Review similar job failures + - Check component Jira tickets + +3. **Report the issue**: + - Use Slack workflow for outages + - Include cluster details and logs + - Tag relevant team if known + +## Next Steps + +- [Debugging Failed Jobs]({{< ref "debugging-failed-jobs" >}}) - For test failures after cluster creation +- [Configuration Issues]({{< ref "configuration-issues" >}}) - For cluster configuration problems +- [Access Issues]({{< ref "access-issues" >}}) - For permission and authentication problems \ No newline at end of file diff --git a/content/en/docs/troubleshooting/configuration-issues.md b/content/en/docs/troubleshooting/configuration-issues.md new file mode 100644 index 00000000..31643bb4 --- /dev/null +++ b/content/en/docs/troubleshooting/configuration-issues.md @@ -0,0 +1,340 @@ +--- +title: "Configuration Issues" +description: How to fix CI configuration problems and YAML errors +weight: 4 +--- + +# Configuration Issues + +This guide helps you resolve configuration problems in your CI setup, including YAML syntax errors, validation failures, and rehearsal issues. + +## Common Configuration Problems + +### YAML Syntax Errors + +**Symptom**: Job fails to load with parsing error +``` +error parsing config: yaml: line 10: found character that cannot start any token +``` + +**Solutions**: +1. Validate your YAML: + ```bash + # Use yamllint + yamllint ci-operator/config/org/repo/org-repo-main.yaml + + # Or use yq + yq eval '.' config.yaml > /dev/null + ``` + +2. 
Common YAML mistakes: + - Tabs instead of spaces (use spaces only!) + - Incorrect indentation + - Missing quotes around special characters + - Wrong dash/hyphen placement + +**Example Fix**: +```yaml +# Wrong - uses tabs +tests: + - as: my-test + steps: + test: + +# Correct - uses spaces +tests: +- as: my-test + steps: + test: +``` + +### Schema Validation Errors + +**Symptom**: Configuration rejected by ci-operator +``` +error validating configuration: tests[0].steps.test[0]: unknown field "command" +``` + +**Solutions**: +1. Check field names (it's `commands`, not `command`) +2. Verify field placement +3. Consult the schema documentation + +**Common schema issues**: +```yaml +# Wrong field name +- as: test + command: echo "test" # ❌ Should be "commands" + +# Correct +- as: test + commands: echo "test" # ✓ + +# Wrong placement +tests: +- as: test + timeout: 2h # ❌ timeout goes at test level, not step + steps: + test: + - as: step + commands: test + +# Correct +tests: +- as: test + timeout: 2h # ✓ + steps: + test: + - as: step + commands: test +``` + +### Missing Required Fields + +**Symptom**: Validation errors about missing fields +``` +error validating configuration: tests[0].steps: missing required field "cluster_profile" +``` + +**Solutions**: +```yaml +# For multi-stage tests, cluster_profile is often required +tests: +- as: e2e + steps: + cluster_profile: aws # Add this + workflow: openshift-e2e-aws +``` + +## Release Configuration Issues + +### Release Resolution Errors + +**Symptom**: Cannot resolve release payload +``` +failed to resolve release latest: no release configuration found +``` + +**Solutions**: +1. Define releases in your config: + ```yaml + releases: + latest: + integration: + namespace: ocp + name: "4.15" + ``` + +2. 
For stable releases: + ```yaml + releases: + latest: + release: + channel: stable + version: "4.14" + ``` + +### Image Resolution Problems + +**Symptom**: Cannot find required images +``` +error: image "installer" not found in imagestream +``` + +**Solutions**: +1. Check if image exists in the release: + ```bash + oc get imagestream -n ocp 4.15 -o yaml | grep installer + ``` + +2. Use correct image references: + ```yaml + base_images: + installer: + namespace: ocp + name: "4.15" + tag: installer + ``` + +## Step Registry Errors + +### Step Not Found + +**Symptom**: Referenced step doesn't exist +``` +error: step "my-custom-step" not found in registry +``` + +**Solutions**: +1. Verify step name and location: + ```bash + # Check if step exists + ls ci-operator/step-registry/my/custom/step/ + ``` + +2. Ensure proper naming convention: + - Step ref: `my-custom-step` + - File: `my-custom-step-ref.yaml` + - Commands: `my-custom-step-commands.sh` + +### Circular Dependencies + +**Symptom**: Chain references create a loop +``` +error: circular dependency detected in chain "my-chain" +``` + +**Solutions**: +- Review chain definitions +- Remove circular references +- Simplify chain structure + +## Promotion and Mirroring Issues + +### Promotion Configuration + +**Symptom**: Images not promoted correctly +``` +Failed to promote images: no promotion configuration +``` + +**Solutions**: +```yaml +promotion: + to: + - namespace: ocp + name: "4.15" + # or for namespace/name pattern: + # namespace: my-namespace + # tag: latest +``` + +### Image Mirroring Problems + +**Symptom**: Cannot mirror to external registry +``` +error mirroring image: unauthorized +``` + +**Solutions**: +1. Verify registry credentials exist +2. 
Check mirror configuration: + ```yaml + images: + - from: base + to: my-app + mirror_to: + - quay.io/myorg/myapp:latest + ``` + +## Validation and Rehearsal Issues + +### Rehearsal Failures + +**Symptom**: PR tests fail in rehearsals +``` +REHEARSAL FAILURE: configuration changes break existing jobs +``` + +**Solutions**: +1. Run rehearsals locally: + ```bash + make rehearse + ``` + +2. Common rehearsal issues: + - Renaming jobs (breaks TestGrid) + - Removing required jobs + - Changing job types + +### Pre-submit vs Periodic Configuration + +**Problem**: Different requirements for job types + +**Pre-submit jobs**: +```yaml +tests: +- as: unit + commands: make test + container: + from: src + # Pre-submit specific options + run_if_changed: "^pkg/" + optional: true +``` + +**Periodic jobs**: +```yaml +tests: +- as: nightly-e2e + cron: "0 0 * * *" + steps: + cluster_profile: aws + workflow: openshift-e2e-aws +``` + +## Best Practices for Configuration + +### 1. Use Configuration Hierarchy + +Reduce duplication with shared configurations: + +```yaml +# In ci-operator/config/org/repo/org-repo-main.yaml +tests: +- as: unit + commands: make test + container: + from: src + +# In ci-operator/config/org/repo/org-repo-release-4.15.yaml +# Inherits from main, adds version-specific tests +tests: +- as: e2e-4.15 + steps: + cluster_profile: aws + workflow: openshift-e2e-aws +``` + +### 2. Validate Before Submitting + +Always validate your configuration: +```bash +# Check YAML syntax +yamllint ci-operator/config/org/repo/*.yaml + +# Validate against schema +make validate-config + +# Test job generation +make jobs + +# Run rehearsals +make rehearse +``` + +### 3. Use Existing Patterns + +Copy from working examples: +```bash +# Find similar configurations +grep -r "workflow: openshift-e2e" ci-operator/config/ + +# Look for specific test patterns +grep -r "cluster_profile: aws" ci-operator/config/ | grep -v release- +``` + +## Getting Help + +When stuck with configuration: + +1. 
**Check examples**: Look at similar repos +2. **Validate locally**: Use make targets +3. **Read errors carefully**: They often indicate the fix +4. **Ask for review**: Tag `@openshift/test-platform` on your PR + +## Next Steps + +- [Step Registry Guide](/docs/architecture/step-registry/) - Understanding reusable components +- [Examples](/docs/getting-started/examples/) - Common configuration patterns +- [Job Execution Issues]({{< ref "job-execution-issues" >}}) - When jobs won't run \ No newline at end of file diff --git a/content/en/docs/troubleshooting/debugging-failed-jobs.md b/content/en/docs/troubleshooting/debugging-failed-jobs.md new file mode 100644 index 00000000..e6e3c34c --- /dev/null +++ b/content/en/docs/troubleshooting/debugging-failed-jobs.md @@ -0,0 +1,289 @@ +--- +title: "Debugging Failed Jobs" +description: How to investigate and fix CI job failures +weight: 2 +keywords: ["troubleshooting", "debug", "errors", "problems", "issues", "help", "debugging", "fix"] +aliases: + - /docs/debug/ + - /docs/fix-jobs/ + - /docs/job-failures/ +--- + +# Debugging Failed Jobs + +This guide helps you investigate why a CI job failed and how to fix it. + +## Quick Checklist + +When a job fails, check these in order: + +1. **Build Log** - Did the job start correctly? +2. **Test Output** - What specific test failed? +3. **Artifacts** - Are there error logs or must-gather data? +4. **Infrastructure** - Was there a cluster or network issue? +5. **Recent Changes** - What changed since it last worked? 
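
The first steps of this checklist can be partly scripted. The sketch below is a minimal illustration, not part of any CI tooling: the function name `classify_failure` and the category labels are hypothetical, and it assumes you have already downloaded the job's build log to a local file. It matches only error strings quoted in these troubleshooting guides.

```shell
#!/bin/bash
# Hypothetical triage helper (illustrative names, not an existing tool).
# Prints one category label for a locally saved build log, based on
# error strings quoted in the troubleshooting guides.
classify_failure() {
  local log_file="$1"

  # Refuse to guess when the log cannot be read.
  if [ ! -r "$log_file" ]; then
    echo "unreadable"
    return 1
  fi

  if grep -qi "no available quota" "$log_file"; then
    echo "quota"            # see Cluster Problems: Quota Issues
  elif grep -qi "unable to find secret" "$log_file"; then
    echo "secrets"          # see Access Issues: Secret Problems
  elif grep -qi "did not finish before .* timeout" "$log_file"; then
    echo "timeout"          # see Timeout Failures
  elif grep -qi "exceeded memory limit" "$log_file"; then
    echo "resources"        # see Resource Failures
  elif grep -q "FAIL:" "$log_file"; then
    echo "test-failure"     # see Test Failures
  else
    echo "unknown"
  fi
}
```

Calling `classify_failure build-log.txt` prints a single label that suggests which section to read first. Checks run from infrastructure causes to test failures, so a log with both a quota error and a failing test is reported as `quota`.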
+ +## Common Failure Types + +### Test Failures + +**Symptom**: Job runs but tests fail +``` +FAIL: TestMyFeature (10.23s) + myfeature_test.go:42: expected X but got Y +``` + +**Solutions**: +- Review the test output in artifacts +- Check if the test is flaky (fails intermittently) +- Verify test assumptions are correct +- Run the test locally to reproduce + +### Build Failures + +**Symptom**: Compilation or image build errors +``` +error: build error: unable to build image +``` + +**Solutions**: +- Check for syntax errors in code +- Verify base image availability +- Ensure all dependencies are specified +- Review Dockerfile for issues + +### Timeout Failures + +**Symptom**: Job killed after time limit +``` +error: Process did not finish before 4h0m0s timeout +``` + +**Solutions**: +- Increase timeout in job configuration: + ```yaml + tests: + - as: slow-test + timeout: 6h0m0s # Increase from default 4h + steps: + # ... + ``` +- Optimize test execution +- Split into multiple smaller jobs +- Check for hanging processes + +### Resource Failures + +**Symptom**: Out of memory or CPU +``` +Container exceeded memory limit +``` + +**Solutions**: +- Increase resource requests: + ```yaml + tests: + - as: memory-intensive + steps: + test: + - as: test + resources: + requests: + cpu: 2 + memory: 8Gi + limits: + memory: 10Gi + ``` +- Optimize resource usage +- Check for memory leaks + +## Debugging Techniques + +### 1. Add Debug Output + +Enhance your test with debugging information: + +```bash +#!/bin/bash +set -euxo pipefail # Print commands as they execute + +# Save environment for debugging +env | sort > ${ARTIFACT_DIR}/environment.txt + +# Add timing information +date > ${ARTIFACT_DIR}/test-start.txt + +# Your test commands here +make test || { + echo "Test failed, gathering debug info..." 
+ + # Capture system state + df -h > ${ARTIFACT_DIR}/disk-usage.txt + ps aux > ${ARTIFACT_DIR}/processes.txt + + # If using a cluster + oc get nodes -o wide > ${ARTIFACT_DIR}/nodes.txt + oc get pods --all-namespaces > ${ARTIFACT_DIR}/pods.txt + + exit 1 +} + +date > ${ARTIFACT_DIR}/test-end.txt +``` + +### 2. Use Test Artifacts + +Always save important files: + +```bash +# Save test results +go test -v ./... 2>&1 | tee ${ARTIFACT_DIR}/test-output.txt + +# Save coverage +go test -coverprofile=${ARTIFACT_DIR}/coverage.out ./... + +# Save any generated files +cp -r generated/ ${ARTIFACT_DIR}/ + +# For multi-stage tests, use SHARED_DIR +echo "important-value" > ${SHARED_DIR}/my-data.txt +``` + +### 3. Interactive Debugging + +For complex issues, use interactive debugging: + +1. **Open a debug shell for the test pod** (if still running): + ```bash + oc debug pod/<pod-name> -n <namespace> + ``` + +2. **Run a debug container**: + ```yaml + tests: + - as: debug-env + commands: | + # Keep pod running for debugging + echo "Debugging environment ready" + sleep 3600 # Keep alive for 1 hour + container: + from: src + ``` + +### 4. Check Infrastructure Issues + +Sometimes failures are infrastructure-related: + +```bash +# Check cluster health +oc get clusteroperators +oc get nodes +oc describe node <node-name> + +# Check pod events +oc get events --sort-by='.lastTimestamp' + +# Check resource usage +oc adm top nodes +oc adm top pods +``` + +## Analyzing Specific Errors + +### "No such file or directory" + +**Common causes**: +- File not included in image +- Wrong working directory +- Path typo + +**Debug**: +```bash +# List files to verify +ls -la +find . 
-name "expected-file" +pwd +``` + +### "Connection refused" or "Network timeout" + +**Common causes**: +- Service not ready +- Firewall/network policy +- Wrong endpoint + +**Debug**: +```bash +# Check service availability +curl -v http://service:port/health +netstat -an | grep LISTEN +oc get svc +oc get endpoints +``` + +### "Image pull errors" + +**Common causes**: +- Image doesn't exist +- Registry authentication failed +- Network issues + +**Debug**: +```bash +# Check image availability +skopeo inspect docker://registry.example.com/image:tag + +# Verify pull secret +oc get secret pull-secret -o yaml +``` + +## Working with Flaky Tests + +If a test fails intermittently: + +1. **Check historical pass rate**: + - Look at TestGrid for patterns + - Review recent PR results + +2. **Add retries for known flakes**: + ```go + func TestFlakyFeature(t *testing.T) { + for i := 0; i < 3; i++ { + err := tryTest() + if err == nil { + return + } + t.Logf("Attempt %d failed: %v", i+1, err) + time.Sleep(time.Second * 10) + } + t.Fatal("Test failed after 3 attempts") + } + ``` + +3. **Report persistent flakes**: + - File a Jira issue + - Consider marking test as `optional: true` + +## Getting Help + +If you're still stuck: + +1. **Collect all relevant information**: + - Job URL + - Error messages + - What you've tried + - Recent changes + +2. **Ask in Slack**: + - `#forum-ocp-testplatform` for CI issues + - Component-specific channels for test failures + +3. 
**File an issue**: + - Use Slack workflows to create Jira tickets + - Include reproduction steps + +## Next Steps + +- [Job Execution Issues]({{< ref "job-execution-issues" >}}) - If your job won't start +- [Configuration Issues]({{< ref "configuration-issues" >}}) - For YAML and setup problems +- [Cluster Problems]({{< ref "cluster-problems" >}}) - For cluster-specific failures \ No newline at end of file diff --git a/content/en/docs/troubleshooting/job-execution-issues.md b/content/en/docs/troubleshooting/job-execution-issues.md new file mode 100644 index 00000000..ff99f1b6 --- /dev/null +++ b/content/en/docs/troubleshooting/job-execution-issues.md @@ -0,0 +1,362 @@ +--- +title: "Job Execution Issues" +description: Troubleshooting jobs that won't start, trigger, or schedule +weight: 1 +--- + +# Job Execution Issues + +This guide helps you resolve issues when CI jobs won't start, don't trigger when expected, or stay in pending state. + +## Quick Diagnosis + +| Symptom | Likely Cause | Solution | +|---------|--------------|----------| +| Job doesn't appear on PR | Not configured correctly | [Job Not Triggering](#job-not-triggering) | +| Job stays "Pending" | Resource constraints | [Job Stuck Pending](#job-stuck-pending) | +| "Unknown job" error | Job name mismatch | [Unknown Job](#unknown-job-error) | +| Job runs unexpectedly | Trigger conditions | [Unexpected Execution](#unexpected-job-execution) | + +## Job Not Triggering + +### Problem: Job Doesn't Appear on PR + +**Symptoms**: +- No job status shown on GitHub PR +- `/test` command returns "unknown job" + +**Common causes**: + +1. **Job not generated**: + ```bash + # In openshift/release repo + make jobs + git status # Check for uncommitted changes + ``` + +2. **Wrong branch configuration**: + ```yaml + # Check file naming + ci-operator/config/org/repo/org-repo-main.yaml # For main branch + ci-operator/config/org/repo/org-repo-release-4.15.yaml # For release branch + ``` + +3. 
**Job is optional**: + ```yaml + tests: + - as: optional-test + optional: true # Won't run automatically + # ... + ``` + - Optional jobs must be triggered manually with `/test optional-test` + +### Problem: Conditional Jobs Not Running + +**Symptoms**: +- Job configured with `run_if_changed` doesn't trigger +- Expected job doesn't run on certain PRs + +**Solutions**: + +1. **Verify path patterns**: + ```yaml + tests: + - as: frontend-tests + run_if_changed: "^frontend/" # Only runs if frontend/ files change + ``` + +2. **Check skip conditions**: + ```yaml + tests: + - as: backend-tests + skip_if_only_changed: "^docs/" # Skips if only docs change + ``` + +3. **Debug with test command**: + ```bash + # Force run regardless of conditions + /test frontend-tests + ``` + +### Problem: Periodic Jobs Not Running + +**Symptoms**: +- Cron job doesn't execute on schedule +- Periodic job never triggers + +**Common issues**: + +1. **Invalid cron syntax**: + ```yaml + tests: + - as: nightly + cron: "0 0 * * *" # Runs at midnight UTC + # Common mistake: using 6 fields instead of 5 + ``` + +2. **Missing interval/cron**: + ```yaml + tests: + - as: periodic-test + interval: 24h # OR use cron, not both + ``` + +## Job Stuck Pending + +### Problem: Job Stays in Pending State + +**Common causes and solutions**: + +1. **Cluster capacity**: + - Check [cluster status](https://prow.ci.openshift.org/) + - Look for banner messages about capacity issues + - Wait and retry later + +2. **Resource requests too high**: + ```yaml + # Reduce resource requests + resources: + requests: + cpu: 500m # Instead of 4 + memory: 2Gi # Instead of 16Gi + ``` + +3. **Specific node requirements**: + ```yaml + # Remove unnecessary node selectors + nodeSelector: + node-role.kubernetes.io/tests: "" # May limit scheduling + ``` + +### Problem: "Could not schedule pod" + +**Error**: +``` +0/10 nodes are available: insufficient cpu +``` + +**Solutions**: + +1. 
**Check current cluster load**: + - Visit Grafana dashboards + - Look for cluster utilization + +2. **Use a different cluster**: + ```yaml + # In .ci-operator.yaml + cluster: build01 # Try different cluster + ``` + +3. **Reduce parallelism**: + ```yaml + # Limit concurrent test pods + tests: + - as: parallel-tests + steps: + test: + - as: tests + parallelism: 5 # Reduce from default + ``` + +## Unknown Job Error + +### Problem: "/test job-name" Returns Unknown + +**Error**: +``` +@user: The following jobs are not known to prow: job-name +``` + +**Debugging steps**: + +1. **Verify exact job name**: + ```yaml + # In ci-operator config + tests: + - as: e2e-aws # This is the exact name to use + ``` + - Use: `/test e2e-aws` + - Not: `/test e2e-aws-test` or `/test test-e2e-aws` + +2. **Check job generation**: + ```bash + # Jobs must be generated + cd openshift/release + make jobs + + # Verify job exists + grep -r "name: pull-ci-org-repo-.*-e2e-aws" ci-operator/jobs/ + ``` + +3. **Ensure PR is updated**: + ```bash + # Rebase your PR + git rebase upstream/master + git push --force + ``` + +## Unexpected Job Execution + +### Problem: Job Runs When It Shouldn't + +**Symptoms**: +- Job triggers on unrelated changes +- Periodic job runs too frequently + +**Common causes**: + +1. **Missing `run_if_changed`**: + ```yaml + tests: + - as: expensive-test + # Add condition to limit runs + run_if_changed: "^(cmd|pkg)/" + ``` + +2. **Always run is default**: + ```yaml + tests: + - as: selective-test + always_run: false # Must explicitly set to false + run_if_changed: "^specific-path/" + ``` + +## Rehearsal Jobs + +### Problem: Rehearsal Jobs Failing + +**Symptoms**: +- `ci/prow/rehearse` jobs fail +- Changes to ci-operator config blocked + +**Solutions**: + +1. **Check rehearsal output**: + - Click on failed rehearsal job + - Look for specific errors + - Often indicates breaking changes + +2. 
**Common rehearsal failures**: + - Renaming jobs (breaks TestGrid) + - Removing required jobs + - Invalid configuration + +3. **Test locally**: + ```bash + # Run rehearsals locally + make rehearse + ``` + +## Job Scheduling Issues + +### Problem: Jobs Not Running in Order + +**Symptoms**: +- Post jobs run before tests complete +- Dependencies not respected + +**Solutions**: + +1. **Use run_after_success**: + ```yaml + - as: publish + postsubmit: true + run_after_success: + - e2e-tests + - unit-tests + ``` + +2. **Chain dependencies correctly**: + ```yaml + chain: + as: sequential-tests + steps: + - ref: first-test # Runs first + - ref: second-test # Runs after first + ``` + +## Debugging Tools + +### View Job Configuration + +```bash +# See generated job config +ci-operator-configresolver -config ci-operator/config/org/repo/config.yaml -print-config + +# Validate job config +ci-operator-checkconfig -config ci-operator/config/org/repo/config.yaml +``` + +### Check Prow Status + +1. **Prow dashboard**: https://prow.ci.openshift.org/ +2. **PR history**: https://prow.ci.openshift.org/pr-history +3. **Job history**: https://prow.ci.openshift.org/job-history + +### Force Job Execution + +```bash +# Trigger specific job +/test job-name + +# Trigger all jobs +/retest + +# Skip specific jobs +/test all skip-job-name +``` + +## Prevention + +### Best Practices + +1. **Test configuration locally**: + ```bash + make validate-config + make jobs + ``` + +2. **Use descriptive job names**: + ```yaml + tests: + - as: e2e-aws-sdn # Clear what it tests + ``` + +3. **Set appropriate triggers**: + ```yaml + tests: + - as: expensive-test + optional: true # Don't run on every PR + ``` + +4. **Document special requirements**: + ```yaml + tests: + - as: special-test + # This test requires manual trigger due to X + optional: true + ``` + +## Getting Help + +If jobs still won't execute: + +1. 
**Verify basics**: + - Job name matches exactly + - Configuration is valid + - Jobs are generated + +2. **Check Prow status**: + - Look for outage banners + - Check Slack announcements + +3. **Ask for help**: + - Include PR link + - Show `/test` commands tried + - Share job configuration + +## Next Steps + +- [Debugging Failed Jobs]({{< ref "debugging-failed-jobs" >}}) - When jobs run but fail +- [Configuration Issues]({{< ref "configuration-issues" >}}) - For config problems +- [Cluster Problems]({{< ref "cluster-problems" >}}) - For infrastructure issues \ No newline at end of file