@XploY04
Collaborator

XploY04 commented Oct 2, 2025

This PR introduces a production-ready chaos testing framework for CloudNativePG clusters, combining Jepsen (formal consistency verification) with Litmus Chaos (pod deletion) to verify data integrity under failure conditions by checking the recorded operation history for consistency anomalies.

Part of LFX Mentorship Program 2025/3 to strengthen CloudNativePG resilience through rigorous testing.

Closes #2

What's Included

chaos-testing/
├── README.md                          # Comprehensive documentation (1300+ lines)
├── pg-eu-cluster.yaml                 # PostgreSQL cluster config
├── litmus-rbac.yaml                   # Chaos permissions
├── experiments/
│   └── cnpg-jepsen-chaos.yaml         # Combined Jepsen + chaos test
├── workloads/
│   ├── jepsen-cnpg-job.yaml           # Jepsen test job
│   └── jepsen-results-pvc.yaml        # Result storage
└── scripts/
    ├── run-jepsen-chaos-test.sh       # Main orchestration
    ├── monitor-cnpg-pods.sh           # Real-time monitoring
    └── get-chaos-results.sh           # Result extraction

Test Workflow

  1. Deploy CloudNativePG cluster (1 primary + 2 replicas)
  2. Start Jepsen workload (continuous read/write operations)
  3. Inject chaos (delete primary pod every 180s)
  4. CloudNativePG performs automatic failover
  5. Jepsen continues operations throughout chaos
  6. Elle checker analyzes history for consistency violations
  7. Generate verdict: :valid? true (PASS) or :valid? false (FAIL)

Quick Start

# Deploy PostgreSQL cluster
kubectl apply -f pg-eu-cluster.yaml
kubectl wait --for=condition=ready cluster/pg-eu --timeout=300s

# Configure chaos RBAC
kubectl apply -f litmus-rbac.yaml

# Run 5-minute test
./scripts/run-jepsen-chaos-test.sh

# Check results
grep ":valid?" logs/jepsen-chaos-*/results/results.edn
cat logs/jepsen-chaos-*/STATISTICS.txt
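
The `grep` above only prints the verdict line. A minimal sketch of turning it into an exit code for scripts or CI, assuming results.edn contains literal `:valid? true` / `:valid? false` entries as in the expected output below; `check_verdict` is a hypothetical helper, not part of this PR:

```shell
# Sketch: map the Jepsen verdict in results.edn to PASS / FAIL / UNKNOWN.
# Any ":valid? false" (including nested checker results) counts as a failure.
check_verdict() {
  local results_file="$1"
  if [ ! -s "$results_file" ]; then
    echo "UNKNOWN"   # file missing or empty: no verdict was written
    return 2
  fi
  if grep -q ':valid? false' "$results_file"; then
    echo "FAIL"
    return 1
  fi
  if grep -q ':valid? true' "$results_file"; then
    echo "PASS"
    return 0
  fi
  echo "UNKNOWN"     # e.g. the checker could not decide
  return 2
}
```

Treating an empty file as `UNKNOWN` rather than `FAIL` keeps a missing verdict distinguishable from a detected anomaly.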

Expected Results

Successful Test

{:valid? true
 :anomaly-types []
 :not #{}}

Statistics

Total :ok     : 14,523  (97.03%)  # Successful operations
Total :fail   : 445     (2.97%)   # Expected during failover
Total :info   : 0       (0.00%)   # Indeterminate operations
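
For reference, the percentages above are each count divided by the sum of the three outcome counts (14,523 + 445 + 0 = 14,968). A rough sketch of how such a summary could be derived from history.edn, assuming one operation map per line with a `:type :ok`, `:type :fail`, or `:type :info` entry; `summarize_history` is a hypothetical helper, and the script in this PR may compute its statistics differently:

```shell
# Sketch: tally completed operations by outcome and print their shares.
# Lines with ":type :invoke" are ignored because they are not outcomes.
summarize_history() {
  awk '
    /:type :ok/   { ok++ }
    /:type :fail/ { fail++ }
    /:type :info/ { info++ }
    END {
      total = ok + fail + info
      if (total == 0) { print "no completed operations found"; exit 1 }
      printf "Total :ok   : %d (%.2f%%)\n", ok,   100 * ok / total
      printf "Total :fail : %d (%.2f%%)\n", fail, 100 * fail / total
      printf "Total :info : %d (%.2f%%)\n", info, 100 * info / total
    }
  ' "$1"
}
```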

Result Files

  • results.edn - Consistency verdict (:valid? true/false)
  • history.edn - Complete operation log (3-6 MB)
  • timeline.html - Interactive visualization
  • STATISTICS.txt - High-level summary
  • jepsen.log - Full test execution logs

Technical Highlights

Dynamic Pod Targeting

TARGETS: "deployment:default:[cnpg.io/cluster=pg-eu,cnpg.io/instanceRole=primary]:intersection"
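
To see which pod this selector resolves to at a given moment, the same two labels can be queried directly. A small sketch, with the cluster name and namespace taken from the TARGETS string above; `current_primary` is a hypothetical helper:

```shell
# Sketch: print the pod currently labelled as primary for a CNPG cluster.
# Prints nothing if no pod carries the primary label (e.g. mid-failover).
current_primary() {
  local cluster="${1:-pg-eu}" namespace="${2:-default}"
  kubectl get pods -n "$namespace" \
    -l "cnpg.io/cluster=${cluster},cnpg.io/instanceRole=primary" \
    -o name | head -n 1
}
```

Running it before and after a chaos iteration shows whether the primary role moved to a different instance.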

Comprehensive Probes

  • Start-of-test: Cluster health check
  • End-of-test: Cluster recovery verification
  • Continuous: Replication lag monitoring (Prometheus)
  • Continuous: Primary availability check
  • Continuous: Cluster status validation

Configurable Parameters

  • Test duration: 60s to 1800s+
  • Chaos interval: 30s to 300s
  • Operation rate: 10-100+ ops/sec
  • Concurrency: 1-20 workers
  • Isolation levels: read-committed, repeatable-read, serializable
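
A sketch of a pre-flight check for these parameters. The function name and argument order are hypothetical, and the rule that the chaos interval should be shorter than the test duration is an assumption added here (otherwise at most one deletion would fire), not something stated in this PR:

```shell
# Sketch: reject obviously invalid parameter combinations before a run.
validate_params() {
  local duration="$1" interval="$2" isolation="$3" n
  case "$isolation" in
    read-committed|repeatable-read|serializable) ;;
    *) echo "unsupported isolation level: $isolation" >&2; return 1 ;;
  esac
  for n in "$duration" "$interval"; do
    case "$n" in
      ''|*[!0-9]*) echo "expected whole seconds, got '$n'" >&2; return 1 ;;
    esac
  done
  if [ "$interval" -ge "$duration" ]; then
    echo "chaos interval (${interval}s) should be shorter than the test (${duration}s)" >&2
    return 1
  fi
}
```

For example, `validate_params 300 180 serializable` accepts a 5-minute test with a 180-second chaos interval.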

Testing Performed

  • ✅ Baseline tests (no chaos)
  • ✅ Primary pod deletion
  • ✅ Multiple failover cycles
  • ✅ Different isolation levels
  • ✅ Various chaos intervals

Future Enhancements

  • Network partition testing
  • Storage failure simulation
  • Multi-cluster testing
  • Performance regression detection
  • CI/CD pipeline templates

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

- Updated README.md with prerequisites, environment setup, and chaos experiment instructions.
- Created EXPERIMENT-GUIDE.md for detailed chaos experiment execution and monitoring.
- Added YAML files for chaos experiments: cnpg-primary-pod-delete.yaml, cnpg-random-pod-delete.yaml, and cnpg-replica-pod-delete.yaml.
- Implemented Litmus RBAC configuration in litmus-rbac.yaml.
- Configured PostgreSQL cluster in pg-eu-cluster.yaml.
- Developed scripts for environment verification (check-environment.sh) and chaos results retrieval (get-chaos-results.sh).
- Enhanced status check script (status-check.sh) for Litmus installation verification.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…TARGET_PODS

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
… updating documentation. Added support for chaos experiments without hard-coded pod names, improved README and quick start guides, and introduced monitoring scripts for better visibility during chaos experiments.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…a consistency verification

- Implemented `setup-cnp-bench.sh` for configuring cnp-bench with detailed instructions for benchmarking CloudNativePG.
- Created `setup-prometheus-monitoring.sh` to apply PodMonitor configurations for Prometheus metrics scraping.
- Developed `verify-data-consistency.sh` to check data integrity after chaos experiments, including various consistency tests.
- Added `pgbench-continuous-job.yaml` for running continuous pgbench workloads during chaos testing, with options for custom workloads.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
… Prometheus

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…for consistency in chaos experiments

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
- Created a Kubernetes Job definition for running the Jepsen PostgreSQL consistency test against a CloudNativePG cluster.
- The job includes environment variables for configuration, command execution for testing, and result handling.
- Added a PersistentVolumeClaim for storing Jepsen test results with a request for 2Gi of storage.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
@gbartolini
Contributor

Please use the latest version of the operator in the installation instructions:

kubectl apply -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.20/releases/cnpg-1.20.0.yaml

…oring enhancements

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
- Implemented a comprehensive bash script to orchestrate Jepsen consistency testing with chaos experiments.
- The script includes pre-flight checks, database cleanup, PVC management, Jepsen job deployment, chaos experiment application, and result extraction.
- Added logging functionality with color-coded output for better readability.
- Integrated error handling and cleanup procedures to ensure graceful exits and resource management.
- Provided detailed usage instructions and exit codes for user guidance.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…ration

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…ting

- Introduced a new ChaosEngine configuration () for running Jepsen tests without Prometheus probes, allowing for chaos testing in environments lacking monitoring.
- Updated existing  to remove unnecessary probe configurations and ensure compatibility with the new no-probes variant.
- Modified  to include a Service definition for metrics collection and changed PodMonitor to ServiceMonitor for better integration with Prometheus.
- Removed obsolete  and Jepsen job configurations that are no longer needed.
- Deleted scripts for fetching chaos results and monitoring CNPG pods, streamlining the testing process.
- Enhanced  to include namespace and context parameters for improved flexibility.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…d improve primary pod identification logic

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
XploY04 marked this pull request as ready for review November 20, 2025 21:14
XploY04 requested a review from a team as a code owner November 20, 2025 21:14
… replication monitoring

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
README.md Outdated
>
> ```bash
> VERSION="v1.27.1"
> curl -L "https://github.com/cloudnative-pg/cloudnative-pg/releases/download/${VERSION}/kubectl-cnpg_${VERSION}_linux_amd64.tar.gz" -o /tmp/kubectl-cnpg.tar.gz
> ```
Member

This command line doesn't work: the version in the file name is not `v1.27.1`, and `amd64` isn't in the name of the binary to download.
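
A sketch of how the URL could be built instead. It assumes that the release tag keeps the leading `v` while the asset file name drops it, and that the asset uses `x86_64` rather than `amd64`; both assumptions should be confirmed against the project's releases page. `cnpg_plugin_url` is a hypothetical helper:

```shell
# Sketch: build the kubectl-cnpg tarball URL for a given version and architecture.
# ASSUMPTION: assets are named kubectl-cnpg_<version>_linux_<arch>.tar.gz.
cnpg_plugin_url() {
  local version="${1#v}"   # accept "v1.27.1" or "1.27.1"
  local arch
  case "${2:-$(uname -m)}" in
    x86_64|amd64)  arch="x86_64" ;;
    aarch64|arm64) arch="arm64" ;;
    *) echo "unsupported architecture" >&2; return 1 ;;
  esac
  echo "https://github.com/cloudnative-pg/cloudnative-pg/releases/download/v${version}/kubectl-cnpg_${version}_linux_${arch}.tar.gz"
}
```

Usage would be along the lines of `curl -fL "$(cnpg_plugin_url v1.27.1)" -o /tmp/kubectl-cnpg.tar.gz`, where `-f` makes curl fail on a 404 instead of saving the error page.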

README.md Outdated

```bash
# Re-export the playground kubeconfig if you opened a new shell
export KUBECONFIG=/path/to/cnpg-playground/k8s/kube-config.yaml
```
Member

probably something like

export KUBECONFIG=$PWD/k8s/kube-config.yaml

since you are already inside the cnpg-playground directory.

README.md Outdated
Comment on lines 79 to 85
# Apply the 1.27.1 operator manifest exactly as documented
kubectl apply --server-side -f \
https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.27/releases/cnpg-1.27.1.yaml

# Alternatively, generate a custom manifest via the kubectl cnpg plugin
kubectl cnpg install generate --control-plane \
| kubectl apply --context kind-k8s-eu -f - --server-side
Member

I would keep one or the other, but not both; it's confusing. My first question was "Why am I installing the same thing twice?"

> Follow these sections in order; each references the authoritative upstream documentation to keep this README concise.
### 1. Bootstrap the CNPG Playground
Member

For this test it's important to mention that the max open files limit must be increased, otherwise it doesn't work. This is covered in the playground, IIRC.

Collaborator Author

I didn't think about this. I have added that.

Member

Where was this added? I can't see it.

Collaborator Author

  • Increase max open files limit if needed (required for Jepsen on some systems):
    ulimit -n 65536

In Prerequisites before running the script.
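
A sketch of how a script could check this limit up front rather than relying on the reader to run `ulimit` by hand. The 65536 default comes from the bullet above; `ensure_open_files` is a hypothetical helper:

```shell
# Sketch: make sure the soft open-files limit is at least the wanted value.
ensure_open_files() {
  local wanted="${1:-65536}" current
  current=$(ulimit -n)
  if [ "$current" = "unlimited" ] || [ "$current" -ge "$wanted" ]; then
    return 0
  fi
  # Raising the soft limit fails when the hard limit is lower than the target.
  ulimit -n "$wanted" 2>/dev/null || {
    echo "open files limit is $current; need $wanted" >&2
    return 1
  }
}
```

Note that `ulimit` only affects the current shell and its children, so the check has to run in the same shell that later starts the test.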

README.md Outdated
kubectl apply -n litmus -f https://hub.litmuschaos.io/api/chaos/master?file=faults/kubernetes/pod-delete/fault.yaml

# OR install from local file (if you need customization)
kubectl apply -n litmus -f chaosexperiments/pod-delete-cnpg.yaml
Member

This failed with the following error:

error: the namespace from the provided object "default" does not match the namespace "litmus". You must pass '--namespace=default' to perform this operation.
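
This error appears when a manifest hard-codes `metadata.namespace` and `-n` names a different one. A rough sketch of detecting the mismatch before applying; it is a plain text match rather than a YAML parser, so it also picks up `namespace:` fields outside `metadata`, and both function names are hypothetical:

```shell
# Sketch: list the distinct namespaces a manifest declares.
declared_namespaces() {
  sed -n 's/^[[:space:]]*namespace:[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1" \
    | tr -d "\"'" | sort -u
}

# Sketch: fail if the manifest declares a namespace other than the target.
check_namespace() {
  local manifest="$1" target="$2" ns
  for ns in $(declared_namespaces "$manifest"); do
    if [ "$ns" != "$target" ]; then
      echo "manifest declares namespace '$ns', not '$target'" >&2
      return 1
    fi
  done
}
```

With a check like `check_namespace chaosexperiments/pod-delete-cnpg.yaml litmus` the mismatch surfaces with a clearer message; the actual fix is either to change the namespace in the file or to drop `-n litmus`.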

Comment on lines 182 to 183
# Watch the chaos runner pod start (refreshes every 2s)
watch -n2 'kubectl -n litmus get pods | grep cnpg-jepsen-chaos-noprobes-runner'
Member

And then what do I do? It's not clear here whether I should exit at some point; I'm guessing yes.

README.md Outdated
Comment on lines 188 to 190
# Check experiment logs to see pod deletions (ensure a pod exists first)
runner_pod=$(kubectl -n litmus get pods -l chaos-runner-name=cnpg-jepsen-chaos-noprobes -o jsonpath='{.items[0].metadata.name}') && \
kubectl -n litmus logs -f "$runner_pod"
Member

This failed with the following error:

runner_pod=$(kubectl -n litmus get pods -l chaos-runner-name=cnpg-jepsen-chaos-noprobes -o jsonpath='{.items[0].metadata.name}') && \
kubectl -n litmus logs -f "$runner_pod"
error: error executing jsonpath "{.items[0].metadata.name}": Error executing template: array index out of bounds: index 0, length 0. Printing more information for debugging the template:
	template was:
		{.items[0].metadata.name}
	object given to jsonpath engine was:
		map[string]interface {}{"apiVersion":"v1", "items":[]interface {}{}, "kind":"List", "metadata":map[string]interface {}{"resourceVersion":""}}
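
The jsonpath fails because `.items[0]` is evaluated against an empty list, i.e. no runner pod matched the label yet. A sketch of a more tolerant lookup that polls until a pod appears, using `-o name` (which prints nothing when there is no match). `wait_for_runner_pod` and the `POLL_SECONDS` variable are hypothetical; the label and namespace are the ones from the snippet under review:

```shell
# Sketch: wait for the chaos runner pod to exist, then print "pod/<name>".
wait_for_runner_pod() {
  local engine="$1" namespace="${2:-litmus}" attempts="${3:-30}" pod i=0
  while [ "$i" -lt "$attempts" ]; do
    pod=$(kubectl -n "$namespace" get pods \
      -l "chaos-runner-name=${engine}" -o name 2>/dev/null | head -n 1)
    if [ -n "$pod" ]; then
      echo "$pod"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_SECONDS:-2}"
  done
  echo "no runner pod for ${engine} after ${attempts} attempts" >&2
  return 1
}
```

It could then be used as `runner_pod=$(wait_for_runner_pod cnpg-jepsen-chaos-noprobes) && kubectl -n litmus logs -f "$runner_pod"`.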

README.md Outdated
Expose the CNPG metrics port (9187) through a dedicated Service + ServiceMonitor bundle, then verify Prometheus scrapes it. Manual management keeps you aligned with the operator deprecation of `spec.monitoring.enablePodMonitor` and dodges the PodMonitor regression in kube-prometheus-stack v79 where CNPG pods only advertise the `postgresql` and `status` ports:

```bash
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
```
Member

This throws the following error:

 kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
Warning: resource namespaces/monitoring is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
namespace/monitoring configured
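
The output above is a warning rather than a failure: the namespace already existed and had not been created with `kubectl apply`, so the annotation was patched in. A sketch of an idempotent alternative that avoids the message by creating the namespace only when it is absent; `ensure_namespace` is a hypothetical helper:

```shell
# Sketch: create a namespace only if it does not exist yet.
ensure_namespace() {
  local ns="$1"
  if kubectl get namespace "$ns" >/dev/null 2>&1; then
    echo "namespace/$ns already exists"
  else
    kubectl create namespace "$ns"
  fi
}
```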


> **Note:** Keep using `experiments/cnpg-jepsen-chaos-noprobes.yaml` until Section 5 installs Prometheus/Grafana. Once monitoring is online, switch to `experiments/cnpg-jepsen-chaos.yaml` (probes enabled) for full observability.
### 5. Configure monitoring (Prometheus + Grafana)
Member

At this step I ran out of disk space, with 20G available. This should be clarified: my test stopped here because it ran out of space.

Collaborator Author

I added this now.
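
A sketch of a pre-flight disk check that would surface this before the monitoring stack is installed. The thread only establishes that 20G was not enough, so the threshold is left as a parameter; `require_free_gb` is a hypothetical helper:

```shell
# Sketch: fail early if the filesystem holding a path has too little free space.
require_free_gb() {
  local path="$1" min_gb="$2" free_kb
  # POSIX df output: line 2, column 4 is the available space in 1K blocks.
  free_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  if [ "$free_kb" -lt $((min_gb * 1024 * 1024)) ]; then
    echo "only $((free_kb / 1024 / 1024)) GiB free on $path; need ${min_gb} GiB" >&2
    return 1
  fi
}
```

The path to check should be the one backing the container runtime's storage, which depends on the local setup.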

…, refine Jepsen prerequisites, and improve various command examples.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…refine chaos result summary output.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
… corresponding setup instructions.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…p and removing optional Litmus UI and advanced CNPG install details.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
README.md Outdated
export KUBECONFIG=$PWD/k8s/kube-config.yaml
kubectl config use-context kind-k8s-eu

# Apply the 1.27.1 operator manifest exactly as documented
Contributor

Given that you have the plugin installed, I would rely on the plugin to install the latest version of the operator. See https://github.com/cloudnative-pg/cnpg-playground/blob/main/demo/setup.sh#L65

…est runner with EOT probe checks, and streamline cluster credential handling.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…the for the latest version.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
- Change schedule from Sunday 2 AM UTC to 2 PM Italy time (13:00 UTC)
- Add pull_request trigger for main and dev-2 branches
- Makes workflow visible in Actions tab for manual triggering

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

---

### `setup-prometheus`
Contributor

Can you please revisit this with the new feature for monitoring directly introduced in the cnpg-playground?

See: https://github.com/cloudnative-pg/cnpg-playground?tab=readme-ov-file#monitoring-with-prometheus-and-grafana

Collaborator Author

Made the necessary changes.

…namespace, and switch to PodMonitor for CNPG metrics.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…ent probes.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…bleshooting steps.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
With CNPG 1.28 there is no need to specify the TCP timeout for standbys.

I have removed the two terminal story.

Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
@gbartolini
Contributor

I have reviewed the README file, making some minor adjustments, and followed the instructions.

I am not an expert in Jepsen, but I was able to run it and the diagrams made sense.

In the instructions that you shared at the end of the script though there are some aspects that I wasn't able to completely review. In particular:

  • I had no SUMMARY.txt file after Jepsen's run
  • The results.edn file was empty

I am merging the patch, but I would like you @XploY04 to revisit that part after.
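
A sketch of a post-run check that would flag the two symptoms above (a missing summary file and an empty results.edn) before the final instructions are printed. The default file names are taken from the "Result Files" list in the PR description; `check_result_files` is a hypothetical helper:

```shell
# Sketch: report result files that are missing or empty after a run.
# Usage: check_result_files <results-dir> [file ...]
check_result_files() {
  local dir="$1" missing=0 f
  shift
  [ "$#" -gt 0 ] || set -- results.edn history.edn jepsen.log
  for f in "$@"; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Extra names such as the summary file can be passed as additional arguments once its final name is settled.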

…rlencode

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…ion step

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Updated README to reflect changes in chaos testing workflows and prerequisites.

Signed-off-by: Yash Agarwal <2004agarwalyash@gmail.com>
…esults

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: Yash Agarwal <2004agarwalyash@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
gbartolini merged commit b97635d into main Dec 11, 2025
2 checks passed
Development

Successfully merging this pull request may close these issues.

Initial pilot for chaos testing project

3 participants