Fix instance profile leak #4879

jupierce · 2025-12-25T13:55:23Z

No description provided.

The old method of creating a deleting instance profiles was complex and led to a leak which debilitated AWS temporarily. Have all instances share the same instance profile and manage it via cloud formations.

Cloud formation IDs are too difficult to keep in sync with PCO pruning. Retain resources and just delete the lambda explicitly on "destroy"

openshift-ci-robot · 2025-12-25T13:55:25Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

coderabbitai · 2025-12-25T13:55:35Z

Walkthrough

This PR refactors NAT gateway cost reduction infrastructure by adding a multi-account monitoring script, enhancing deployment with AWS profile and region parameters, removing dynamic IAM role provisioning in favor of a pre-existing instance profile, and adding retention policies to CloudFormation resources for safer stack deletion.

Changes

Cohort / File(s)	Summary
Deployment and monitoring orchestration `hack/reduce_nat_gateway_cost/deploy.sh`, `hack/reduce_nat_gateway_cost/monitor_resources.py`	deploy.sh: Added usage function, script directory relocation, AWS_PROFILE and ACTION parameter handling, argument validation, explicit region configuration for all AWS CLI calls, and enhanced destroy logic with Lambda deletion and resource retention notes. monitor_resources.py: New monitoring script with 11+ public functions providing cross-account/cross-region NAT gateway resource checks, including orphaned resource detection, IAM health, NAT instance age analysis, Lambda error metrics, and optional continuous monitoring with alerting.
NAT instance provisioning logic `hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py`	Removed inline IAM policy block, get_iam_client and create_instance_profile functions. Switched from dynamic instance profile creation to using pre-existing NAT_INSTANCE_PROFILE_NAME constant. Added defensive checks for None returns, improved error handling with security group cleanup on failure, instance type iteration retry logic, and enhanced logging for timeouts and missing VPC tags. Updated VERSION from v1.1 to v1.2.
CloudFormation resource configuration `hack/reduce_nat_gateway_cost/use-nat-instance.yaml`, `hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml`	use-nat-instance.yaml: Added NatInstanceRole and NatInstanceProfile resources, applied DeletionPolicy: Retain to five resources (function role, log group, function, execution role, event rule), tightened PassRole IAM policy resource from wildcard to NatInstanceRole ARN, and removed dynamic IAM role management actions. use-nat-instance-forwarders.yaml: Added DeletionPolicy: Retain to UseNatInstanceForwardRole and UseNatInstanceForwardEventRule.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2025-12-25T13:55:48Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jupierce
Once this PR has been reviewed and has the lgtm label, please assign smg247 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)
659-740: Enhanced error handling with iterative instance type selection.

The refactored logic properly handles instance type availability by iterating through fallback options and ensures security group cleanup on failures. The approach aligns well with the CloudFormation-managed instance profile strategy.

Consider these improvements based on static analysis:

Line 724: Use logging.exception instead of logging.error to automatically include stack trace.

Lines 728, 738: Replace bare Exception catches with specific exception types or at least log the exception details.
🔎 Proposed refinements
     except Exception as e:
         # If instance creation fails, clean up the security group we created
-        logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
+        logger.exception(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')
-        except Exception as cleanup_error:
+        except botocore.exceptions.ClientError as cleanup_error:
             logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
         raise
 
     if not nat_instance:
         # All instance types failed - clean up the security group
         logger.error(f'All instance types failed. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')
-        except Exception as cleanup_error:
+        except botocore.exceptions.ClientError as cleanup_error:
             logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
         return None
hack/reduce_nat_gateway_cost/monitor_resources.py (1)
343-466: Well-structured orchestration with good visibility.

The main check function properly coordinates all monitoring checks and provides clear, color-coded output. The aggregation of issues across profiles is well done.

Consider addressing these minor style improvements:

Line 369: The present_resources variable is unpacked but never used

Line 374: The ip_count variable is unpacked but never used

Line 448: The all_healthy variable is assigned but never used
🔎 Proposed refinements
         # Check expected resources exist
-        missing_resources, present_resources = check_expected_resources(profile)
+        missing_resources, _ = check_expected_resources(profile)
         if missing_resources:
             profile_issues.extend([f"  {r}" for r in missing_resources])
         
         # Check instance profile count
-        ip_count, ip_warning = check_instance_profile_count(profile)
+        _, ip_warning = check_instance_profile_count(profile)
         if ip_warning:
             profile_issues.append(ip_warning)
For line 448, either remove the variable or use it:
     # Display infrastructure health summary
     print(f"{CYAN}Infrastructure Health:{NC}")
-    all_healthy = True
     for profile in PROFILES:
         missing, present = check_expected_resources(profile)
         if missing:
             print(f"  {profile}: {RED}UNHEALTHY - {len(missing)} missing resource(s){NC}")
-            all_healthy = False
         else:
             print(f"  {profile}: {GREEN}OK ({len(present)} resources){NC}")
     print()

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4a066ea and 5bd2a02.

📒 Files selected for processing (5)

hack/reduce_nat_gateway_cost/deploy.sh
hack/reduce_nat_gateway_cost/monitor_resources.py
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml
hack/reduce_nat_gateway_cost/use-nat-instance.yaml

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

hack/reduce_nat_gateway_cost/deploy.sh
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
hack/reduce_nat_gateway_cost/monitor_resources.py
hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml
hack/reduce_nat_gateway_cost/use-nat-instance.yaml

🧬 Code graph analysis (1)

hack/reduce_nat_gateway_cost/monitor_resources.py (1)

hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)

main (1061-1081)

🪛 Ruff (0.14.10)

hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py

724-724: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

728-728: Do not catch blind exception: Exception

(BLE001)

738-738: Do not catch blind exception: Exception

(BLE001)

hack/reduce_nat_gateway_cost/monitor_resources.py

49-49: Starting a process with a partial executable path

(S607)

55-55: Starting a process with a partial executable path

(S607)

65-65: Starting a process with a partial executable path

(S607)

86-86: Consider moving this statement to an else block

(TRY300)

256-256: Consider moving this statement to an else block

(TRY300)

369-369: Unpacked variable present_resources is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

374-374: Unpacked variable ip_count is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

401-401: Consider [f"Profile {profile}:", *profile_issues] instead of concatenation

Replace with [f"Profile {profile}:", *profile_issues]

(RUF005)

413-413: Consider ["NAT Instance Age Issues:", *age_issues] instead of concatenation

Replace with ["NAT Instance Age Issues:", *age_issues]

(RUF005)

448-448: Local variable all_healthy is assigned to but never used

Remove assignment to unused variable all_healthy

(F841)

🪛 Shellcheck (0.11.0)

hack/reduce_nat_gateway_cost/deploy.sh

[warning] 6-6: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🔇 Additional comments (31)

hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (8)

118-119: LGTM!

The updated comments accurately reflect the simplified IAM permission model where the Lambda only needs iam:PassRole for the pre-existing CloudFormation-managed instance profile, rather than the broader IAM permissions previously required for dynamic profile creation.

151-154: LGTM!

The constant correctly references the CloudFormation-managed instance profile and aligns with the profile name defined in use-nat-instance.yaml.

278-280: LGTM!

Good defensive coding to handle cases where the NAT gateway may not have been fully tagged during creation.

301-301: LGTM!

Proper guard against None from get_tag prevents potential AttributeError on string operations.

333-333: LGTM!

Proper guard against None from get_tag prevents potential AttributeError on string operations.

500-502: LGTM!

Proper error handling with early return prevents downstream issues when NAT instance creation fails. Security group cleanup is handled within create_nat_instance, ensuring no resource leaks.

644-645: LGTM!

Improved error logging with appropriate severity level and context.

963-963: LGTM!

More specific exception handling improves code clarity and prevents accidentally catching unrelated exceptions.

hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml (2)

7-7: LGTM!

DeletionPolicy: Retain on the IAM role prevents disruption to event forwarding when the CloudFormation stack is deleted. This aligns with the PR's strategy to avoid synchronization issues between CloudFormation and PCO pruning.

37-37: LGTM!

DeletionPolicy: Retain on the EventBridge rule ensures continued event forwarding even when the CloudFormation stack is deleted. This is consistent with the PR's retention policy approach.

hack/reduce_nat_gateway_cost/deploy.sh (3)

8-21: LGTM!

Clear usage instructions and proper argument validation improve the script's usability and safety for multi-account deployments.

24-44: LGTM!

Consistent use of --profile ensures operations target the correct AWS account. The deployment flow is well-structured.

48-66: LGTM!

The destroy logic properly handles the DeletionPolicy: Retain resources by explicitly deleting the Lambda function (the critical component) while providing clear user guidance about retained resources. The approach is safe and well-documented.

hack/reduce_nat_gateway_cost/use-nat-instance.yaml (7)

11-49: LGTM!

The new IAM role and instance profile correctly implement the least-privilege principle for NAT instance operations. The permissions are appropriately scoped:

ec2:ReplaceRoute allows the NAT instance to update route tables

ec2:CreateTags is restricted to route-table resources

ec2:ModifyInstanceAttribute enables disabling source/dest checks

The DeletionPolicy: Retain prevents resource cleanup issues during stack deletion, aligning with the PR's strategy to avoid CloudFormation/PCO pruning synchronization problems.

52-52: LGTM!

Retention policy prevents accidental deletion of the Lambda execution role, maintaining infrastructure stability.

83-83: LGTM!

Narrowing the PassRole permission from wildcard to the specific NatInstanceRole ARN improves security by following the principle of least privilege.

86-86: LGTM!

Retention policy ensures Lambda logs are preserved even after stack deletion.

92-92: LGTM!

Retention policy on the Lambda function is critical - as the deploy script notes, this is explicitly deleted on destroy. Retaining it in CloudFormation prevents accidental deletion while allowing controlled removal.

121-121: LGTM!

Retention policy prevents accidental deletion of the Lambda execution role used by EventBridge.

147-147: LGTM!

Retention policy ensures the EventBridge rule continues forwarding events even after stack deletion.

hack/reduce_nat_gateway_cost/monitor_resources.py (11)

1-41: LGTM!

Well-structured monitoring script with clear configuration constants and proper documentation. The multi-account, multi-region approach is appropriate for the use case.

43-75: LGTM!

The alarm function provides useful feedback with graceful fallbacks across different operating systems. Good user experience.

77-91: LGTM!

Utility functions are well-implemented with proper error handling for VPC existence checks.

93-157: LGTM!

Thorough orphaned resource detection with proper VPC validation. The function correctly returns both issues and NAT instance details for age checking.

159-187: LGTM!

IAM resource leak detection properly identifies the "Created-" prefix pattern that indicates dynamically created resources that should have been cleaned up.

189-206: LGTM!

Instance profile count monitoring with appropriate threshold warning helps prevent hitting AWS service limits.

208-226: LGTM!

NAT instance age checking with proper timezone handling ensures instances aren't left running indefinitely.

228-259: LGTM!

CloudWatch metrics integration for Lambda error tracking is well-implemented with appropriate error handling.

261-269: LGTM!

Simple region-wise counting provides useful visibility into resource distribution.

271-341: LGTM!

Comprehensive resource existence checks across Lambda, IAM, and EventBridge resources with proper error handling for different AWS service exceptions.

468-515: LGTM!

Clean CLI interface with appropriate argument parsing and proper signal handling for continuous monitoring mode.

coderabbitai · 2025-12-25T14:00:40Z

hack/reduce_nat_gateway_cost/deploy.sh

+# Change to the directory where this script is located
+cd "$(dirname "$0")"


⚠️ Potential issue | 🟡 Minor

Add error handling for directory change.

The cd command can fail if the directory doesn't exist or is inaccessible, which would cause the script to continue in the wrong directory and potentially deploy to the wrong location.

🔎 Proposed fix

# Change to the directory where this script is located -cd "$(dirname "$0")" +cd "$(dirname "$0")" || exit 1

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Change to the directory where this script is located

cd "$(dirname "$0")"

# Change to the directory where this script is located

cd "$(dirname "$0")" || exit 1

🧰 Tools

🪛 Shellcheck (0.11.0)

[warning] 6-6: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents

In hack/reduce_nat_gateway_cost/deploy.sh around lines 5-6, the script blindly runs cd "$(dirname "$0")" which can fail; update the script to check the result of the cd and abort on failure: attempt to change to the script directory, and if cd returns a non-zero status print a descriptive error to stderr and exit with a non-zero status to avoid continuing in the wrong directory.

openshift-ci · 2025-12-25T15:01:03Z

@jupierce: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/format	`5bd2a02`	link	true	`/test format`
ci/prow/codegen	`5bd2a02`	link	true	`/test codegen`
ci/prow/unit	`5bd2a02`	link	true	`/test unit`
ci/prow/validate-vendor	`5bd2a02`	link	true	`/test validate-vendor`
ci/prow/checkconfig	`5bd2a02`	link	true	`/test checkconfig`
ci/prow/frontend-checks	`5bd2a02`	link	true	`/test frontend-checks`
ci/prow/breaking-changes	`5bd2a02`	link	false	`/test breaking-changes`
ci/prow/integration	`5bd2a02`	link	true	`/test integration`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

jupierce added 2 commits December 24, 2025 11:22

Use cloud formation for static instance profile

9d0315e

The old method of creating a deleting instance profiles was complex and led to a leak which debilitated AWS temporarily. Have all instances share the same instance profile and manage it via cloud formations.

Retain resources on cloud formation deletion

5bd2a02

Cloud formation IDs are too difficult to keep in sync with PCO pruning. Retain resources and just delete the lambda explicitly on "destroy"

openshift-ci bot requested review from Prucek and bear-redhat December 25, 2025 13:55

coderabbitai bot reviewed Dec 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix instance profile leak #4879

Fix instance profile leak #4879

Uh oh!

jupierce commented Dec 25, 2025

Uh oh!

openshift-ci-robot commented Dec 25, 2025

Uh oh!

coderabbitai bot commented Dec 25, 2025 •

edited

Loading

Uh oh!

openshift-ci bot commented Dec 25, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Dec 25, 2025

Uh oh!

openshift-ci bot commented Dec 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		# Change to the directory where this script is located
		cd "$(dirname "$0")"

Fix instance profile leak #4879

Are you sure you want to change the base?

Fix instance profile leak #4879

Uh oh!

Conversation

jupierce commented Dec 25, 2025

Uh oh!

openshift-ci-robot commented Dec 25, 2025

Uh oh!

coderabbitai bot commented Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

openshift-ci bot commented Dec 25, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2025

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Dec 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Dec 25, 2025 •

edited

Loading