Fix Intermittent Test Failures#1301
Conversation
7aa25f8 to
5f91c2c
Compare
There was a problem hiding this comment.
Pull request overview
This pull request improves the reliability of integration tests for SMS and organizational unit (OU) registration flows by addressing intermittent failures that occur in resource-constrained environments. The changes focus on two main areas: increasing timeouts for asynchronous operations and introducing a retry mechanism with exponential backoff.
Changes:
- Added a new
RetryWithBackoffutility function intests/integration/flow/common/utils.goto handle transient failures with exponential backoff - Increased wait times from 500ms-1s to 2 seconds in multiple test cases across
sms_registration_test.goandou_registration_test.goto accommodate slower environments - Modified
generateUniqueMobileNumberto use the fullUnixNanovalue instead of modulo for better uniqueness in parallel test runs
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/integration/flow/common/utils.go | Added RetryWithBackoff utility function for handling transient failures with exponential backoff |
| tests/integration/flow/registration/sms_registration_test.go | Increased SMS wait timeout to 2 seconds and added retry logic with backoff for flow completion in one test case |
| tests/integration/flow/registration/ou_registration_test.go | Increased OTP wait timeout to 2 seconds, added retry logic with backoff for flow completion, and updated mobile number generation to use full UnixNano for better uniqueness |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1301 +/- ##
=======================================
Coverage 89.94% 89.94%
=======================================
Files 629 629
Lines 41116 41116
Branches 2390 2390
=======================================
Hits 36981 36981
Misses 2235 2235
Partials 1900 1900
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
5f91c2c to
d42f664
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a RetryWithBackoff utility and integrates exponential-backoff retries plus longer waits into SMS and OU registration integration tests; also updates a mobile number formatting string in an OU registration test. All changes are within test code and a test utility file. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/integration/flow/registration/ou_registration_test.go`:
- Around line 885-888: generateUniqueMobileNumber currently uses
time.Now().UnixNano() producing >15-digit numbers that violate E.164; change
generateUniqueMobileNumber to produce a country-coded E.164-compliant string by
limiting the unique suffix to at most 10 digits (e.g., use UnixNano() %
10000000000 and format with zero padding so the total digits including the "+1"
country code do not exceed 15). Also update formatPhoneNumber to assert/validate
the final length is <=15 (and return an error or fail the test if not) so the
helper cannot produce non-compliant numbers; reference functions:
generateUniqueMobileNumber and formatPhoneNumber.
🧹 Nitpick comments (3)
tests/integration/flow/common/utils.go (1)
503-524: Consider handling edge case whenmaxRetriesis 0 or negative.If
maxRetriesis 0, the loop never executes, and the function returnsnilwithout ever calling the operation. This could silently skip operations if a caller passes an invalid value.🛠️ Suggested defensive check
func RetryWithBackoff(operation func() error, maxRetries int, initialDelay time.Duration) error { + if maxRetries <= 0 { + return fmt.Errorf("maxRetries must be positive, got %d", maxRetries) + } var lastErr error delay := initialDelaytests/integration/flow/registration/ou_registration_test.go (1)
824-846: Retry logic returns success on ERROR status, which may confuse readers.When
flowStep.FlowStatus == "ERROR", the closure returnsnil, signaling success toRetryWithBackoff. The test then fails at the assertion on line 846. While functionally correct, this pattern is non-obvious since returningniltypically means "operation succeeded."Consider returning a distinguishable error or using a different pattern to make the intent clearer.
♻️ Alternative approach using a sentinel error
+var errPermanentFailure = fmt.Errorf("permanent failure, stop retrying") + err = common.RetryWithBackoff(func() error { flowStep, err = common.CompleteFlow(flowStep.FlowID, inputs, "action_001") if err != nil { return err } // Don't retry on ERROR status - these are permanent errors if flowStep.FlowStatus == "ERROR" { - return nil // Stop retrying, let the test assertion handle it + return errPermanentFailure } if flowStep.FlowStatus != "INCOMPLETE" { return fmt.Errorf("unexpected flow status: %s", flowStep.FlowStatus) } return nil }, 3, 1*time.Second) - ts.Require().NoError(err) + if err != nil && err.Error() != errPermanentFailure.Error() { + ts.Require().NoError(err) + }tests/integration/flow/registration/sms_registration_test.go (1)
551-567: Variableerrfrom outer scope is reassigned inside the retry closure.Line 555 reassigns
errfrom the outer scope inside the closure. While this works in Go, it's error-prone and reduces readability. Consider using a distinct variable name inside the closure.♻️ Suggested clarification
var otpStep *common.FlowStep err = common.RetryWithBackoff(func() error { - otpStep, err = common.CompleteFlow(flowStep.FlowID, inputs, "action_001") - if err != nil { - return err + var completeErr error + otpStep, completeErr = common.CompleteFlow(flowStep.FlowID, inputs, "action_001") + if completeErr != nil { + return completeErr } // Don't retry on ERROR status - these are permanent errors if otpStep.FlowStatus == "ERROR" { return nil // Stop retrying, let the test assertion handle it } if otpStep.FlowStatus != "INCOMPLETE" { return fmt.Errorf("unexpected flow status: %s", otpStep.FlowStatus) } return nil }, 3, 1*time.Second)
d42f664 to
a15a03a
Compare
a15a03a to
d94085c
Compare
d94085c to
dae668a
Compare
dae668a to
5bc4cd6
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/integration/flow/common/utils.go`:
- Around line 503-524: Add unit tests exercising RetryWithBackoff: one test
where the supplied operation fails a fixed number of times then succeeds (verify
it returns nil and is called expected attempts), and one where it always fails
(verify the returned error wraps the final operation error and indicates
maxRetries). Use small initialDelay (e.g., 1ms) to keep tests fast, drive the
operation with a counter/closure to control when it succeeds, and assert call
counts and error wrapping; reference the RetryWithBackoff function, its
maxRetries and initialDelay parameters to locate where to invoke it.
| // RetryWithBackoff retries an operation with exponential backoff | ||
| // This helps handle transient failures that may occur under resource constraints | ||
| func RetryWithBackoff(operation func() error, maxRetries int, initialDelay time.Duration) error { | ||
| var lastErr error | ||
| delay := initialDelay | ||
|
|
||
| for attempt := 0; attempt < maxRetries; attempt++ { | ||
| lastErr = operation() | ||
| if lastErr == nil { | ||
| return nil | ||
| } | ||
|
|
||
| // Don't sleep after the last attempt | ||
| if attempt < maxRetries-1 { | ||
| time.Sleep(delay) | ||
| // Exponential backoff: double the delay for next attempt | ||
| delay *= 2 | ||
| } | ||
| } | ||
|
|
||
| return fmt.Errorf("operation failed after %d attempts: %w", maxRetries, lastErr) | ||
| } |
There was a problem hiding this comment.
Add tests for RetryWithBackoff to meet the 80% coverage requirement.
This helper adds non‑trivial retry behavior but has no dedicated coverage (e.g., success-after-N attempts and failure-after-max). Please add unit/integration tests to keep the required threshold.
As per coding guidelines, "Add unit and integration tests for each new feature or bug fix to achieve combined coverage of at least 80%".
🤖 Prompt for AI Agents
In `@tests/integration/flow/common/utils.go` around lines 503 - 524, Add unit
tests exercising RetryWithBackoff: one test where the supplied operation fails a
fixed number of times then succeeds (verify it returns nil and is called
expected attempts), and one where it always fails (verify the returned error
wraps the final operation error and indicates maxRetries). Use small
initialDelay (e.g., 1ms) to keep tests fast, drive the operation with a
counter/closure to control when it succeeds, and assert call counts and error
wrapping; reference the RetryWithBackoff function, its maxRetries and
initialDelay parameters to locate where to invoke it.
| flowStep, err = common.CompleteFlow(flowStep.FlowID, inputs, "action_001") | ||
| // Retry flow completion with backoff to handle resource constraints | ||
| // Only retry if flow doesn't complete (INCOMPLETE status expected), not on ERROR | ||
| err = common.RetryWithBackoff(func() error { |
There was a problem hiding this comment.
I feel like this is not a good approach to fix this issue. With this, the test tries to retry the failed operation which may/ may not result in a successful status.
Even if it gets successful eventually, IMO this could hide an actual issue with the implementation as well.
There was a problem hiding this comment.
Aren't you able to flag an issue with the dummy http server implementation?
Purpose
This pull request improves the reliability and robustness of integration tests for SMS and organizational unit (OU) registration flows, especially in resource-constrained environments. The main changes include increasing timeouts for waiting on asynchronous operations and introducing a generic retry mechanism with exponential backoff to handle transient failures during test execution.
Test reliability improvements:
Increased timeouts (from 500ms or 1s to 2s) when waiting for OTP/SMS messages to be sent in several test cases in
sms_registration_test.goandou_registration_test.go, reducing flakiness in slower environments. [1] [2] [3] [4]Updated the helper function
generateUniqueMobileNumberto produce more unique numbers by using the fullUnixNanovalue, further reducing the chance of collisions in parallel test runs.Resilience to transient failures:
Added a new
RetryWithBackoffutility inutils.goto retry operations with exponential backoff, improving test resilience to transient resource-related errors.Refactored relevant test cases in
ou_registration_test.goandsms_registration_test.goto useRetryWithBackoffwhen completing flows, ensuring tests only fail on permanent errors and not on temporary issues. [1] [2]Related Issues
Related PRs
Checklist
breaking changelabel added.Security checks
Summary by CodeRabbit