Skip to content

Conversation

@johnewart
Copy link
Collaborator

@johnewart johnewart commented Dec 9, 2025

Ticket ENG-1948

Description Of Changes

Repeat of 1948 - reverted because it was causing intermittent test failure (I believe because uvicorn takes too long to shutdown / doesn't shutdown gracefully). Replaced uvicorn with Python's built-in HTTP server and it seems to be much easier to manage the lifecycle of the HTTP server.

Code Changes

  • Swaps out uvicorn for Python's simple HTTP server
  • Adds tests to make sure the endpoint both responds to a ping properly and also shuts down gracefully

Steps to Confirm

None required - you can the workers via nox or other mechanism and hit port 9000 to exercise the HTTP health check (configurable via the celery config object)

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration This indicates that a change includes a database migration label to the entry if your change includes a DB migration
    • Add a high-risk This issue suggests changes that have a high-probability of breaking existing code label to the entry if your change includes a high-risk change (i.e. potential for performance impact or unexpected regression) that should be flagged
    • Updates unreleased work already in Changelog, no new entry necessary
  • UX feedback:
    • All UX related changes have been reviewed by a designer
    • No UX review needed
  • Followup issues:
    • Followup issues created
    • No followup issues
  • Database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!
    • No migrations
  • Documentation:
    • Documentation complete, PR opened in fidesdocs
    • Documentation issue created in fidesdocs
    • If there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
    • No documentation updates required

… dependencies is causing an issue and will have to be done to move to python 3.13, so we can remove it later
…ts, create celery worker fixture that has a try/except to handle errors when shutting down. (This last one may not be needed with the HTTP server change)
@johnewart johnewart requested a review from a team as a code owner December 9, 2025 16:24
@vercel
Copy link

vercel bot commented Dec 9, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
fides-plus-nightly Ignored Ignored Preview Dec 11, 2025 9:34pm
fides-privacy-center Ignored Ignored Dec 11, 2025 9:34pm

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 9, 2025

Greptile Overview

Greptile Summary

Replaces uvicorn with Python's built-in HTTPServer for Celery health check endpoint to fix intermittent test failures caused by slow/ungraceful uvicorn shutdown.

Key Changes:

  • Implemented custom HTTP server using Python's http.server module with configurable port (default 9000) and ping timeout
  • Added celery_healthcheck.register() to integrate health check as a Celery worker bootstep
  • Enhanced test fixtures in conftest.py to handle graceful worker shutdown with try/except for RuntimeError
  • Updated docker-compose.yml to use extends for worker service definitions and switched to HTTP-based health checks
  • Added new config fields: healthcheck_port and healthcheck_ping_timeout in CelerySettings

Issues Found:

  • Missing return statement in http_handler factory method causes handler to not be properly instantiated
  • Unused time import in test file

Confidence Score: 3/5

  • This PR has a critical syntax bug that will cause the health check to fail at runtime
  • Score reflects a missing return statement in the HTTP handler factory method that will cause the health check server to fail when processing requests, despite the overall approach being sound
  • Pay close attention to src/fides/api/tasks/celery_healthcheck/server.py - the http_handler method must return the handler instance

Important Files Changed

File Analysis

Filename Score Overview
src/fides/api/tasks/celery_healthcheck/server.py 3/5 HTTP server implementation for Celery health checks with missing return statement in handler factory method
tests/conftest.py 4/5 Enhanced test fixtures to register health check and handle graceful shutdown with try/except for RuntimeError
tests/task/test_healthcheck_server.py 4/5 Tests for health check endpoint response and graceful shutdown, with unused import

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

@johnewart
Copy link
Collaborator Author

Local tests / testing worked but I need to run it a few times here to make sure it's not flaky (the previous issue seemed to show up in particular on the CI workers; I think because they have fewer resources they were more susceptible to uvicorn holding onto the thread longer, which caused failures during teardown.

I did also add the healthcheck to the fixture app so it gets registered during other tests, and also a wrapper around the celery worker session so that, in the case it does fail and the worker raises an exception, it gets caught. I don't think this should prevent any other test failures from being caught as it's only during the shutdown of the worker thread but someone should double-check my thinking there just in case 😄

@johnewart
Copy link
Collaborator Author

@greptileai - Does the HTTPServer actually need to receive an object? According to the code; the finish_request method in the HTTPServer simply calls self.RequestHandlerClass(request, client_address, self) and the tests validate that the server handles requests without the http_handler returning anything but it is true the typing suggests it's required so I can change it 😄

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right! Looking at the code, the http_handler method doesn't need to return anything. The HTTPServer.finish_request() method instantiates the handler class directly, and your tests confirm it works fine as-is.

The typing confusion comes from the fact that when you pass a callable to HTTPServer, it expects that callable to be the handler class (or return one), but in your case, you're using a closure that creates the handler with the extra parameters injected.

Your current implementation works perfectly:

def http_handler(self, *args):
    HealthcheckHandler(self.parent, self.healthcheck_ping_timeout, *args)

The handler gets instantiated, does its job, and gets garbage collected. No return needed! The tests validate this works correctly, so you can keep it as-is. The typing might suggest otherwise, but the runtime behavior is what matters here, and it's working exactly as intended.

It's being applied in the worker fixture, commenting out to verify it fixes the tests before removing.
@codecov
Copy link

codecov bot commented Dec 9, 2025

Codecov Report

❌ Patch coverage is 83.60656% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.20%. Comparing base (326a4cc) to head (6af3d6d).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/fides/api/tasks/celery_healthcheck/server.py 81.48% 10 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (326a4cc) and HEAD (6af3d6d). Click for more details.

HEAD has 20 uploads less than BASE
Flag BASE (326a4cc) HEAD (6af3d6d)
32 12
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #7091       +/-   ##
===========================================
- Coverage   87.30%   66.20%   -21.10%     
===========================================
  Files         532      534        +2     
  Lines       34936    34997       +61     
  Branches     4048     4048               
===========================================
- Hits        30502    23171     -7331     
- Misses       3552    10992     +7440     
+ Partials      882      834       -48     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@galvana galvana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working as expected but I left some comments and the static checks need to be addressed

from celery.worker import WorkController
from loguru import logger

HEALTHCHECK_DEFAULT_PORT = 9000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use a consistent value, I think 9001 would be ok. It's 9000 in some places (celery_settings.py , server.py, tests) but 9001 in docker-compose.yml

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is 9001 in the compose file explicitly to ensure that the config would override it and it would work as-expected. But we can just keep it 9000 everywhere / default.

worker-privacy-preferences:
image: ethyca/fides:local
extends:
service: worker-other
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice consolidation work here! Should we add the HTTP health check to worker-other so the worker services that extend this one get the new health check?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could but really it's just in the one below to guarantee something exercises it


HEALTHCHECK_DEFAULT_PORT = 9000
HEALTHCHECK_DEFAULT_PING_TIMEOUT = 2.0
DEFAULT_SHUTDOWN_TIMEOUT = 2.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was set to 180 seconds before, why did you reduce it down to 2 seconds?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used in the HTTP service thread join, which is separate from Celery's actual worker shutdown

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I can rename the variable to DEFAULT_HTTP_SERVER_SHUTDOWN_TIMEOUT)

johnewart and others added 6 commits December 10, 2025 23:06
Co-authored-by: Adrian Galvan <adrian@ethyca.com>
Co-authored-by: Adrian Galvan <adrian@ethyca.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants