Race in add_job/add_jobs with keyed jobs can drop scheduling requests (and returns no row under contention) #580

@leo91000

Summary

I am the maintainer of graphile_worker_rs, a Rust rewrite of graphile/worker.

A user recently reported a bug, and I think this exposes a race condition in the keyed scheduling path (add_jobs / add_job).

Original report: leo91000/graphile_worker_rs#378

Steps to reproduce

  1. Start PostgreSQL and initialize Graphile Worker schema.
  2. Run a worker with concurrency 10 and two tasks:
       • printer: a very short task (e.g. sleeps ~2ms), so keyed jobs frequently become locked/running.
       • scheduler: loops many times (e.g. 100), each time calling addJob("printer", { key }, { jobKey: key, jobKeyMode: "preserve_run_at" }) with key chosen from a small keyspace (e.g. 10 keys).
  3. Enqueue multiple scheduler jobs concurrently (e.g. 4).
  4. Let it run for ~30-60 seconds.

This creates high contention on the same jobKey while some conflicting rows are locked.
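
The scheduler loop from step 2 can be sketched roughly as follows. This is a minimal sketch, not the exact reproduction code: `AddJobFn` is a hypothetical stand-in typed after the `addJob(identifier, payload, spec)` call shown in the steps, and `pickKey`, `makeSchedulerTask`, `iterations`, and `keyspaceSize` are names introduced here for illustration. Wiring it into a real worker (concurrency 10, plus the ~2ms `printer` task) is left out.

```typescript
// Hypothetical stand-in for the addJob helper; its shape mirrors the
// addJob("printer", { key }, { jobKey, jobKeyMode }) call in the steps.
type JobKeyMode = "replace" | "preserve_run_at" | "unsafe_dedupe";

interface KeyedJobSpec {
  jobKey: string;
  jobKeyMode: JobKeyMode;
}

type AddJobFn = (
  identifier: string,
  payload: { key: string },
  spec: KeyedJobSpec,
) => Promise<unknown>;

// Pick a key from a deliberately small keyspace so that concurrent
// schedulers collide on the same jobKey.
function pickKey(keyspaceSize: number, random: () => number = Math.random): string {
  const index = Math.floor(random() * keyspaceSize);
  return `key-${index}`;
}

// Build the "scheduler" task body: loop `iterations` times, each time
// enqueueing a keyed "printer" job in preserve_run_at mode.
// Resolves to the number of calls that came back without a job row.
function makeSchedulerTask(iterations = 100, keyspaceSize = 10) {
  return async (addJob: AddJobFn): Promise<number> => {
    let missing = 0;
    for (let i = 0; i < iterations; i++) {
      const key = pickKey(keyspaceSize);
      const job = await addJob(
        "printer",
        { key },
        { jobKey: key, jobKeyMode: "preserve_run_at" },
      );
      if (job == null) missing++; // the "no row" symptom
    }
    return missing;
  };
}
```

Running several of these task bodies concurrently against the same small keyspace is what produces the conflicts on locked rows.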

Expected results

  • addJob(...) should always return a valid job row.
  • For replace / preserve_run_at, scheduling should not occasionally return “no row” under contention.

Actual results

Under contention, graphile_worker.add_jobs(...) can return no row for a spec because of:

  • ON CONFLICT (key) DO UPDATE ... WHERE jobs.locked_at IS NULL

When that WHERE condition is false, the conflict path does nothing and returns nothing for that spec.
Then add_job(...) (which selects from add_jobs(...) LIMIT 1) can return a null/empty row.
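
To make the failure mode concrete, here is a toy in-memory model of the conflict path. It is not the real SQL and does not model concurrency; it only mirrors the documented PostgreSQL behavior that `ON CONFLICT ... DO UPDATE ... WHERE <cond>` neither updates nor returns the row when the condition is false. `JobsTable`, `upsert`, `lock`, and `addJob` are hypothetical names introduced for illustration.

```typescript
// Toy model of:
//   INSERT ... ON CONFLICT (key) DO UPDATE ... WHERE jobs.locked_at IS NULL RETURNING *
interface JobRow {
  id: number;
  key: string;
  payload: unknown;
  lockedAt: Date | null;
}

class JobsTable {
  private rows = new Map<string, JobRow>();
  private nextId = 1;

  // Returns what RETURNING would produce: exactly one row, or none.
  upsert(key: string, payload: unknown): JobRow[] {
    const existing = this.rows.get(key);
    if (existing === undefined) {
      // No conflict: plain insert.
      const row: JobRow = { id: this.nextId++, key, payload, lockedAt: null };
      this.rows.set(key, row);
      return [row];
    }
    if (existing.lockedAt === null) {
      // Conflict and the WHERE condition holds: update and return the row.
      existing.payload = payload;
      return [existing];
    }
    // Conflict, but the row is locked: nothing is updated, nothing is returned.
    return [];
  }

  // Simulates a worker picking up the job.
  lock(key: string): void {
    const row = this.rows.get(key);
    if (row !== undefined) row.lockedAt = new Date();
  }
}

// Mirrors add_job taking the first row of add_jobs(...) LIMIT 1.
function addJob(table: JobsTable, key: string, payload: unknown): JobRow | null {
  return table.upsert(key, payload)[0] ?? null;
}
```

In this model, calling `addJob` for a key whose row has been locked yields `null`, which corresponds to the empty result described above.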

In strict clients this surfaces clearly (example from Rust/sqlx):

  • error occurred while decoding column "id": unexpected null; try decoding as an Option

In JS this can manifest as rows[0] missing from add_job(...) result in edge cases.
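
On the client side, the symptom can at least be turned into an explicit error rather than an undefined access. The guard below is a sketch under assumptions: `requireJobRow` and `AddJobRow` are hypothetical names, and it treats both a missing `rows[0]` and a row whose `id` is null as "no job row", since the exact shape a given driver reports may differ.

```typescript
// Minimal shape of a row returned by add_job(...); only `id` matters here.
interface AddJobRow {
  id: string | null;
}

// Return the first row if it is a real job row, otherwise throw a
// descriptive error instead of letting callers dereference undefined.
function requireJobRow<T extends AddJobRow>(rows: readonly T[]): T & { id: string } {
  const row = rows[0];
  if (row === undefined || row.id === null) {
    throw new Error(
      "add_job returned no job row (the conflicting keyed job may be locked)",
    );
  }
  return row as T & { id: string };
}
```

This does not fix the race; it only makes the dropped scheduling request visible to the caller.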

Additional context

  • Reproduced from issue https://github.com/leo91000/graphile_worker_rs/issues/378
  • Reproduced against the upstream SQL shape in sql/000018.sql:
       • add_job delegates to add_jobs (select * into v_job from ...add_jobs(...))
       • add_jobs uses ON CONFLICT (key) DO UPDATE ... WHERE jobs.locked_at is null
  • Current repo version checked: 0.17.0-rc.0 (from package.json)
  • PostgreSQL: reproduced on Docker Postgres (15/16 class behavior)
