ncps does not handle NAR streaming in a multi-instance HA setup #660

@kalbasit

Problem statement

In a multi-instance HA setup with Redis locks:

  • Instance A starts downloading a large NAR file from upstream
  • Instance B receives a client request for the same NAR file
  • Instance B tries to acquire the Redis download lock and sees that the download is already in progress
  • Instance B cannot stream from Instance A's in-memory download state
  • Instance B must wait for Instance A to complete and store the file
  • For large files, this wait exceeds client timeouts (clients get an HTTP 200 response, then curl reports a timeout error)

Extracted from closed #618:

We're seeing this as well, and not even with heavy loads, sometimes just with a single build. It does not appear to be resolved in v0.7.3.

I've dropped our ncps deployment back to 1 instance to see if that helps. FYI, we're using:

  • PostgreSQL
  • NVMe-backed S3 storage via Ceph RGW
  • Redis locking via DragonflyDB

Originally posted by @dhess in #618

Metadata

  • Assignees: none
  • Labels: bug (Something isn't working)
