DAP-16 Job State Documentation Updates #4323

Open
jcjones wants to merge 6 commits into main from jcj/4322-dap16-helper-job-state-updates

@jcjones jcjones commented Feb 4, 2026

Now that DAP-16 allows for job reacquisition / idempotency on the HTTP layer, a job in
AwaitingRequest has to be continuable directly without error. E.g., the database needs
to include it in `acquire_incomplete_aggregation_jobs`.

While I was at it, I wrote some more documentation on the state transitions in the
`AggregationJobState` database model.

These changes unblock further job state updates for the leader, but do not go so far as to
remove the no-longer-relevant AwaitingRequest state, which will be done separately in #4305.

Resolves #4322
@jcjones jcjones marked this pull request as ready for review February 4, 2026 18:39
@jcjones jcjones requested a review from a team as a code owner February 4, 2026 18:39
Comment on lines 1997 to 2002
-- DAP (§4.6.3 [dap-16]) describes the continuation phase where the Leader sends
-- AggregationJobContinueReq messages to advance preparation. Helper jobs in
-- AWAITING_REQUEST state represent work that has processed one round but needs
-- additional Leader requests to complete. These jobs must be acquirable so the
-- Helper can respond to incoming continuation requests.
WHERE aggregation_jobs.state IN ('ACTIVE', 'AWAITING_REQUEST')
Collaborator

I don't think this change makes sense for helper tasks. The acquire_incomplete_aggregation_jobs() method is only used in the AggregationJobDriver, when fetching leases on aggregation jobs to process. When it steps an aggregation job, it will eventually set the aggregation job state to AwaitingRequest, in either step_aggregation_job_helper_init() or step_aggregation_job_helper_continue(). I'm not sure what these routines would do if they run on the same set of report aggregations, but assuming they don't return an error and eventually mark the job as abandoned, the same aggregation jobs would still be eligible to be picked up by this query again, which would impair the liveness of aggregation jobs in the same task.

This method does not directly affect how the HTTP route handler works, so we shouldn't need to change it at all for request idempotency reasons. Rather, that code path would fetch an individual aggregation job by its identifiers.
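The liveness concern above can be illustrated with a toy model. This is only a sketch under stated assumptions: `JobState`, `acquire`, `step_helper`, and `busy_passes` are hypothetical names, not Janus APIs, and the model assumes that stepping a Helper job always leaves it in `AwaitingRequest` rather than erroring or abandoning it.

```rust
/// Simplified, hypothetical stand-in for the job states discussed in this thread.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum JobState {
    Active,
    AwaitingRequest,
    Finished,
}

/// Hypothetical model of the driver's acquisition filter: returns the indices
/// of the jobs whose state appears in `acquirable`.
pub fn acquire(jobs: &[JobState], acquirable: &[JobState]) -> Vec<usize> {
    jobs.iter()
        .enumerate()
        .filter(|(_, state)| acquirable.contains(state))
        .map(|(index, _)| index)
        .collect()
}

/// Assumption for this sketch: once the driver has stepped a Helper job, the
/// job waits for the Leader's next request. Other states are left unchanged.
pub fn step_helper(state: JobState) -> JobState {
    match state {
        JobState::Active | JobState::AwaitingRequest => JobState::AwaitingRequest,
        other => other,
    }
}

/// Counts how many driver passes (up to `max_passes`) acquire at least one job.
pub fn busy_passes(mut jobs: Vec<JobState>, acquirable: &[JobState], max_passes: usize) -> usize {
    let mut busy = 0;
    for _ in 0..max_passes {
        let acquired = acquire(&jobs, acquirable);
        if acquired.is_empty() {
            break;
        }
        busy += 1;
        for index in acquired {
            jobs[index] = step_helper(jobs[index]);
        }
    }
    busy
}
```

In this model, filtering on `Active` alone lets the driver go idle after one pass, while also admitting `AwaitingRequest` makes the same job eligible on every pass, which is the repeated pickup the comment warns about.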

/// phase (§4.6.3 [dap-16]). The Helper has sent an AggregationJobResp but some reports
/// remain in the Continued state awaiting the next AggregationJobContinueReq.
///
/// Note: This state will be removed as part of the DAP-16 state model redesign (#4305).
Collaborator

I think we will need two states like Active and AwaitingRequest for at least some of our modes of operation, particularly when operating as the helper and doing asynchronous processing. We need to hop back and forth between them as we wait to do computationally expensive work in the aggregation job driver or wait to get polled after finishing that work, until we finish all VDAF rounds. Note that we also need to do this when handling the aggregation job initialization request, not just the continuation request. The aggregation job driver needs some way of efficiently finding jobs that are ready for it to process, and we currently achieve that with the aggregation_jobs_state_and_lease_expiry partial index.
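The back-and-forth described above can be sketched as a small transition function. Everything here is an illustrative assumption rather than Janus code: `HelperJobState`, `HelperEvent`, `next_state`, and `ready_for_driver` are hypothetical names, and the rule that a job becomes `Finished` once the driver completes the last round is a simplification.

```rust
/// Hypothetical, simplified Helper job states for asynchronous processing.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum HelperJobState {
    Active,
    AwaitingRequest,
    Finished,
}

/// Hypothetical events that move a Helper job between states.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum HelperEvent {
    /// The aggregation job driver finished the expensive work for one round.
    DriverCompletedRound { rounds_remaining: u32 },
    /// The Leader sent the next request for this job.
    LeaderRequestArrived,
}

/// Returns the next state, or `None` if the event does not apply in `state`.
pub fn next_state(state: HelperJobState, event: HelperEvent) -> Option<HelperJobState> {
    use HelperEvent::*;
    use HelperJobState::*;
    match (state, event) {
        // Assumption: no rounds left means the job is done.
        (Active, DriverCompletedRound { rounds_remaining: 0 }) => Some(Finished),
        // Otherwise the job waits to be polled or continued by the Leader.
        (Active, DriverCompletedRound { .. }) => Some(AwaitingRequest),
        // A new request hands the job back to the driver.
        (AwaitingRequest, LeaderRequestArrived) => Some(Active),
        _ => None,
    }
}

/// Mirrors the idea behind the partial index: only `Active` jobs are work for the driver.
pub fn ready_for_driver(state: HelperJobState) -> bool {
    state == HelperJobState::Active
}
```

Keeping the two states distinct is what lets a single equality check (or a partial index on the state column) find driver-ready jobs without scanning jobs that are merely waiting on the Leader.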

@divergentdave has corrected my interpretation of the state machine.
Comment on lines 512 to 513
/// Job is being actively processed. This is the initial state for both Leader and Helper
/// aggregation jobs. Corresponds to the initialization phase in DAP (§4.6.2 [dap-16]).
Collaborator

This isn't always the initial state. That only holds for Leader aggregation jobs and for Helper aggregation jobs when using the asynchronous aggregation mode. For Helper aggregation jobs when using the synchronous aggregation mode, the aggregation job starts in either AwaitingRequest or Finished (depending on the number of rounds). This state more so indicates that the job is ready for the aggregation job driver to pick up.
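The rule in this comment can be written down as a small function. This is a sketch, not the Janus implementation: `Role`, `HelperMode`, `InitialState`, and `initial_state` are hypothetical names, and the cutoff of one round for starting in `Finished` is an assumption, since the comment only says the outcome depends on the number of rounds.

```rust
/// Hypothetical Helper aggregation modes.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum HelperMode {
    Synchronous,
    Asynchronous,
}

/// Hypothetical aggregator role, carrying the mode when acting as Helper.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum Role {
    Leader,
    Helper(HelperMode),
}

/// The subset of job states a new aggregation job can start in.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum InitialState {
    Active,
    AwaitingRequest,
    Finished,
}

/// Picks the state a newly written aggregation job starts in.
pub fn initial_state(role: Role, vdaf_rounds: u32) -> InitialState {
    match role {
        // The driver has work to do right away.
        Role::Leader | Role::Helper(HelperMode::Asynchronous) => InitialState::Active,
        // Assumption: a one-round VDAF completes while handling the init request.
        Role::Helper(HelperMode::Synchronous) if vdaf_rounds <= 1 => InitialState::Finished,
        // More rounds remain, so the Helper waits for the Leader's next request.
        Role::Helper(HelperMode::Synchronous) => InitialState::AwaitingRequest,
    }
}
```

Under these assumptions, `initial_state(Role::Helper(HelperMode::Synchronous), 2)` yields `AwaitingRequest`, so a synchronous Helper job never enters `Active` at creation.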

@tgeoghegan tgeoghegan left a comment

I see how this PR adds aggregation job state transitions in step_aggregation_job_helper_init and step_aggregation_job_helper_continue. But I don't see the corresponding change to make the aggregation_job_writer stop evaluating those state changes. Should WriteState::update_aggregation_job_state_from_report_aggregations be changed?

@jcjones jcjones changed the title from "DAP-16 Helper Job State Updates" to "DAP-16 Job State Documentation Updates" on Feb 12, 2026

jcjones commented Feb 12, 2026

This has been trimmed down to just the documentation updates.

@jcjones jcjones linked an issue Feb 12, 2026 that may be closed by this pull request
@tgeoghegan tgeoghegan left a comment

Some wording nits to ponder

/// corresponds to the AGGREGATION_JOB_STATE enum in the schema.
///
/// These are implementation-specific states used for Janus's internal state management.
/// DAP §4.6 [dap-16] defines aggregation job completion in terms of individual report
Contributor

nit: could make this a link to draft 16 in the Datatracker

Contributor Author

Could, but that's verbose, and mostly I'm trying to add easy-to-grep tags for our future use rather than live in a hypertext utopia.

#[derive(Copy, Clone, Debug, Hash, PartialEq, Eq, ToSql, FromSql)]
#[postgres(name = "aggregation_job_state")]
pub enum AggregationJobState {
/// Job is ready for the aggregation job driver to pick up. Corresponds to the
Contributor

nit: "pick up" is perhaps ambiguous. What we mean is that the job is ready for the aggregation job driver to run, or to drive, right?

Contributor Author

Had to use drive since you basically gave me permission. :)

#[postgres(name = "AWAITING_REQUEST")]
AwaitingRequest,
/// All report aggregations have reached a terminal state (Finished or Failed), completing
/// the aggregation job lifecycle (§4.6 [dap-16]). Output shares are committed to batch
Contributor

Suggested change
- /// the aggregation job lifecycle (§4.6 [dap-16]). Output shares are committed to batch
+ /// the aggregation job lifecycle (§4.6 [dap-16]). Output shares have been committed to batch

I think the tense matters in that if we see an aggregation job in state Finished, output shares from its constituent report aggregations have been, at some point in time previous to when the Finished job is observed, computed and committed, which has implications for handling of subsequent aggregation jobs. The tense "are committed" leaves it unclear when the commitment happens (perhaps the entity observing the Finished state is expected to do so?)
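The condition the doc comment describes, every report aggregation being in a terminal state, could be checked roughly as follows. This is a minimal sketch: `ReportAggregationState`, `is_terminal`, and `job_is_finished` are hypothetical names with only the three report states named in this thread, and the sketch says nothing about when output shares are actually committed.

```rust
/// Hypothetical, simplified report aggregation states named in this thread.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum ReportAggregationState {
    /// Preparation is still in progress and needs another round.
    Continued,
    /// Preparation completed successfully.
    Finished,
    /// Preparation failed and the report is rejected.
    Failed,
}

impl ReportAggregationState {
    /// A terminal state is one that no further request can change.
    pub fn is_terminal(self) -> bool {
        matches!(self, Self::Finished | Self::Failed)
    }
}

/// True when every report aggregation is terminal. An empty slice counts as
/// finished here, which is a choice made for this sketch only.
pub fn job_is_finished(report_aggregations: &[ReportAggregationState]) -> bool {
    report_aggregations.iter().all(|state| state.is_terminal())
}
```

A single `Continued` report is enough to keep the job out of `Finished` in this model, while a mix of `Finished` and `Failed` reports still counts as a completed job.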

Comment on lines 534 to 535
/// Job has been marked for deletion and should not be processed further. This is a terminal
/// state used during cleanup.
Contributor

By cleanup, do we mean garbage collection? Or is this also possible if someone sends DELETE /tasks/{task-id}/aggregation_jobs/{agg-job-id}? That's an honest question, I can't remember. Anyway, if it's just those two, we could afford a few more words here explaining how a job enters this state.

Contributor Author

    /// Job has been marked for deletion, either by garbage collection or by using the HTTP
    /// DELETE endpoint, and should not be processed further. This is a terminal state used
    /// during cleanup.

@tgeoghegan tgeoghegan left a comment

🚢

Development

Successfully merging this pull request may close these issues.

DAP-16: Rework Helper States
DAP-16: Aggregation Job State Updates

3 participants