Conversation
Now that DAP-16 allows for job reacquisition / idempotency at the HTTP layer, a job in AwaitingRequest has to be continuable directly, without error; e.g., the database needs to include it in `acquire_incomplete_aggregation_jobs`. While I was at it, I wrote some more documentation on the state transitions in the `AggregationJobState` database model.

These changes unblock further job state updates for the Leader, but do not go so far as to remove the no-longer-relevant AwaitingRequest state; that removal will be done separately in #4305.

Resolves #4322
aggregator_core/src/datastore.rs
Outdated
| -- DAP (§4.6.3 [dap-16]) describes the continuation phase where the Leader sends
| -- AggregationJobContinueReq messages to advance preparation. Helper jobs in
| -- AWAITING_REQUEST state represent work that has processed one round but needs
| -- additional Leader requests to complete. These jobs must be acquirable so the
| -- Helper can respond to incoming continuation requests.
| WHERE aggregation_jobs.state IN ('ACTIVE', 'AWAITING_REQUEST')
I don't think this change makes sense for helper tasks. The acquire_incomplete_aggregation_jobs() method is only used in the AggregationJobDriver, when fetching leases on aggregation jobs to process. When it steps an aggregation job, it will eventually set the aggregation job state to AwaitingRequest, in either step_aggregation_job_helper_init() or step_aggregation_job_helper_continue(). I'm not sure what these routines would do if they run on the same set of report aggregations, but assuming they don't return an error and eventually mark the job as abandoned, the same aggregation jobs would still be eligible to be picked up by this query again, which would impair the liveness of aggregation jobs in the same task.
This method does not directly affect how the HTTP route handler works, so we shouldn't need to change it at all for request idempotency reasons. Rather, that code path would fetch an individual aggregation job by its identifiers.
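To make the contrast concrete, here is a rough sketch of the two access patterns; the table and column names are illustrative, not necessarily the exact Janus schema:

```sql
-- Driver path (acquire_incomplete_aggregation_jobs, roughly): lease a batch of jobs
-- that still need the driver's attention, based on their state and lease expiry.
SELECT aggregation_job_id
FROM aggregation_jobs
WHERE state = 'ACTIVE'
  AND lease_expiry <= NOW()
LIMIT $1;

-- HTTP handler path: look up the single job named by the request, whatever its state.
SELECT state
FROM aggregation_jobs
JOIN tasks ON tasks.id = aggregation_jobs.task_id
WHERE tasks.task_id = $1
  AND aggregation_jobs.aggregation_job_id = $2;
```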
| /// phase (§4.6.3 [dap-16]). The Helper has sent an AggregationJobResp but some reports
| /// remain in the Continued state awaiting the next AggregationJobContinueReq.
| ///
| /// Note: This state will be removed as part of the DAP-16 state model redesign (#4305).
I think we will need two states like Active and AwaitingRequest for at least some of our modes of operation, particularly when operating as the helper and doing asynchronous processing. We need to hop back and forth between them as we wait to do computationally expensive work in the aggregation job driver or wait to get polled after finishing that work, until we finish all VDAF rounds. Note that we also need to do this when handling the aggregation job initialization request, not just the continuation request. The aggregation job driver needs some way of efficiently finding jobs that are ready for it to process, and we currently achieve that with the aggregation_jobs_state_and_lease_expiry partial index.
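For context, that partial index makes the driver's scan cheap by only indexing rows in acquirable states. The sketch below is illustrative; the exact column list and predicate in Janus's schema may differ:

```sql
-- Illustrative sketch of a partial index like aggregation_jobs_state_and_lease_expiry:
-- only rows the driver may acquire are indexed, so terminal-state jobs never bloat it
-- and the acquire query can find ready work without a full table scan.
CREATE INDEX aggregation_jobs_state_and_lease_expiry
    ON aggregation_jobs (state, lease_expiry)
    WHERE state = 'ACTIVE';
```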
@divergentdave has corrected my interpretation of the state machine.
| /// Job is being actively processed. This is the initial state for both Leader and Helper
| /// aggregation jobs. Corresponds to the initialization phase in DAP (§4.6.2 [dap-16]).
This isn't always the initial state. That only holds for Leader aggregation jobs and for Helper aggregation jobs when using the asynchronous aggregation mode. For Helper aggregation jobs when using the synchronous aggregation mode, the aggregation job starts in either AwaitingRequest or Finished (depending on the number of rounds). What this state really indicates is that the job is ready for the aggregation job driver to pick up.
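A rough sketch of the initial-state choice being described, written as it might appear when inserting a new Helper aggregation job; the parameter and enum value names are assumptions, not the exact Janus schema:

```sql
-- Illustrative only: pick the initial state for a newly written Helper aggregation job.
-- $3 is the task's aggregation mode, $4 is the number of VDAF rounds remaining after
-- handling the initialization request.
INSERT INTO aggregation_jobs (task_id, aggregation_job_id, state)
VALUES (
    $1, $2,
    CASE
        WHEN $3 = 'asynchronous' THEN 'ACTIVE'           -- driver still has work to do
        WHEN $4 > 0              THEN 'AWAITING_REQUEST' -- sync mode, waiting on the Leader
        ELSE                          'FINISHED'         -- sync mode, single-round VDAF done
    END
);
```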
tgeoghegan left a comment:
I see how this PR adds aggregation job state transitions in step_aggregation_job_helper_init and step_aggregation_job_helper_continue. But I don't see the corresponding change to make the aggregation_job_writer stop evaluating those state changes. Should WriteState::update_aggregation_job_state_from_report_aggregations be changed?
This has been trimmed down to just the documentation updates.
tgeoghegan left a comment:
Some wording nits to ponder
| /// corresponds to the AGGREGATION_JOB_STATE enum in the schema.
| ///
| /// These are implementation-specific states used for Janus's internal state management.
| /// DAP §4.6 [dap-16] defines aggregation job completion in terms of individual report
nit: could make this a link to draft 16 in the Datatracker
Could, but that's verbose, and mostly I'm trying to add easy-to-grep tags for our future use rather than to live in a hypertext utopia.
| #[derive(Copy, Clone, Debug, Hash, PartialEq, Eq, ToSql, FromSql)]
| #[postgres(name = "aggregation_job_state")]
| pub enum AggregationJobState {
| /// Job is ready for the aggregation job driver to pick up. Corresponds to the
nit: "pick up" is perhaps ambiguous. What we mean is that the job is ready for the aggregation job driver to run, or to drive, right?
Had to use drive since you basically gave me permission. :)
| #[postgres(name = "AWAITING_REQUEST")]
| AwaitingRequest,
| /// All report aggregations have reached a terminal state (Finished or Failed), completing
| /// the aggregation job lifecycle (§4.6 [dap-16]). Output shares are committed to batch
Suggested change:
| /// the aggregation job lifecycle (§4.6 [dap-16]). Output shares are committed to batch
| /// the aggregation job lifecycle (§4.6 [dap-16]). Output shares have been committed to batch
I think the tense matters: if we see an aggregation job in state Finished, output shares from its constituent report aggregations have been computed and committed at some point before the Finished job was observed, which has implications for handling of subsequent aggregation jobs. The tense "are committed" leaves it unclear when the commitment happens (perhaps the entity observing the Finished state is expected to do so?).
| /// Job has been marked for deletion and should not be processed further. This is a terminal
| /// state used during cleanup.
By cleanup, do we mean garbage collection? Or is this also possible if someone sends DELETE /tasks/{task-id}/aggregation_jobs/{agg-job-id}? That's an honest question, I can't remember. Anyway, if it's just those two, we could afford a few more words here explaining how a job enters this state.
/// Job has been marked for deletion, either by garbage collection or by using the HTTP
/// DELETE endpoint, and should not be processed further. This is a terminal state used
/// during cleanup.
I wrote some more documentation on the state transitions in the `AggregationJobState` database model.

Resolves #4322