-
Notifications
You must be signed in to change notification settings - Fork 41
Aggregation broadcast & unskip interop tests #385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
β¦, and attestation propagation
The aggregated attestation pipeline was broken at multiple points, preventing finalization in multi-node setups: - Add missing GossipAggregatedAttestationEvent to network events - Add AGGREGATED_ATTESTATION decoding and dispatch in event sources - Fix SyncService.publish_aggregated_attestation to use a callback instead of a missing method on NetworkRequester - Wire publish callback directly in Node.from_genesis - Publish aggregates from ChainService._initial_tick (was discarded) - Enable test_late_joiner_sync with is_aggregator=True on node 0
- Unpack (store, aggregates) tuple from on_tick and aggregate_committee_signatures in fork choice fill framework - Update attestation target selection tests for the +1 safe_target allowance introduced in get_attestation_target
- Remove @pytest.mark.skip from test_mesh_finalization and test_mesh_2_2_2_finalization - Update store attribute references: latest_new_attestations -> latest_new_aggregated_payloads, latest_known_attestations -> latest_known_aggregated_payloads - test_partition_recovery remains xfail (known sync service limitation)
- Fix test_store_attestations indentation: steps 2-3 were inside for loop, causing aggregation to clear signatures before second validator check - Fix store.py lint: remove blank line after docstring, wrap long line - Fix type errors: peer_id accepts None for self-produced blocks throughout sync pipeline (SyncService, HeadSync, BlockCache) - Fix formatting in service.py, node_runner.py, test_multi_node.py
Replace fixed 70s duration loops with convergence helpers: - assert_all_finalized_to: polls until finalization target reached - assert_heads_consistent: polls until head slots converge - assert_same_finalized_checkpoint: polls until nodes agree This fixes CI flakiness where slow machines cause nodes to diverge during the fixed wait period. Tests now exit early on success and tolerate slower environments via generous timeouts.
- Increase gossipsub mesh stabilization from 5s to 10s in start_all (CI machines need more time for mesh formation before block production) - Increase finalization timeout from 100s to 150s - Increase peer connection timeout from 15s to 30s - Increase pytest timeout from 200s to 300s The CI failure showed all 3 nodes stuck at finalized slot 0, indicating gossip mesh wasn't fully formed when services started.
tcoratger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just left couple of comments, but looks good to me in general. I just want to make sure that both in the interop test and in the test vector with the slots, we are testing the same behaviour as before.
| @@ -100,108 +100,31 @@ async def test_mesh_finalization(node_cluster: NodeCluster) -> None: | |||
| # Each node needs at least 2 peers (the other two nodes). | |||
| # This ensures gossip will reach all nodes. | |||
| # The 15s timeout handles slow handshakes. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a 30 seconds timeout now no? And not a 15 seconds timeout anymore
| Returns: | ||
| New Store with updated latest_new_aggregated_payloads. | ||
| Tuple of (new Store with updated payloads, list of new SignedAggregatedAttestation). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to avoid variable names in the documentation because these names can change over time, rendering the documentation obsolete
| Tuple of (new Store with updated payloads, list of new SignedAggregatedAttestation). | |
| Tuple of (new store with updated payloads, list of new signed aggregated attestation). |
| Returns: | ||
| New Store with advanced time and interval-specific updates applied. | ||
| Tuple of (new Store with advanced time, list of new SignedAggregatedAttestation). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Tuple of (new Store with advanced time, list of new SignedAggregatedAttestation). | |
| Tuple of (new store with advanced time, list of new signed aggregated attestation). |
| Tuple of (new Store with time advanced, | ||
| list of all produced SignedAggregatedAttestation). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Tuple of (new Store with time advanced, | |
| list of all produced SignedAggregatedAttestation). | |
| Tuple of (new store with time advanced, | |
| list of all produced signed aggregated attestation). |
| # This ensures the target doesn't advance too far ahead of safe target, | ||
| # providing a balance between liveness and safety. | ||
| # | ||
| # MODIFIED: We allow the target to be up to 1 slot ahead of safe_target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # MODIFIED: We allow the target to be up to 1 slot ahead of safe_target | |
| # We allow the target to be up to 1 slot ahead of the safe target |
| # Validators only subscribe to the subnets they are assigned to. | ||
| # This matches the Ethereum gossip specification. | ||
| if validator_indices: | ||
| from lean_spec.subspecs.chain.config import ATTESTATION_COMMITTEE_COUNT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have the import top of the file instead?
| # Match Ream's 70 second test duration. | ||
| # Wait for finalization with convergence-based polling. | ||
| # | ||
| # Finalization requires sufficient time for: | ||
| # - Multiple slots to pass (4s each) | ||
| # - Attestations to accumulate | ||
| # - Justification and finalization to occur | ||
| run_duration = 70 | ||
| poll_interval = 5 | ||
|
|
||
| logger.info("Running chain for %d seconds (mesh_2_2_2)...", run_duration) | ||
|
|
||
| # Poll chain progress. | ||
| start = time.monotonic() | ||
| while time.monotonic() - start < run_duration: | ||
| slots = [node.head_slot for node in node_cluster.nodes] | ||
| finalized = [node.finalized_slot for node in node_cluster.nodes] | ||
| logger.info("Progress: head_slots=%s finalized=%s", slots, finalized) | ||
| await asyncio.sleep(poll_interval) | ||
| # Hub-and-spoke adds latency (messages route through hub) | ||
| # but the protocol should still achieve finalization. | ||
| await assert_all_finalized_to(node_cluster, target_slot=1, timeout=150) | ||
|
|
||
| # Final state capture. | ||
| head_slots = [node.head_slot for node in node_cluster.nodes] | ||
| finalized_slots = [node.finalized_slot for node in node_cluster.nodes] | ||
|
|
||
| logger.info("FINAL: head_slots=%s finalized=%s", head_slots, finalized_slots) | ||
|
|
||
| # Same assertions as full mesh. | ||
| # | ||
| # Despite reduced connectivity (messages route through hub), | ||
| # the protocol should still achieve full consensus. | ||
|
|
||
| # Chain must advance sufficiently. | ||
| assert all(slot >= 5 for slot in head_slots), ( | ||
| f"Chain did not advance enough. Head slots: {head_slots}" | ||
| ) | ||
|
|
||
| # Heads must be consistent across nodes. | ||
| # Verify heads converged across nodes. | ||
| # | ||
| # Hub-and-spoke adds latency but should not cause divergence. | ||
| head_diff = max(head_slots) - min(head_slots) | ||
| assert head_diff <= 2, f"Head slots diverged too much. Slots: {head_slots}, diff: {head_diff}" | ||
|
|
||
| # ALL nodes must finalize. | ||
| assert all(slot > 0 for slot in finalized_slots), ( | ||
| f"Not all nodes finalized. Finalized slots: {finalized_slots}" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the documentation, I guess you removed all this due to CI limitation?
| Initially, the safe target is at genesis (slot 0). The attestation target | ||
| is allowed up to 1 slot ahead of safe target to ensure the chain can | ||
| start advancing from genesis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this new condition, are we sure that all the flow that we are testing is the same as before?
ποΈ Description
π Related Issues or PRs
β Checklist
toxchecks to avoid unnecessary CI fails:uvx tox