Skip to content

Conversation

@youngkidwarrior
Copy link

Why:
Implements the External Repair API (Phase 4 of SEN-68) to allow manual
re-verification and re-indexing of block ranges. This provides an
operator "escape hatch" for correcting data inconsistencies that automatic
mechanisms might miss or fail to recover.

Includes:

  • RepairService with advisory locking and async execution
  • HTTP API endpoints: POST /api/v1/repair, GET /api/v1/repair/{id}, GET /api/v1/repairs
  • Prometheus metrics for repair requests and reprocessed blocks
  • Integration with Manager for Task deletion and recovery

Test plan:

  • Run unit tests: go test -v ./shovel -run TestHandleRepairRequest
  • Verify repair job creation and status tracking via API
  • Verify metrics increment on repair actions

Copy link
Author

youngkidwarrior commented Nov 25, 2025

@youngkidwarrior youngkidwarrior changed the base branch from add_confirmation_audit_loop to graphite-base/6 November 28, 2025 01:17
@youngkidwarrior youngkidwarrior force-pushed the feat_add_external_repair_api branch from a0f6009 to 9d61aac Compare December 12, 2025 04:05
@youngkidwarrior youngkidwarrior changed the base branch from graphite-base/6 to add_confirmation_audit_loop December 12, 2025 04:05
@youngkidwarrior youngkidwarrior force-pushed the feat_add_external_repair_api branch from 9d61aac to 71e3120 Compare December 14, 2025 08:17
@youngkidwarrior youngkidwarrior force-pushed the add_confirmation_audit_loop branch 2 times, most recently from 17d3865 to 210adfe Compare December 14, 2025 09:31
@youngkidwarrior youngkidwarrior force-pushed the feat_add_external_repair_api branch from 71e3120 to 18bce8f Compare December 14, 2025 09:31
@youngkidwarrior youngkidwarrior force-pushed the add_confirmation_audit_loop branch from 210adfe to b3a914c Compare December 19, 2025 01:23
@youngkidwarrior youngkidwarrior force-pushed the feat_add_external_repair_api branch from 18bce8f to dd564b5 Compare December 19, 2025 01:23
@0xBigBoss
Copy link
Member

Review: PR #6 - External Repair API (Phase 4)

Verdict: ✅ Approve with minor suggestions

This provides the operator "escape hatch" for manual recovery when automatic mechanisms fail.

Strengths

  1. Complete REST API - POST create, GET status, GET list with filtering
  2. Good validation - Source/integration existence checks, range limits (10k blocks max), start <= end
  3. Async execution - Background goroutine with UUID tracking
  4. Audit trail - shovel.repair_audit table logs all requests with requester IP
  5. Dry-run mode - Count affected rows without deletion
  6. Metrics - shovel_repair_requests_total, shovel_repair_blocks_reprocessed_total

API Design

POST /api/v1/repair
{
  "source": "mainnet",
  "integration": "seaport",
  "start_block": 18000000,
  "end_block": 18000100,
  "force_all_providers": true,
  "dry_run": false
}

Clean, RESTful, well-documented.

Issues Found

  1. force_all_providers not used - The flag is parsed but executeRepair() doesn't check it. This should disable consensus and rotate through all URLs.

  2. Advisory lock scope - Uses wpg.LockHash() for per-repair locking, but should also prevent overlapping repairs on the same block range.

  3. Context propagation - executeRepair uses context.Background() - consider passing a cancellable context for graceful shutdown.

Suggestions

  1. Add rate limiting to prevent API abuse
  2. Consider adding a webhook callback option for completion notification
  3. Add support for repair queue inspection

Main Routes

mux.Handle("/api/v1/repair", wh.Authn(wh.HandleRepairRequest))
mux.Handle("/api/v1/repair/", wh.Authn(wh.HandleRepairStatus))
mux.Handle("/api/v1/repairs", wh.Authn(wh.HandleListRepairs))

Properly behind authentication.

Good Phase 4 implementation. Minor fix needed for force_all_providers.

@0xBigBoss
Copy link
Member

Review findings:

  • force_all_providers is parsed but never used; repair path does not adjust provider set or consensus behavior. shovel/repair.go.
  • executeRepair only deletes; it never reindexes or updates blocks_reprocessed, so status reports are misleading and the “repair” doesn’t actually repair data. shovel/repair.go:337-351.
  • Repair deletes run without taking the task’s advisory lock (task.lockid), so it can race with live indexing. Consider reusing the same lock pattern used in Task.Converge().
  • Overlapping repair rejection (HTTP 409) is specified in docs but not implemented.

Suggest clarifying intended behavior (delete-only vs delete+reindex) and aligning code + runbook/spec before merge.

@youngkidwarrior youngkidwarrior changed the base branch from add_confirmation_audit_loop to graphite-base/6 December 24, 2025 14:26
@youngkidwarrior youngkidwarrior marked this pull request as ready for review December 27, 2025 10:53
@youngkidwarrior youngkidwarrior force-pushed the feat_add_external_repair_api branch from dd564b5 to aef31d2 Compare December 27, 2025 10:58
@youngkidwarrior youngkidwarrior changed the base branch from graphite-base/6 to add_confirmation_audit_loop December 27, 2025 10:58
@youngkidwarrior youngkidwarrior force-pushed the feat_add_external_repair_api branch from aef31d2 to 65772b6 Compare December 27, 2025 15:34
@youngkidwarrior youngkidwarrior force-pushed the add_confirmation_audit_loop branch from 4d3526b to 08230d4 Compare December 27, 2025 15:34
Copy link
Author

youngkidwarrior commented Dec 30, 2025

Merge activity

  • Dec 30, 1:19 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Dec 30, 1:26 AM UTC: Graphite rebased this pull request as part of a merge.
  • Dec 30, 1:27 AM UTC: @youngkidwarrior merged this pull request with Graphite.

@youngkidwarrior youngkidwarrior changed the base branch from add_confirmation_audit_loop to graphite-base/6 December 30, 2025 01:24
@youngkidwarrior youngkidwarrior changed the base branch from graphite-base/6 to main December 30, 2025 01:25
Important improvements:
- Fix context cancellation in background goroutine with timeout
- Add table name validation for SQL injection hardening
- Fix block verification idempotency to preserve audit progress
- Add force_all_providers parameter test

Context management:
- Background repairs now derive context from HTTP request with generous timeout
- Prevents resource leaks when clients disconnect mid-repair
- Timeout scales with block range (1 hour per 1000 blocks, minimum 10 minutes)

SQL injection hardening:
- Validate table names against integration config before SQL interpolation
- Defense-in-depth approach protects against future refactoring risks

State machine idempotency:
- Block verification updates now preserve non-terminal audit states
- Only resets 'failed', 'receipt_unverified', 'retrying' states to 'pending'
- Preserves 'healthy', 'verifying', 'pending' states during repairs

Test improvements:
- Add unit test validating force_all_providers parameter acceptance
- Integration test in repair_integration_test.go covers full execution path
- All existing repair tests pass with race detector

Addresses important issues from professional code review.

Co-Authored-By: Warp <agent@warp.dev>
@youngkidwarrior youngkidwarrior force-pushed the feat_add_external_repair_api branch from 65772b6 to a81af2c Compare December 30, 2025 01:26
@youngkidwarrior youngkidwarrior merged commit 7bd0835 into main Dec 30, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants