-
Notifications
You must be signed in to change notification settings - Fork 0
feat: add observability & operator controls (Phase 5) #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
bde4384 to
8ea80e8
Compare
9d61aac to
71e3120
Compare
49bf5bd to
1e039e2
Compare
18bce8f to
dd564b5
Compare
1e039e2 to
89b8e6f
Compare
Review: PR #7 - Observability & Operator Controls (Phase 5)Verdict: 🔴 Request Changes - Build Broken This PR adds important observability features but introduces compilation errors that block the entire stack. Build Errors (BLOCKING)Fixes Required
What's Good
Suggested Fix// metrics.go - Remove lines 69-82 (duplicate declarations)
// metrics.go - Remove line 5 ("sync" import)
// audit.go - Move metrics init into the loop:
for _, sc := range conf.Sources {
// ... existing code ...
a.metrics[sc.Name] = NewMetrics(sc.Name) // Add here
}
// Remove the fallback at lines 65-70Please fix build errors before merge. The logic is correct, just cleanup needed. |
|
Review findings (blocking):
These need fixes before merge. |
89b8e6f to
a953f1e
Compare
dd564b5 to
65772b6
Compare
0xBigBoss
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Independent Code Review - Phase 5 (SEN-68 Log Completeness)
Findings (all fixed locally)
Critical: Data Race in task.go:755
Issue: Inside Task.load(), goroutines assigned to shared ctx:
eg.Go(func() error {
ctx = wctx.WithNumLimit(ctx, m, n) // Race: multiple goroutines write to shared ctxFix: Per-goroutine context:
eg.Go(func() error {
localCtx := wctx.WithNumLimit(ctx, m, n)Status: Fixed, go test -p 1 -race ./shovel passes.
Medium: Nil Manager Panic in repair.go:345
Issue: executeRepair() accessed rs.manager.tasks without nil check, causing panic in tests.
Fix: Added nil guard returning failed status.
Low: Bug Demonstration Test Failing
Issue: TestBugConditions_PartialLogs uses a single faulty provider that returns 2 of 4 logs. Test fails by design without consensus - it demonstrates the SEN-68 bug scenario, not the fix.
Fix: Added t.Skip() pointing to TestConsensus_PreventsMissingLogs which properly tests consensus.
Recommendation: Convert this test to verify consensus detects partial data (requires MAINNET_RPC_URL), rather than leaving it skipped.
Verification
| Check | Result |
|---|---|
go test -p 1 ./... |
PASS |
go test -p 1 -race ./shovel |
PASS |
go build ./cmd/shovel |
PASS |
| Hardcoded API keys | None (uses MAINNET_RPC_URL) |
Prior Review Findings (Verified)
| Finding | Status | Evidence |
|---|---|---|
| Escalation unanimity bug | ✅ Fixed | audit.go:300 uses allMatches >= threshold |
| Duplicate AuditAttempt call | ✅ Fixed | Single call at audit.go:308 |
| Queue count SQL | ✅ Fixed | Commit a953f1e |
| Hardcoded API key | ✅ Fixed | Tests skip without MAINNET_RPC_URL |
Action Items
- Commit the three fixes (data race, nil guard, test skip)
- Follow-up: Convert
TestBugConditions_PartialLogsto proper consensus verification test - Consider: Add
-raceto CI if not present
Fixes:
- Remove unused sync import from metrics.go
- Remove duplicate AuditVerifications and AuditQueueLength declarations (lines 69-82)
- Fix AuditVerifications labels from {src_name, status} to {src_name, ig_name} to match actual usage
- Move metrics initialization inside source loop to fix scope error where sc.Name was undefined
- Populate a.metrics map correctly for each enabled source
- Update AuditVerifications calls to use t.igName instead of hardcoded strings
Build now compiles successfully. All audit, metrics, and consensus tests pass.
Pre-existing test failures (TestIntegrations, TestBugConditions_PartialLogs, TestConverge_DeltaBatchSize) are unrelated to Phase 5.
Resolves build errors blocking Phase 5 deployment.
Co-Authored-By: Warp <agent@warp.dev>
65772b6 to
7bd0835
Compare
a953f1e to
d458e5e
Compare
- Remove duplicate AuditAttempt() at line 231 that was inflating metrics 2x - Fix countQ SQL to use MAX() correlated subquery instead of JOIN The JOIN overcounted when task_updates had multiple rows per (src_name, ig_name) Now matches the pattern used in the main audit query (lines 128-140) Co-Authored-By: Warp <agent@warp.dev>
d458e5e to
b27ddff
Compare

Why:
Implement Phase 5 of SEN-68: Observability & Operator Controls.
Adds metrics for audit loop and forced reindexes.
Adds Runbook for operators to tune redundancy and handle alerts.
Fixes bug where ConsensusDuration was only recorded once.
Includes:
shovel_audit_failures_total, shovel_audit_queue_length.
Test plan: