
Conversation

@manav2401
Member

@manav2401 manav2401 commented Jan 22, 2026

Description

This PR introduces a new relay architecture for enabling preconfirmations and private transactions. The relay is intended to act as middleware between end users / wallets and block producers, routing transactions for specific purposes directly to block producers. More details below:

Newly introduced flags:

  • --relay.enable-preconfs: Enables relay to offer preconfirmations
  • --relay.enable-private-tx: Enables relay to route transactions privately
  • --relay-bp-rpc-endpoints: RPC endpoints of all active block producers for transaction submission
  • --accept-preconf-tx: Enables block producers to accept transactions for preconfirmation
  • --accept-private-tx: Enables block producers to accept private txs
  • --disable-tx-propagation: Disables transaction announcement / broadcast in public p2p network

General transaction flow:

  • The relay, based on its purpose (either preconfirmations or private transactions), accepts transactions via the usual eth_sendRawTransaction endpoint.
  • The relay ensures that it is connected to the RPC endpoints of all active block producers.
  • It submits the transactions directly to all block producers on their respective endpoints and responds based on the type of transaction (see the sketch after this list).
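
For illustration only, a minimal sketch of the fan-out step, assuming the relay keeps one rpc.Client per block producer; the helper name sendToAll and its signature are hypothetical and not part of this PR:

import (
    "context"
    "sync"

    "github.com/ethereum/go-ethereum/common"
    "github.com/ethereum/go-ethereum/common/hexutil"
    "github.com/ethereum/go-ethereum/rpc"
)

// sendToAll submits an encoded transaction to every block producer in
// parallel via the given RPC method and collects the per-endpoint errors.
func sendToAll(ctx context.Context, clients []*rpc.Client, method string, rawTx hexutil.Bytes) []error {
    errs := make([]error, len(clients))
    var wg sync.WaitGroup
    for i, client := range clients {
        wg.Add(1)
        go func(i int, c *rpc.Client) {
            defer wg.Done()
            var hash common.Hash
            errs[i] = c.CallContext(ctx, &hash, method, rawTx)
        }(i, client)
    }
    wg.Wait()
    return errs
}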

Preconf flow:

  • The relay calls a newly added endpoint named eth_sendTransactionForPreconf on all block producers.
  • The block producers immediately respond with the tx inclusion status via a response of the form defined below
    • { "hash": "0x...00", "preconfirmed": true }
  • The relay stores the result in a cache.
  • Users can check the status on the relay via a new endpoint called eth_checkPreconfStatus (sketched after this list)
    • It either returns a cached result
    • Or queries all block producers for the tx status via txpool_txStatus
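
A rough sketch of the relay-side status lookup described above; the struct and field names are hypothetical, and the return value of txpool_txStatus is assumed here to be a plain status string:

import (
    "context"
    "sync"

    "github.com/ethereum/go-ethereum/common"
    "github.com/ethereum/go-ethereum/rpc"
)

type preconfCache struct {
    mu      sync.RWMutex
    results map[common.Hash]bool // tx hash -> preconfirmed, populated on submission
    clients []*rpc.Client        // one client per block producer
}

// checkPreconfStatus returns the cached preconfirmation result if present,
// otherwise it asks every block producer for the transaction's pool status.
func (c *preconfCache) checkPreconfStatus(ctx context.Context, hash common.Hash) bool {
    c.mu.RLock()
    preconfirmed, ok := c.results[hash]
    c.mu.RUnlock()
    if ok {
        return preconfirmed
    }
    for _, client := range c.clients {
        var status string
        if err := client.CallContext(ctx, &status, "txpool_txStatus", hash); err != nil {
            continue // skip unreachable block producers
        }
        if status == "pending" {
            return true
        }
    }
    return false
}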

Private tx flow:

  • The relay calls a newly added endpoint named eth_sendTransactionPrivate on all block producers.
  • In case of an error, the tx submission is retried in the background.
  • The relay needs to run with the --disable-tx-propagation flag, which ensures that no transactions are gossiped on the public p2p network.
  • The block producers cache these transactions to ensure that they are excluded from the usual p2p broadcast / announcement (see the sketch after this list).
  • This ensures that transactions aren't publicly visible to any node except the relay and the block producers, which are trusted entities.
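
A minimal sketch of the exclusion step on the block producer side; filterPublic and the isTxPrivate lookup are hypothetical stand-ins for the private-tx cache introduced in this PR:

import (
    "github.com/ethereum/go-ethereum/common"
    "github.com/ethereum/go-ethereum/core/types"
)

// filterPublic drops privately-submitted transactions before the usual
// broadcast / announcement step so they are never gossiped on the p2p network.
func filterPublic(txs []*types.Transaction, isTxPrivate func(common.Hash) bool) []*types.Transaction {
    public := make([]*types.Transaction, 0, len(txs))
    for _, tx := range txs {
        if isTxPrivate(tx.Hash()) {
            continue
        }
        public = append(public, tx)
    }
    return public
}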

General nuances of the feature:

  • To avoid load on block producers, preconf requests are queued for processing. At any given time, at most 1024 tasks are processed concurrently.
  • A multiclient instance manages the RPC connections to all block producers. It exposes methods that return a unified response by querying all block producers. A future improvement would be to prioritize the current block producer (determined from the span) rather than querying all of them.

TODOs:

  • Add persistence for private transactions, as they are lost if not stored.
  • Add unit and integration tests.

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

  • I have added at least 2 reviewers or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC method changes, and the Notion documentation has been updated

Cross repository changes

  • This PR requires changes to heimdall
    • In case link the PR here:
  • This PR requires changes to matic-cli
    • In case link the PR here:

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests in express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

@codecov

codecov bot commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 71.65775% with 212 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.65%. Comparing base (e61aaf9) to head (f153d32).

Files with missing lines Patch % Lines
eth/relay/relay.go 0.00% 85 Missing ⚠️
internal/ethapi/api.go 0.00% 64 Missing and 2 partials ⚠️
eth/api_backend.go 0.00% 20 Missing ⚠️
eth/relay/multiclient.go 92.92% 11 Missing and 4 partials ⚠️
eth/relay/private_tx_store.go 82.71% 12 Missing and 2 partials ⚠️
eth/relay/service.go 92.94% 8 Missing and 3 partials ⚠️
eth/backend.go 93.33% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (71.65%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files


@@             Coverage Diff             @@
##           develop    #2004      +/-   ##
===========================================
+ Coverage    49.58%   49.65%   +0.07%     
===========================================
  Files          873      877       +4     
  Lines       150507   151196     +689     
===========================================
+ Hits         74622    75081     +459     
- Misses       70850    71067     +217     
- Partials      5035     5048      +13     
Files with missing lines Coverage Δ
eth/ethconfig/config.go 37.50% <ø> (ø)
eth/handler.go 64.16% <100.00%> (+0.71%) ⬆️
eth/sync.go 44.56% <100.00%> (+0.30%) ⬆️
internal/cli/server/config.go 62.31% <100.00%> (+0.62%) ⬆️
internal/cli/server/flags.go 100.00% <100.00%> (ø)
internal/ethapi/backend.go 0.00% <ø> (ø)
p2p/config.go 0.00% <ø> (ø)
eth/backend.go 52.45% <93.33%> (-0.05%) ⬇️
eth/relay/service.go 92.94% <92.94%> (ø)
eth/relay/private_tx_store.go 82.71% <82.71%> (ø)
... and 4 more

... and 18 files with indirect coverage changes



Contributor

@cffls cffls left a comment


A question about a hypothetical scenario - if a node receives a transaction but is killed before it broadcasts the txn, that txn will be lost. It seems like a hard edge case where we need to persist the txn data on disk?

@claude

claude bot commented Jan 27, 2026

Code Review

Found 3 issues during code review.

Issues

  1. Resource leak in eth/relay/multiclient.go:52-55 - When rpc.Dial() succeeds but the eth_blockNumber call fails, the RPC client is not closed before continuing to the next URL, causing a connection leak (a possible fix is sketched after this list).

  2. CLAUDE.md violation in eth/relay/multiclient.go:58 - Error from hexutil.DecodeUint64 is ignored with _, violating the CLAUDE.md rule: "Handle Errors: Never ignore errors with _"

  3. Missing nil check in eth/relay/service.go:149 - CheckTxPreconfStatus calls s.multiclient.checkTxStatus(hash) without checking if s.multiclient is nil, inconsistent with the defensive programming pattern used in SubmitTransactionForPreconf and SubmitPrivateTx.
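
A possible fix for issue 1 could look like the following sketch; dialHealthy is a hypothetical helper, not the actual code in this PR:

import (
    "context"

    "github.com/ethereum/go-ethereum/common/hexutil"
    "github.com/ethereum/go-ethereum/rpc"
)

// dialHealthy probes each endpoint with eth_blockNumber and closes the client
// whenever the probe fails, so no connection is leaked for a dead endpoint.
func dialHealthy(ctx context.Context, urls []string) []*rpc.Client {
    var clients []*rpc.Client
    for _, url := range urls {
        client, err := rpc.Dial(url)
        if err != nil {
            continue
        }
        var blockNumber hexutil.Uint64
        if err := client.CallContext(ctx, &blockNumber, "eth_blockNumber"); err != nil {
            client.Close() // release the connection on a failed health check
            continue
        }
        clients = append(clients, client)
    }
    return clients
}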


Checked for bugs and CLAUDE.md compliance.

@manav2401
Member Author

manav2401 commented Jan 27, 2026

A question about a hypothetical scenario - if a node receives a transaction but is killed before it broadcasts the txn, that txn will be lost. It seems like a hard edge case where we need to persist the txn data on disk?

Great point. Yeah, I've kept a note of this to add persistence for storing transactions / preconf tasks. Due to the Amoy rollout next week, I am planning to add this in the next iteration.

That said, I've discussed with DevOps the approach to use when restarting bor; it doesn't require persistence and still ensures that no tx submission requests are dropped. Thanks!

@claude

claude bot commented Jan 28, 2026

Code Review

Found 4 issues during code review.

Issue 1: Missing nil check in CheckTxPreconfStatus

File: eth/relay/service.go:184

The CheckTxPreconfStatus method calls s.multiclient.checkTxStatus(hash) without checking if s.multiclient is nil first. This will cause a nil pointer panic when multiclient is nil (which can happen if no block producer URLs are provided or all connections fail).

Other methods like SubmitTransactionForPreconf (line 88) and SubmitPrivateTx (line 118) correctly check for s.multiclient == nil before using it.

Fix: Add a nil check before line 184:

if s.multiclient == nil {
    return false, errRpcClientUnavailable
}

Reference: CLAUDE.md - Common Pitfalls: "Nil Checks: Check interface values and pointers before use"


Issue 2: Inconsistent preconf status check logic

File: eth/relay/multiclient.go:275-276

In checkTxStatus, only TxStatusPending is counted as preconfirmed:

if txStatus == txpool.TxStatusPending {
    preconfOfferedCount.Add(1)
}

However, in internal/ethapi/api.go SendRawTransactionForPreconf (lines 2263-2267), both TxStatusPending and TxStatusIncluded are treated as valid preconfirmations:

if txStatus == txpool.TxStatusPending || txStatus == txpool.TxStatusIncluded {
    txConfirmed = true
}

This inconsistency means that if a transaction is already included in a block, the block producer's endpoint reports it as preconfirmed, but the relay's status check would report it as NOT preconfirmed.

Fix: Update line 275 in multiclient.go to also check for TxStatusIncluded:

if txStatus == txpool.TxStatusPending || txStatus == txpool.TxStatusIncluded {
    preconfOfferedCount.Add(1)
}

Issue 3: Race condition in test - unprotected map access

File: eth/handler_eth_test.go:821-829 and eth/handler_eth_test.go:923

The PrivateTxStore struct uses an unprotected map[common.Hash]struct{}:

type PrivateTxStore struct {
    store map[common.Hash]struct{}
}

This map is:

  • Written to from the main goroutine (lines 869-870, 940-941)
  • Read from multiple concurrent goroutines (lines 923, 953)

While there is a wg.Wait() between some writes and reads, the Go memory model requires explicit synchronization (not just program order) to establish happens-before relationships between goroutines.

Fix: Add a sync.RWMutex to protect map access:

type PrivateTxStore struct {
    mu    sync.RWMutex
    store map[common.Hash]struct{}
}

func (p *PrivateTxStore) IsTxPrivate(hash common.Hash) bool {
    p.mu.RLock()
    defer p.mu.RUnlock()
    _, ok := p.store[hash]
    return ok
}

Reference: CLAUDE.md - Common Pitfalls: "Race Conditions: Use -race flag, protect shared state"


Issue 4: Resource accumulation due to defer in loop

File: eth/handler_eth_test.go:757-765 and eth/handler_eth_test.go:881-889

Both testDisableTxPropagation and testPrivateTxNotPropagated use defer statements inside a for loop:

for i, sink := range sinks {
    sourcePipe, sinkPipe := p2p.MsgPipe()
    defer sourcePipe.Close()
    defer sinkPipe.Close()
    // ...
    defer sourcePeer.Close()
    defer sinkPeer.Close()
    // ...
}

Deferred functions execute when the function returns, not when the loop iteration ends. With 10 iterations, all 40 resources remain open throughout the entire test until the function exits, which can cause resource exhaustion.

Fix: Either close resources explicitly at the end of each iteration, or wrap the loop body in an anonymous function:

for i, sink := range sinks {
    func() {
        sourcePipe, sinkPipe := p2p.MsgPipe()
        defer sourcePipe.Close()
        defer sinkPipe.Close()
        // ...
    }()
}

Reference: CLAUDE.md - Common Pitfalls: "Close Resources: Channels, files, DB iterators, HTTP bodies"

@manav2401
Member Author

@claude addressed the comments. Regarding the last issue about resource accumulation, the resources are intentionally kept open and are released only at the end of the function.

@claude

claude bot commented Jan 29, 2026

Claude finished @manav2401's task in 39s


Verification Results

  • Check Issue 1: Missing nil check in CheckTxPreconfStatus
  • Check Issue 2: Inconsistent preconf status check logic
  • Check Issue 3: Race condition in test - unprotected map access
  • Acknowledge Issue 4: Resource accumulation (intentional behavior)

All issues addressed. Here's my verification:

Issue 1: Missing nil check ✅ Fixed

The nil check for s.multiclient has been added at eth/relay/service.go:182-184:

if s.multiclient == nil {
    return false, errRpcClientUnavailable
}

Issue 2: Inconsistent preconf status check ✅ Fixed

Both SendRawTransactionForPreconf and checkTxStatus now check only for TxStatusPending, so the logic is consistent across both files.

Issue 3: Race condition in test ✅ Fixed

The PrivateTxStore struct now includes proper synchronization with sync.RWMutex at eth/handler_eth_test.go:822-832:

type PrivateTxStore struct {
    mu    sync.RWMutex
    store map[common.Hash]struct{}
}

func (p *PrivateTxStore) IsTxPrivate(hash common.Hash) bool {
    p.mu.RLock()
    defer p.mu.RUnlock()
    _, ok := p.store[hash]
    return ok
}

All map accesses are now properly protected with mutex locks.

Issue 4: Resource accumulation ✅ Acknowledged

Understood - the defer statements in loops are intentional since the resources should remain open for the duration of the test function, not just the loop iteration. The test needs all peer connections to remain open throughout the test execution.


@manav2401 manav2401 changed the base branch from raneet10/new-upstream-v1.16.7 to develop February 3, 2026 07:09
@manav2401
Member Author

@claude can you review this PR, focusing on safety and performance? Provide a summary as a comment highlighting any concerns related to these changes.

@claude

claude bot commented Feb 3, 2026

Claude finished @manav2401's task in 1m 48s


Safety & Performance Review

  • Gather context and read changed files
  • Analyze safety concerns (race conditions, nil checks, error handling)
  • Analyze performance concerns (goroutines, allocations, blocking)
  • Review test coverage adequacy
  • Provide summary with concerns

Summary

This PR introduces a solid relay architecture for preconfirmations and private transactions. The implementation follows good practices overall, with proper mutex usage, graceful shutdown handling, and reasonable test coverage. Below are my findings focused on safety and performance.


Safety Concerns

1. time.Sleep in API Handler (Medium)

Location: internal/ethapi/api.go:2252

time.Sleep(100 * time.Millisecond)

A fixed 100ms sleep in SendRawTransactionForPreconf blocks the RPC handler. Under high load, this delays responses and can exhaust connection pools. The comment acknowledges this needs improvement.

Recommendation: Consider using a subscription-based approach or polling with a configurable timeout that respects the request context.
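
A sketch of the polling alternative, with all names assumed; the known callback stands in for whatever pool-status lookup the handler has available:

import (
    "context"
    "time"

    "github.com/ethereum/go-ethereum/common"
)

// waitForPreconf polls the pool-status callback until the transaction is
// known, the configured timeout elapses, or the request context is cancelled.
func waitForPreconf(ctx context.Context, known func(common.Hash) bool, hash common.Hash, timeout time.Duration) (bool, error) {
    ctx, cancel := context.WithTimeout(ctx, timeout)
    defer cancel()
    ticker := time.NewTicker(10 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return false, ctx.Err()
        case <-ticker.C:
            if known(hash) {
                return true, nil
            }
        }
    }
}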


2. Unbounded Goroutine Spawning in Retry Loop (Low-Medium)

Location: eth/relay/multiclient.go:224

for _, index := range currentFailedIndices {
    retryWg.Add(1)
    go func(client *rpc.Client, idx int) {
        // ...
    }(mc.clients[index], index)
}

Each retry iteration spawns N goroutines (one per failed block producer). While bounded by the number of BPs and privateTxMaxRetries (5), a malicious actor could flood private tx submissions, causing goroutine accumulation during retries.

Mitigations in place: Max 5 retries, 2s interval, limited by number of BPs.


3. Background Retry Has No Cancellation Context (Low)

Location: eth/relay/multiclient.go:185

go mc.retryPrivateTxSubmission(hexTx, hash, failedIndices, txGetter)

The background retry goroutine has no way to be cancelled during graceful shutdown. While close() cleans up RPC clients, active retry goroutines will continue until completion or timeout.

Recommendation: Pass a context from the service that can be cancelled during shutdown.
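
A sketch of what a cancellable retry loop could look like; the function shape is assumed, and the service's shutdown path would cancel the supplied context:

import (
    "context"
    "errors"
    "time"
)

// retryWithContext retries submit up to attempts times, sleeping interval
// between tries, and stops immediately once the context is cancelled.
func retryWithContext(ctx context.Context, attempts int, interval time.Duration, submit func(context.Context) error) error {
    for i := 0; i < attempts; i++ {
        if err := submit(ctx); err == nil {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err() // shutdown requested, abandon remaining retries
        case <-time.After(interval):
        }
    }
    return errors.New("all retry attempts failed")
}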


4. Potential Panic if clients Slice is Mutated (Low)

Location: eth/relay/multiclient.go:241

go func(client *rpc.Client, idx int) {
    // ...
}(mc.clients[index], index)

If mc.clients were ever modified while retryPrivateTxSubmission is running, this could cause a race. Currently safe because clients is immutable after initialization, but worth documenting.


Performance Considerations

1. Semaphore Blocking Pattern (Good ✓)

Location: eth/relay/service.go:123

The semaphore pattern for limiting concurrent tasks is well-implemented:

s.semaphore <- struct{}{}
go func(task TxTask) {
    defer func() { <-s.semaphore }()
    s.processPreconfTask(task)
}(task)

Default of 1024 concurrent tasks is reasonable.


2. Cache Lock Contention Under High Load (Medium)

Location: eth/relay/service.go:160-161

s.storeMu.Lock()
defer s.storeMu.Unlock()

The updateTaskInCache function holds a full lock (not RLock) for the entire update operation. Under high concurrency, this becomes a bottleneck since all cache writes are serialized.

Recommendation: Consider using sync.Map or a sharded lock approach if contention becomes an issue.
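
For illustration, a sharded variant could look like the sketch below; shardedStore and TxTask (used here only as a placeholder for the relay's task type) are assumptions, not code from this PR:

import (
    "sync"

    "github.com/ethereum/go-ethereum/common"
)

const shardCount = 16

// TxTask stands in for the relay's per-transaction task type (fields elided).
type TxTask struct{}

type taskShard struct {
    mu    sync.Mutex
    tasks map[common.Hash]TxTask
}

type shardedStore struct {
    shards [shardCount]taskShard
}

func newShardedStore() *shardedStore {
    s := &shardedStore{}
    for i := range s.shards {
        s.shards[i].tasks = make(map[common.Hash]TxTask)
    }
    return s
}

// put spreads writes across independent locks; the first byte of the hash
// selects the shard, so unrelated updates no longer contend on a single mutex.
func (s *shardedStore) put(hash common.Hash, task TxTask) {
    shard := &s.shards[int(hash[0])%shardCount]
    shard.mu.Lock()
    shard.tasks[hash] = task
    shard.mu.Unlock()
}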


3. Map Iteration During Cleanup (Low)

Location: eth/relay/service.go:264-269

s.storeMu.Lock()
for hash, task := range s.store {
    if now.Sub(task.insertedAt) > s.config.expiryInterval {
        delete(s.store, hash)
    }
}
s.storeMu.Unlock()

Full lock during map iteration blocks all concurrent read/write operations. With 40k tasks and 10-minute expiry, the map could grow large.

Mitigations: Only runs every minute, and concurrent processing is rate-limited.
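
One way to shorten the write-lock window is a two-phase sweep, sketched below against the field names shown above (assumed to be unchanged):

// Collect expired hashes under the read lock, then take the write lock
// only for the deletions, so reads are blocked for a much shorter time.
var expired []common.Hash
s.storeMu.RLock()
for hash, task := range s.store {
    if now.Sub(task.insertedAt) > s.config.expiryInterval {
        expired = append(expired, hash)
    }
}
s.storeMu.RUnlock()

s.storeMu.Lock()
for _, hash := range expired {
    delete(s.store, hash)
}
s.storeMu.Unlock()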


4. Channel Buffer Sizing (Good ✓)

Location: eth/relay/service.go:31-35

maxQueuedTasks:       40_000,
maxConcurrentTasks:   1024,

Default values are reasonable for expected load. Queue overflow returns explicit error rather than blocking.


5. RPC Timeout Configuration (Good ✓)

Location: eth/relay/multiclient.go:17-18

rpcTimeout             = 2 * time.Second
privateTxRetryInterval = 2 * time.Second

2-second timeouts prevent hanging on unresponsive block producers.


Test Coverage

The test suite is comprehensive with coverage of:

  • ✓ Queue overflow scenarios
  • ✓ Concurrent task processing
  • ✓ Cache update race conditions
  • ✓ Private transaction propagation blocking
  • ✓ "Already known" error handling
  • ✓ Retry logic with TxGetter

However, the codecov report shows some files have low coverage:

  • eth/relay/relay.go: 0% (mostly thin wrappers, acceptable)
  • internal/ethapi/api.go: 0% on new code (new RPC endpoints)
  • eth/api_backend.go: 0% on new code

Additional Notes

Addressed from Previous Reviews ✓

I note that issues from the previous Claude review have been addressed:

  • Nil check added for multiclient in CheckTxPreconfStatus
  • Preconf status check logic made consistent
  • Race condition in test fixed with sync.RWMutex

Transaction Persistence (Acknowledged)

The PR description notes that persistence for private transactions is a TODO for the next iteration, which would address data loss during node restarts.


Verdict

The implementation is well-architected for the relay use case with appropriate safety mechanisms. The main areas to watch are:

  1. The 100ms sleep in the RPC handler under high load
  2. Cache lock contention if task volume grows significantly
  3. Background retry goroutines during shutdown

These are not blocking issues for the current scope, but worth monitoring in production.


@sonarqubecloud

sonarqubecloud bot commented Feb 3, 2026

Quality Gate failed

Failed conditions
16.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud
