Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Nov 29, 2025

  • Explore repository structure and understand the codebase
  • Read issue Automatic initial shard/cdc replica sync #78 to understand requirements
  • Create detailed implementation plan with flow control, protocol, and retry mechanisms
  • Address reviewer feedback on TCP handling and message definitions
  • Address reviewer feedback on CF name matching (not just indices)
  • Implement sync protocol messages and constants
  • Implement ShardSyncServer (TCP server for shard sync)
  • Await confirmation to continue with CDC sync server and client implementation

Implementation Progress

Completed: Sync Protocol Messages

Added new sync message types for replica database synchronization:

New Message Types (go/msgs/msgs.go):

  • ColumnFamilyInfo - Maps CF index to name for proper matching between replicas
  • SyncStartReq - Request to start sync (contains lastAppliedLogEntry)
  • SyncStartResp - Response with snapshot metadata (snapshotLogEntry, columnFamilies[], totalSizeEstimate)
  • SyncChunkReq - Request for data chunk (cfIndex, cfName, keyStart, maxSize)
  • SyncChunkResp - Response with chunk data (keys, values, nextKeyStart, isLastCF)
  • SyncCompleteReq/Resp - Finalize sync operation

New Protocol Constants (cpp/core/Protocol.hpp and go/msgs/msgs.go):

// 'SYN\0' and 'SYN\1' protocol versions
constexpr uint32_t SYNC_REQ_PROTOCOL_VERSION = 0x4e5953;
constexpr uint32_t SYNC_RESP_PROTOCOL_VERSION = 0x14e5953;

Generated Code:

  • Updated bincodegen.go to include sync messages
  • Generated SyncMessageKind enum (SYNC_START, SYNC_CHUNK, SYNC_COMPLETE)
  • Generated SyncReqContainer and SyncRespContainer types
  • Generated serialization/deserialization code for all sync messages
  • Added SyncReqMsg and SyncRespMsg type aliases

Completed: ShardSyncServer

Added TCP server for shard sync protocol (cpp/shard/ShardSyncServer.{hpp,cpp}):

Features:

  • Epoll-based non-blocking TCP server following RegistryServer pattern
  • Configurable bind address and max connections
  • Message framing: [protocol_version:u32][length:u32][payload:bytes]
  • Client state tracking with read/write buffers
  • Request/response lifecycle management

API:

  • init() - Initialize listen socket and epoll
  • receiveMessages(timeout) - Poll for incoming sync requests
  • sendSyncResponses(responses) - Send responses to clients
  • receivedSyncRequests() - Get received requests for processing

Next Steps (Awaiting Confirmation)

  • Implement CDCSyncServer (TCP server for CDC sync)
  • Implement SyncClient (destination sync logic)
  • Add needsInitialSync() to SharedRocksDB
  • Integrate sync into Shard/CDC startup
Original prompt

issue #78
plan first and wait for confirmation before executing on the plan


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI changed the title [WIP] Plan and wait for confirmation before execution Add implementation plan for automatic shard/CDC replica sync Nov 29, 2025
Copilot AI requested a review from mcrnic November 29, 2025 22:23
Copy link
Collaborator

@mcrnic mcrnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot Plan in more detail how will the destination replica control flow of data from source replica.
Which network protocol will be used, how does it fit in current request/response flow.
How will the retry happen in case source or destination gets shut down?

Copy link
Contributor Author

Copilot AI commented Nov 29, 2025

@copilot Plan in more detail how will the destination replica control flow of data from source replica.
Which network protocol will be used, how does it fit in current request/response flow.
How will ...

I've expanded the implementation plan with detailed answers to your questions:

Flow Control: Destination-driven pull model where the new replica controls pacing by requesting one chunk at a time. The destination specifies maxSize per request, provides cfIndex + offset for resume points, and only requests the next chunk after processing the current one.

Protocol: TCP with Bincode serialization following the existing RegistryClient.cpp pattern. Uses [protocol_version:u32][length:u32][payload:bytes] framing with new SYNC_REQ_PROTOCOL_VERSION and SYNC_RESP_PROTOCOL_VERSION constants.

Retry Handling:

  • Source shutdown: Destination detects TCP error, clears partial data, queries Registry for other healthy replicas, retries with different source using exponential backoff (1s → 60s max)
  • Destination shutdown: Source cleans up temp snapshot; destination restarts sync from beginning after reboot
  • Network partition: After 3 consecutive chunk timeouts (30s each), abort and try different replica

See full PR description for sequence diagram, message types, and state machine details.

Copilot AI requested a review from mcrnic November 29, 2025 22:35
Copy link
Collaborator

@mcrnic mcrnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot shard and cdc currently don't accept or handle tcp. this needs to be implemented similar to how it's done in registry. There is no msgs.yaml
msgs definitions are added in msgs.go and need to be added to bincodegen as well.
Please also explain how will sync decide which column family it's requesting/getting data for and how will source find next chunk to iterate

Copy link
Contributor Author

Copilot AI commented Nov 29, 2025

@copilot shard and cdc currently don't accept or handle tcp. this needs to be implemented similar to how it's done in registry. There is no msgs.yaml
msgs definitions are added in msgs.go and need to ...

Updated the plan to address your feedback:

TCP Implementation: Shards/CDC will get new TCP servers modeled after RegistryServer pattern (_acceptConnection, _readClient, _writeClient, epoll-based client tracking). Added ShardSyncServer and CDCSyncServer to the files list.

Message Definitions: Corrected to use msgs.go + bincodegen.go (not YAML). Added concrete Go struct definitions for SyncStartReq/Resp, SyncChunkReq/Resp, SyncCompleteReq/Resp and the corresponding bincodegen registration.

Column Family Selection & Chunk Iteration:

  • Destination decides which CF: Tracks _currentCfIndex and _currentKeyStart. Initially requests cfIndex=0 with empty keyStart. After each response, either advances keyStart within same CF, or moves to next CF when current is complete.
  • Source finds next chunk: Uses RocksDB iterator with Seek(keyStart) or SeekToFirst(). Reads up to maxSize bytes, returns nextKeyStart for resume point. When iterator exhausts, returns empty nextKeyStart to signal CF completion.

See full PR description for code examples and sequence diagram.

Copilot AI requested a review from mcrnic November 29, 2025 22:46
Copy link
Collaborator

@mcrnic mcrnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot the cf index is not enough. Destination needs to match index to local column family name. There is no guarantee indices will be the same.
The names of column families need to match. I suggest transferring index and cf name pairs in initial sync response.

Copy link
Contributor Author

Copilot AI commented Nov 29, 2025

@copilot the cf index is not enough. Destination needs to match index to local column family name. There is no guarantee indices will be the same.
The names of column families need to match. I suggest...

Updated the plan to address CF name matching:

SYNC_START_RESPONSE now includes columnFamilies: []ColumnFamilyInfo - a list of {index, name} pairs for all column families in the snapshot.

Destination uses names for local CF lookup:

  • Initializes _localCfHandles map from CF name → local handle
  • Creates missing CFs if needed
  • Writes chunks using CF name lookup, not index

Both request and response include cfName for verification:

  • SYNC_CHUNK_REQUEST includes both cfIndex and cfName
  • Source verifies cfName matches expected name for that index
  • SYNC_CHUNK_RESPONSE echoes back cfName for destination to verify

See updated PR description for complete ColumnFamilyInfo struct definition and flow diagram.

Copilot AI requested a review from mcrnic November 29, 2025 22:55
Copy link
Collaborator

@mcrnic mcrnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot start implementation.
Add only new messages and protocol for now and wait for confirmation to continue

…nization

Co-authored-by: mcrnic <11664456+mcrnic@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Nov 29, 2025

@copilot start implementation.
Add only new messages and protocol for now and wait for confirmation to continue

Added sync protocol messages and constants in commit 694a6d9:

Messages (msgs.go):

  • ColumnFamilyInfo - maps CF index to name
  • SyncStartReq/Resp - initiate sync with CF mapping
  • SyncChunkReq/Resp - transfer data chunks with CF name verification
  • SyncCompleteReq/Resp - finalize sync

Protocol (Protocol.hpp):

  • SYNC_REQ_PROTOCOL_VERSION = 0x4e5953 ('SYN\0')
  • SYNC_RESP_PROTOCOL_VERSION = 0x14e5953 ('SYN\1')
  • SyncReqMsg/SyncRespMsg type aliases

Generated: SyncMessageKind enum, SyncReqContainer/SyncRespContainer, serialization code.

Awaiting confirmation to continue with TCP server implementation.

Copilot AI requested a review from mcrnic November 29, 2025 23:10
Copy link
Collaborator

@mcrnic mcrnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot proceed with shard sync server implementation then wait confirmation

Co-authored-by: mcrnic <11664456+mcrnic@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Nov 29, 2025

@copilot proceed with shard sync server implementation then wait confirmation

Added ShardSyncServer in commit 7e01e71:

Files: cpp/shard/ShardSyncServer.{hpp,cpp}

Features:

  • Epoll-based non-blocking TCP server following RegistryServer pattern
  • Message framing: [protocol_version:u32][length:u32][payload:bytes]
  • Client state tracking with read/write buffers
  • Configurable bind address and max connections

API:

  • init() - Initialize listen socket and epoll
  • receiveMessages(timeout) - Poll for incoming sync requests
  • sendSyncResponses(responses) - Send responses to clients
  • receivedSyncRequests() - Get received requests for processing

Awaiting confirmation to continue with CDC sync server and client implementation.

Copilot AI requested a review from mcrnic November 29, 2025 23:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants