
Conversation

hkadayam (Owner) commented Aug 8, 2025

No description provided.

sanebay and others added 30 commits September 25, 2024 15:17
When replacing a member, add the new member, sync the raft log
for the replace, and finally remove the old member. Once we add
the new member, a baseline or incremental resync will start.
Removing the old member causes nuraft_mesg to exit
the group, and we periodically GC the destroyed group.
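
A minimal sketch of that sequence, assuming hypothetical names (ReplDev, add_member, sync_raft_log_for_replace, remove_member) rather than the actual homestore API:

```cpp
#include <string>

// Illustrative stand-in; the real homestore/nuraft_mesg types differ.
struct ReplDev {
    void add_member(const std::string& id) {}     // kicks off baseline/incremental resync
    void sync_raft_log_for_replace() {}           // replicate the replace intent
    void remove_member(const std::string& id) {}  // nuraft_mesg exits the group;
                                                  // destroyed groups are GC'd periodically
};

// The ordering described above: add the new member first, sync the raft log,
// remove the old member last.
void replace_member(ReplDev& rd, const std::string& old_id, const std::string& new_id) {
    rd.add_member(new_id);
    rd.sync_raft_log_for_replace();
    rd.remove_member(old_id);
}
```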
Made the repl dev base test common so that both test files
can use it. Tests by default create a repl group with num_replicas.
Dynamic tests create additional spare replicas, which can be
added to the test dynamically by calling replace member.
Sealer is a special consumer that reports where the cp is up to.
It is the first one during cp switchover, serving as a conservative marker: everything
at or before this point should be in the current cp (some consumers may be above this point, which is fine).
The Sealer is also the last one during cp flush, running after all other services have flushed successfully.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
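
A minimal sketch of that ordering, assuming a hypothetical CPConsumer interface (the real homestore CP types differ):

```cpp
#include <cstdint>
#include <vector>

// Illustrative CP consumer interface; not the actual homestore API.
struct CPConsumer {
    virtual void switchover(uint64_t cp_id) = 0;  // the cp boundary is being sealed
    virtual void flush(uint64_t cp_id) = 0;       // persist everything in this cp
    virtual ~CPConsumer() = default;
};

// The Sealer switches over first (conservative "everything at or before this
// point is in the current cp" marker) and flushes last, after all other
// services have flushed successfully.
void run_cp_cycle(CPConsumer& sealer, std::vector<CPConsumer*>& others, uint64_t cp_id) {
    sealer.switchover(cp_id);                     // first during switchover
    for (auto* c : others) c->switchover(cp_id);
    for (auto* c : others) c->flush(cp_id);
    sealer.flush(cp_id);                          // last during flush
}
```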
Add cert watcher and restart raft service when cert is updated
The previous code could underflow io_size, i.e.

remaining_io_size -= sub_io_size;

where sub_io_size > remaining_io_size. Since remaining_io_size is
unsigned, it wraps around to a huge number and takes ages to finish.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
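
A minimal sketch of the fixed loop, with pick_sub_io_size and do_sub_io as hypothetical helpers:

```cpp
#include <algorithm>
#include <cstdint>

uint64_t pick_sub_io_size();   // hypothetical helper
void do_sub_io(uint64_t sz);   // hypothetical helper

void submit_io(uint64_t total_io_size) {
    uint64_t remaining_io_size = total_io_size;
    while (remaining_io_size > 0) {
        // Clamp before subtracting: if sub_io_size > remaining_io_size, the
        // unsigned subtraction would wrap to a huge value and loop for ages.
        uint64_t sub_io_size = std::min(pick_sub_io_size(), remaining_io_size);
        if (sub_io_size == 0) break;  // defensive: avoid spinning
        do_sub_io(sub_io_size);
        remaining_io_size -= sub_io_size;  // now safe
    }
}
```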
We see a no-space error in the write_to_full UT, likely because
when the space left == max_wrt_sz we take max_wrt_sz,
yet two extra blks are needed.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
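
A minimal sketch of the suspected fix, under the commit's own hypothesis; choose_write_size and the two-blk headroom are illustrative, not the actual homestore code:

```cpp
#include <algorithm>
#include <cstdint>

// When the space left equals max_wrt_sz, writing max_wrt_sz still fails
// because two extra blks are needed; reserve that headroom up front.
uint64_t choose_write_size(uint64_t space_left, uint64_t max_wrt_sz, uint64_t blk_size) {
    const uint64_t headroom = 2 * blk_size;  // the two extra blks
    if (space_left <= headroom) return 0;    // nothing safely writable
    return std::min(max_wrt_sz, space_left - headroom);
}
```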
Add replica member info with name, priority and id.
Use replica member info for the replace-member API and listener callbacks.
Signed-off-by: Jilong Kou <jkou@ebay.com>
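
An illustrative shape of the struct; the actual homestore definition and field types may differ:

```cpp
#include <cstdint>
#include <string>

// Illustrative only; the real struct likely uses a uuid type for id.
struct replica_member_info {
    std::string id;       // replica identity, shown as a string for brevity
    std::string name;     // human-readable member name
    int32_t priority{0};  // election priority
};

// The replace-member API and listener callbacks take this struct instead of a
// bare id, e.g. replace_member(member_out, member_in)  (hypothetical signature).
```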
Concurrent writes to m_down_buffers may cause data inconsistency.
Add a mutex lock to IndexBuffer and extract the add/remove
operations into member functions to make the vector thread-safe.

Signed-off-by: Jilong Kou <jkou@ebay.com>
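
A minimal sketch of that fix; the member names mirror the commit message but the code is illustrative, not the exact homestore implementation:

```cpp
#include <algorithm>
#include <memory>
#include <mutex>
#include <vector>

struct IndexBuffer {
    // All mutation of m_down_buffers goes through these two member functions,
    // each holding the mutex, so concurrent writers can no longer race.
    void add_down_buffer(const std::shared_ptr<IndexBuffer>& buf) {
        std::lock_guard<std::mutex> lg{m_down_buffers_mtx};
        m_down_buffers.push_back(buf);
    }

    void remove_down_buffer(const std::shared_ptr<IndexBuffer>& buf) {
        std::lock_guard<std::mutex> lg{m_down_buffers_mtx};
        m_down_buffers.erase(
            std::remove(m_down_buffers.begin(), m_down_buffers.end(), buf),
            m_down_buffers.end());
    }

private:
    std::mutex m_down_buffers_mtx;
    std::vector<std::shared_ptr<IndexBuffer>> m_down_buffers;
};
```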
* Implement GC_REPL_REQ Based on DSN to Prevent Resource Leaks

This commit introduces a mechanism to garbage collect (GC) replication requests
(rreqs) that may hang indefinitely, thereby consuming memory and disk resources
unnecessarily. These rreqs can enter a hanging state under several
circumstances, as outlined below:

1. Scenario with Delayed Commit:
   - Follower F1 receives LSN 100 and DSN 104 from Leader L1 and takes longer
     than the raft timeout to precommit/commit it.
   - L1 resends LSN 100, causing F1 to fetch the data again. Since LSN 100 was
     committed in a previous attempt, this log entry is skipped, leaving the
     rreq hanging indefinitely.

2. Scenario with Leader Failure Before Data Completion:
   - Follower F1 receives LSN 100 from L1, but before all data is fetched/pushed,
     L1 fails and L2 becomes the new leader.
   - L2 resends LSN 100 with L2 as the new originator. F1 proceeds with the new
     rreq and commits it, but the initial rreq from L1 hangs indefinitely as it
     cannot fetch data from the new leader L2.

3. Scenario with Leader Failure After Data Write:
   - Follower F1 receives data (DSN 104) from L1 and writes it. Before the log of
     LSN 100 reaches F1, L1 fails and L2 becomes the new leader.
   - L2 resends LSN 100 to F1, and F1 fetches DSN 104 from L2, leaving the
     original rreq hanging.

This garbage collection process cleans up based on DSN. Any rreqs in
`m_repl_key_req_map`, whose DSN is already committed (`rreq->dsn <
repl_dev->m_next_dsn`), will be GC'd. This is safe on the follower side, as the
follower updates `m_next_dsn` during commit. Any DSN below `cur_dsn` should
already be committed, implying that the rreq should already be removed from
`m_repl_key_req_map`.

On the leader side, since `m_next_dsn` is updated when sending out the proposal,
it is not safe to clean up based on `m_next_dsn`. Therefore, we explicitly skip
the leader in this GC process.
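
A minimal sketch of that GC pass, with illustrative stand-ins for the repl dev internals:

```cpp
#include <cstdint>
#include <map>
#include <memory>

struct repl_req { uint64_t dsn{0}; /* ... */ };  // illustrative stand-in

struct RaftReplDevSketch {
    bool m_leader{false};
    uint64_t m_next_dsn{0};  // follower side: advanced during commit
    std::map<uint64_t, std::shared_ptr<repl_req>> m_repl_key_req_map;  // keyed by dsn here

    bool is_leader() const { return m_leader; }

    void gc_repl_reqs() {
        // The leader bumps m_next_dsn when sending the proposal, so DSN-based
        // cleanup is only safe on followers; skip the leader explicitly.
        if (is_leader()) return;
        for (auto it = m_repl_key_req_map.begin(); it != m_repl_key_req_map.end();) {
            // dsn < m_next_dsn means the DSN is already committed; a lingering
            // entry is one of the hanging rreqs described above.
            if (it->first < m_next_dsn) it = m_repl_key_req_map.erase(it);
            else ++it;
        }
    }
};
```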



Skip localizing raft logs we have already committed.

The leader may send duplicate raft logs; if we localize them
unconditionally, duplicate data will be written to the chunk during
fetch_data.

It is safe for us to skip logs that are already committed:
there is no way those LSNs can be overwritten.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
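
A minimal sketch of that guard; names are illustrative:

```cpp
#include <cstdint>

// Localizing a log triggers fetch_data, so a duplicate log the leader resends
// for an already-committed LSN must be skipped, or duplicate data lands in
// the chunk.
bool should_localize(int64_t log_lsn, int64_t committed_lsn) {
    // Committed LSNs can never be overwritten, so skipping them is safe.
    return log_lsn > committed_lsn;
}
```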
Data buffer persists in memory until rreq is committed or rolled back.

This approach poses issues during recovery. As new data arrives via
push_data and is written to disk, it remains in memory for an extended
period until the replica catches up and commits the rreq.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
* add rollback on state machine

---------

Signed-off-by: yawzhang <yawzhang@ebay.com>
* PushData only pushes to active followers.

If a follower is lagging too far behind, do not flood it with data
from new IOs (new rreqs, new LSNs); reserve that capacity
for catching up. The follower can request the data via FetchData.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
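
A minimal sketch of that gate; the threshold and names are assumptions, not actual homestore tunables:

```cpp
#include <cstdint>

constexpr int64_t k_max_push_lag = 1000;  // assumed tunable, not a real knob

// Push data for a new rreq only to followers that are reasonably caught up.
// A follower lagging beyond the threshold is left to pull via FetchData, so
// new IOs do not crowd out its catch-up traffic.
bool should_push_data(int64_t leader_lsn, int64_t follower_lsn) {
    return (leader_lsn - follower_lsn) <= k_max_push_lag;
}
```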
When a follower hits an error before appending log entries, it sets batch_size_hint_in_bytes to -1 to ask the leader not to send more log entries in the next append_log_req.
https://github.com/eBay/NuRaft/blob/eabdeeda538a27370943f79a2b08b5738b697ac3/src/handle_append_entries.cxx#L760

In the nuobject case, if a new member is added to a raft group and it tries to append a create_shard log entry, it will try to allocate a block from the chunks of the pg before the create_pg log, which allocates chunks to this pg, is committed. An error occurs, the log batch containing the create_shard entry is rejected wholesale, and batch_size_hint_in_bytes is set to -1 in the response to the leader.

This PR sets the log count of the next batch sent to the follower to 1, so that:

If create_pg and create_shard are in the same log batch, the follower first rejects the batch and the leader sends only create_pg in the next one, which the follower accepts since it only has to create the pg.

If create_pg and create_shard are not in the same log batch, and create_shard tries to allocate a block before its pg has been created (i.e. before the chunks of the pg are allocated), then with this PR the follower rejects the batch, giving pg creation more time. The create_shard log is resent in the next batch, by which point the pg has probably already been created successfully.
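
A minimal sketch of the leader-side reaction, with illustrative names (the -1 hint convention is NuRaft's; the count-of-1 override approximates this PR's behavior):

```cpp
#include <cstdint>

// A hint of -1 from the follower means "stop sending more entries"; shrink
// the next batch to a single log entry so a dependency like create_pg can
// commit before create_shard is retried.
int32_t next_batch_log_count(int64_t follower_hint_bytes, int32_t normal_count) {
    return (follower_hint_bytes < 0) ? 1 : normal_count;
}
```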
We don't need to panic in this case; FetchData can handle it.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
Add application_hint to the blk_alloc_hints structure. This change addresses the need for certain users of homestore, such as homeobject, to pass additional hints. The application_hint can be used to specify behavior in the select_chunk interface.
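
An illustrative shape of the change; field names besides application_hint are placeholders:

```cpp
#include <cstdint>
#include <optional>

// blk_alloc_hints gains an application_hint that an upper layer such as
// homeobject can set and a custom chunk selector can read in select_chunk().
struct blk_alloc_hints {
    uint32_t desired_temp{0};                  // placeholder for existing hints
    std::optional<uint64_t> application_hint;  // opaque value owned by the application
};
```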
raakella1 and others added 23 commits July 11, 2025 12:41
* fix for put remove crash test bug

* Fix repair_links during crash recovery

* update root when the current root splits during repair_links

* add more trace logging for btree recovery

* call repair_links on the buffer which is pruned due to zero down buffers

* upgrade conan version

* fix an issue in the prune buffer code

* Relax the sanity check condition about the child key and previous parent key comparison

* add more comments to the code

---------

Co-authored-by: Ravi Nagarjun Akella <raakella1@$HOSTNAME>
In the disk replacement scenario, we replace a bad device with a brand-new device, which needs to be formatted.
Formatting has three logical parts:

1. first block: this can be recovered from other existing devices.
2. pdev info:
   2.1 pdev header
   2.2 format chunk slots
3. vdev info: vdev info can be recovered from other existing devices, and the missing chunks can be inferred. Add the new pdev into the vdev as needed.

Note:
- The pdev_id is monotonically increasing.
- The chunk_id might be reused, so a custom_chunk_selector should pay close attention to it.

Expose pdev_name in VChunk so users can get the logical-entity-to-physical-device map, which is helpful for admins and operators.
Record cur_pdev_id in the first block header.
gen_number in the first block header tracks changes to the first block attributes. Compare gen_number to identify the newest first block header in load_devices(), and increase gen_number every time an attribute (like cur_pdev_id) changes.
gen_number conflicts might arise due to interruption during sequential commit_formatting, but they can be identified and corrected to the latest one on the next startup.
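
A minimal sketch of the gen_number comparison, with an illustrative field layout:

```cpp
#include <cstdint>

struct first_block_header {
    uint64_t gen_number;   // bumped on every attribute change (e.g. cur_pdev_id)
    uint32_t cur_pdev_id;
    // ... other attributes
};

// During load_devices(), the copy with the highest gen_number wins; after an
// interrupted commit_formatting, the stale copies are corrected on the next
// startup.
const first_block_header* pick_latest(const first_block_header* a,
                                      const first_block_header* b) {
    return (a->gen_number >= b->gen_number) ? a : b;
}
```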
Add a reactor for the cp manager timer. In high-intensity IO tests,
cp timers were not being executed.
* Add index chunk selector
* Issue:771 Expose Fault Containment Service
flush a buffer only if it is dirtied in the current cp
set buf state clean after getting its down buf during cp flush
* UTs for simulating tombstone and GC

* Remove retry for now and leave it for future decision
1. Introduce multiple indexes so that homestore can actually have
different types of Index stores.

2. Introduce a new btree called the CopyOnWrite btree: unlike the inplace
btree, its pages are not written in place but to a different location,
with a map maintained between the two.

3. Make the public interfaces very concise (a BtreeBase, with the rest
moved into the implementation).

4. Simplified the btree APIs.

5. Used the latest sisl 13.x with REGISTER_LOG_MODS.

6. Added a cow btree crash test and updated other tests to ensure they pass.

7. Moved the existing btree implementation to the inplace btree.

8. Updated the build and dependency-build GitHub CI/CD pipelines.

9. Made replication an optional module.
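
A minimal sketch of the CopyOnWrite idea in item 2, with illustrative names; the actual btree keeps this mapping durably:

```cpp
#include <cstdint>
#include <unordered_map>

struct CowPageMap {
    std::unordered_map<uint64_t, uint64_t> logical_to_physical;

    // Instead of overwriting a page in place, write the new version elsewhere
    // and remap the logical page id; the old location stays intact until the
    // map update is durable. Returns the old location (0 if none yet).
    uint64_t write_page(uint64_t logical_id, uint64_t new_physical_loc) {
        uint64_t old = logical_to_physical[logical_id];  // 0 for a fresh page
        logical_to_physical[logical_id] = new_physical_loc;
        return old;  // caller can recycle the old location later
    }
};
```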
hkadayam force-pushed the clean-integration branch 2 times, most recently from 86fffae to a99ad16 on August 8, 2025 15:13
hkadayam force-pushed the clean-integration branch from a99ad16 to b15a6dc on August 8, 2025 15:17
hkadayam changed the title from "Clean integration" to "Merge with upstream and made replication as an optional support" on Aug 8, 2025
hkadayam merged commit aadde9f into master on Aug 8, 2025
8 checks passed
hkadayam deleted the clean-integration branch on February 4, 2026