
Conversation

hkadayam (Owner) commented Aug 8, 2025

No description provided.

sanebay and others added 30 commits September 25, 2024 15:17
When replacing a member, add the new member, sync the raft log
for the replace, and finally remove the old member. Once we add
the new member, a baseline or incremental resync will start.
Removing the old member causes nuraft_mesg to exit
the group, and we periodically GC the destroyed group.
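
A minimal sketch of that sequence, assuming hypothetical names (ReplDev, add_member, sync_raft_log_for_replace, remove_member) rather than the actual homestore API:

```cpp
#include <string>

// Illustrative stand-in; the real homestore/nuraft_mesg types differ.
struct ReplDev {
    void add_member(const std::string& id) {}     // kicks off baseline/incremental resync
    void sync_raft_log_for_replace() {}           // replicate the replace intent
    void remove_member(const std::string& id) {}  // nuraft_mesg exits the group;
                                                  // destroyed groups are GC'd periodically
};

// The ordering described above: add the new member first, sync the raft log,
// remove the old member last.
void replace_member(ReplDev& rd, const std::string& old_id, const std::string& new_id) {
    rd.add_member(new_id);
    rd.sync_raft_log_for_replace();
    rd.remove_member(old_id);
}
```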
Made the repl dev base test common so that both test files
can use it. Tests by default create a repl group with num_replicas.
Dynamic tests create additional spare replicas, which can be
added to the test dynamically by calling replace member.
Sealer is a special consumer that reports where the cp is up to.
It is the first one during cp switchover, serving as a conservative marker: everything
at or before this point should be in the current cp (some consumers may be above this point, which is fine).
The Sealer is also the last one during cp flush, running after all other services have flushed successfully.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
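
A minimal sketch of that ordering, assuming a hypothetical CPConsumer interface (the real homestore CP types differ):

```cpp
#include <cstdint>
#include <vector>

// Illustrative CP consumer interface; not the actual homestore API.
struct CPConsumer {
    virtual void switchover(uint64_t cp_id) = 0;  // the cp boundary is being sealed
    virtual void flush(uint64_t cp_id) = 0;       // persist everything in this cp
    virtual ~CPConsumer() = default;
};

// The Sealer switches over first (conservative "everything at or before this
// point is in the current cp" marker) and flushes last, after all other
// services have flushed successfully.
void run_cp_cycle(CPConsumer& sealer, std::vector<CPConsumer*>& others, uint64_t cp_id) {
    sealer.switchover(cp_id);                     // first during switchover
    for (auto* c : others) c->switchover(cp_id);
    for (auto* c : others) c->flush(cp_id);
    sealer.flush(cp_id);                          // last during flush
}
```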
Add cert watcher and restart raft service when cert is updated
The previous code could underflow io_size, i.e.

remaining_io_size -= sub_io_size;

where sub_io_size > remaining_io_size. Since remaining_io_size is
unsigned, it wraps around to a huge number and takes ages to finish.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
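
A minimal sketch of the fixed loop, with pick_sub_io_size and do_sub_io as hypothetical helpers:

```cpp
#include <algorithm>
#include <cstdint>

uint64_t pick_sub_io_size();   // hypothetical helper
void do_sub_io(uint64_t sz);   // hypothetical helper

void submit_io(uint64_t total_io_size) {
    uint64_t remaining_io_size = total_io_size;
    while (remaining_io_size > 0) {
        // Clamp before subtracting: if sub_io_size > remaining_io_size, the
        // unsigned subtraction would wrap to a huge value and loop for ages.
        uint64_t sub_io_size = std::min(pick_sub_io_size(), remaining_io_size);
        if (sub_io_size == 0) break;  // defensive: avoid spinning
        do_sub_io(sub_io_size);
        remaining_io_size -= sub_io_size;  // now safe
    }
}
```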
We see a no-space error in the write_to_full UT, likely because
when the space left == max_wrt_sz we take max_wrt_sz,
yet two extra blks are needed.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
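
A minimal sketch of the suspected fix, under the commit's own hypothesis; choose_write_size and the two-blk headroom are illustrative, not the actual homestore code:

```cpp
#include <algorithm>
#include <cstdint>

// When the space left equals max_wrt_sz, writing max_wrt_sz still fails
// because two extra blks are needed; reserve that headroom up front.
uint64_t choose_write_size(uint64_t space_left, uint64_t max_wrt_sz, uint64_t blk_size) {
    const uint64_t headroom = 2 * blk_size;  // the two extra blks
    if (space_left <= headroom) return 0;    // nothing safely writable
    return std::min(max_wrt_sz, space_left - headroom);
}
```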
Add replica member info with name, priority and id.
Use replica member info for the replace-member API and listener callbacks.
Signed-off-by: Jilong Kou <jkou@ebay.com>
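
An illustrative shape of the struct; the actual homestore definition and field types may differ:

```cpp
#include <cstdint>
#include <string>

// Illustrative only; the real struct likely uses a uuid type for id.
struct replica_member_info {
    std::string id;       // replica identity, shown as a string for brevity
    std::string name;     // human-readable member name
    int32_t priority{0};  // election priority
};

// The replace-member API and listener callbacks take this struct instead of a
// bare id, e.g. replace_member(member_out, member_in)  (hypothetical signature).
```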
Concurrent writes to m_down_buffers may cause data inconsistency.
Add a mutex lock to IndexBuffer and extract the add/remove
operations into member functions to make the vector thread-safe.

Signed-off-by: Jilong Kou <jkou@ebay.com>
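
A minimal sketch of that fix; the member names mirror the commit message but the code is illustrative, not the exact homestore implementation:

```cpp
#include <algorithm>
#include <memory>
#include <mutex>
#include <vector>

struct IndexBuffer {
    // All mutation of m_down_buffers goes through these two member functions,
    // each holding the mutex, so concurrent writers can no longer race.
    void add_down_buffer(const std::shared_ptr<IndexBuffer>& buf) {
        std::lock_guard<std::mutex> lg{m_down_buffers_mtx};
        m_down_buffers.push_back(buf);
    }

    void remove_down_buffer(const std::shared_ptr<IndexBuffer>& buf) {
        std::lock_guard<std::mutex> lg{m_down_buffers_mtx};
        m_down_buffers.erase(
            std::remove(m_down_buffers.begin(), m_down_buffers.end(), buf),
            m_down_buffers.end());
    }

private:
    std::mutex m_down_buffers_mtx;
    std::vector<std::shared_ptr<IndexBuffer>> m_down_buffers;
};
```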
* Implement GC_REPL_REQ Based on DSN to Prevent Resource Leaks

This commit introduces a mechanism to garbage collect (GC) replication requests
(rreqs) that may hang indefinitely, thereby consuming memory and disk resources
unnecessarily. These rreqs can enter a hanging state under several
circumstances, as outlined below:

1. Scenario with Delayed Commit:
   - Follower F1 receives LSN 100 and DSN 104 from Leader L1 and takes longer
     than the raft timeout to precommit/commit it.
   - L1 resends LSN 100, causing F1 to fetch the data again. Since LSN 100 was
     committed in a previous attempt, this log entry is skipped, leaving the
     rreq hanging indefinitely.

2. Scenario with Leader Failure Before Data Completion:
   - Follower F1 receives LSN 100 from L1, but before all data is fetched/pushed,
     L1 fails and L2 becomes the new leader.
   - L2 resends LSN 100 with L2 as the new originator. F1 proceeds with the new
     rreq and commits it, but the initial rreq from L1 hangs indefinitely as it
     cannot fetch data from the new leader L2.

3. Scenario with Leader Failure After Data Write:
   - Follower F1 receives data (DSN 104) from L1 and writes it. Before the log of
     LSN 100 reaches F1, L1 fails and L2 becomes the new leader.
   - L2 resends LSN 100 to F1, and F1 fetches DSN 104 from L2, leaving the
     original rreq hanging.

This garbage collection process cleans up based on DSN. Any rreqs in
`m_repl_key_req_map`, whose DSN is already committed (`rreq->dsn <
repl_dev->m_next_dsn`), will be GC'd. This is safe on the follower side, as the
follower updates `m_next_dsn` during commit. Any DSN below `cur_dsn` should
already be committed, implying that the rreq should already be removed from
`m_repl_key_req_map`.

On the leader side, since `m_next_dsn` is updated when sending out the proposal,
it is not safe to clean up based on `m_next_dsn`. Therefore, we explicitly skip
the leader in this GC process.
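
A minimal sketch of that GC pass, with illustrative stand-ins for the repl dev internals:

```cpp
#include <cstdint>
#include <map>
#include <memory>

struct repl_req { uint64_t dsn{0}; /* ... */ };  // illustrative stand-in

struct RaftReplDevSketch {
    bool m_leader{false};
    uint64_t m_next_dsn{0};  // follower side: advanced during commit
    std::map<uint64_t, std::shared_ptr<repl_req>> m_repl_key_req_map;  // keyed by dsn here

    bool is_leader() const { return m_leader; }

    void gc_repl_reqs() {
        // The leader bumps m_next_dsn when sending the proposal, so DSN-based
        // cleanup is only safe on followers; skip the leader explicitly.
        if (is_leader()) return;
        for (auto it = m_repl_key_req_map.begin(); it != m_repl_key_req_map.end();) {
            // dsn < m_next_dsn means the DSN is already committed; a lingering
            // entry is one of the hanging rreqs described above.
            if (it->first < m_next_dsn) it = m_repl_key_req_map.erase(it);
            else ++it;
        }
    }
};
```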



Skip localizing raft logs we have already committed.

The leader may send duplicate raft logs; if we localize them
unconditionally, duplicate data will be written to the chunk during
fetch_data.

It is safe for us to skip logs that are already committed:
there is no way those LSNs can be overwritten.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
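
A minimal sketch of that guard; names are illustrative:

```cpp
#include <cstdint>

// Localizing a log triggers fetch_data, so a duplicate log the leader resends
// for an already-committed LSN must be skipped, or duplicate data lands in
// the chunk.
bool should_localize(int64_t log_lsn, int64_t committed_lsn) {
    // Committed LSNs can never be overwritten, so skipping them is safe.
    return log_lsn > committed_lsn;
}
```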
Data buffer persists in memory until rreq is committed or rolled back.

This approach poses issues during recovery. As new data arrives via
push_data and is written to disk, it remains in memory for an extended
period until the replica catches up and commits the rreq.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
* add rollback on state machine

---------

Signed-off-by: yawzhang <yawzhang@ebay.com>
* PushData only pushes to active followers.

If a follower is lagging too far behind, do not flood it with data
from new IOs (new rreqs, new LSNs); reserve that capacity
for catching up. The follower can request the data via FetchData.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
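
A minimal sketch of that gate; the threshold and names are assumptions, not actual homestore tunables:

```cpp
#include <cstdint>

constexpr int64_t k_max_push_lag = 1000;  // assumed tunable, not a real knob

// Push data for a new rreq only to followers that are reasonably caught up.
// A follower lagging beyond the threshold is left to pull via FetchData, so
// new IOs do not crowd out its catch-up traffic.
bool should_push_data(int64_t leader_lsn, int64_t follower_lsn) {
    return (leader_lsn - follower_lsn) <= k_max_push_lag;
}
```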
When a follower hits an error before appending log entries, it sets batch_size_hint_in_bytes to -1 to ask the leader not to send more log entries in the next append_log_req.
https://github.com/eBay/NuRaft/blob/eabdeeda538a27370943f79a2b08b5738b697ac3/src/handle_append_entries.cxx#L760

In the nuobject case, if a new member is added to a raft group and it tries to append a create_shard log entry, it will try to allocate a block from the chunks of the pg before the create_pg log, which allocates chunks to this pg, is committed. An error occurs, the log batch containing the create_shard entry is rejected wholesale, and batch_size_hint_in_bytes is set to -1 in the response to the leader.

This PR sets the log count of the next batch sent to the follower to 1, so that:

If create_pg and create_shard are in the same log batch, the follower first rejects the batch and the leader sends only create_pg in the next one, which the follower accepts since it only has to create the pg.

If create_pg and create_shard are not in the same log batch, and create_shard tries to allocate a block before its pg has been created (i.e. before the chunks of the pg are allocated), then with this PR the follower rejects the batch, giving pg creation more time. The create_shard log is resent in the next batch, by which point the pg has probably already been created successfully.
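
A minimal sketch of the leader-side reaction, with illustrative names (the -1 hint convention is NuRaft's; the count-of-1 override approximates this PR's behavior):

```cpp
#include <cstdint>

// A hint of -1 from the follower means "stop sending more entries"; shrink
// the next batch to a single log entry so a dependency like create_pg can
// commit before create_shard is retried.
int32_t next_batch_log_count(int64_t follower_hint_bytes, int32_t normal_count) {
    return (follower_hint_bytes < 0) ? 1 : normal_count;
}
```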
We don't need to panic in this case; FetchData can handle it.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
Add application_hint to the blk_alloc_hints structure. This change addresses the need for certain users of homestore, such as homeobject, to pass additional hints. The application_hint can be used to specify behavior in the select_chunk interface.
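
An illustrative shape of the change; field names besides application_hint are placeholders:

```cpp
#include <cstdint>
#include <optional>

// blk_alloc_hints gains an application_hint that an upper layer such as
// homeobject can set and a custom chunk selector can read in select_chunk().
struct blk_alloc_hints {
    uint32_t desired_temp{0};                  // placeholder for existing hints
    std::optional<uint64_t> application_hint;  // opaque value owned by the application
};
```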
raakella1 and others added 23 commits July 11, 2025 12:41
* fix for put remove crash test bug

* Fix repair_links during crash recovery

* update root when the current root splits during repair_links

* add more trace logging for btree recovery

* call repair_links on the buffer which is pruned due to zero down buffers

* upgrade conan version

* fix an issue in the prune buffer code

* Relax the sanity check condition about the child key and previous parent key comparison

* add more comments to the code

---------

Co-authored-by: Ravi Nagarjun Akella <raakella1@$HOSTNAME>
In the disk replacement scenario, we replace a bad device with a brand-new device, which needs to be formatted.
Formatting has three logical parts:

1. first block: this can be recovered from other existing devices.
2. pdev info:
   2.1 pdev header
   2.2 format chunk slots
3. vdev info: vdev info can be recovered from other existing devices, and the missing chunks can be inferred. Add the new pdev into the vdev as needed.

Note:
- The pdev_id is monotonically increasing.
- The chunk_id might be reused, so a custom_chunk_selector should pay close attention to it.

Expose pdev_name in VChunk so users can get the logical-entity-to-physical-device map, which is helpful for admins and operators.
Record cur_pdev_id in the first block header.
gen_number in the first block header tracks changes to the first block attributes. Compare gen_number to identify the newest first block header in load_devices(), and increase gen_number every time an attribute (like cur_pdev_id) changes.
gen_number conflicts might arise due to interruption during sequential commit_formatting, but they can be identified and corrected to the latest one on the next startup.
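
A minimal sketch of the gen_number comparison, with an illustrative field layout:

```cpp
#include <cstdint>

struct first_block_header {
    uint64_t gen_number;   // bumped on every attribute change (e.g. cur_pdev_id)
    uint32_t cur_pdev_id;
    // ... other attributes
};

// During load_devices(), the copy with the highest gen_number wins; after an
// interrupted commit_formatting, the stale copies are corrected on the next
// startup.
const first_block_header* pick_latest(const first_block_header* a,
                                      const first_block_header* b) {
    return (a->gen_number >= b->gen_number) ? a : b;
}
```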
Add a reactor for the cp manager timer. In high-intensity IO tests,
cp timers were not being executed.
* Add index chunk selector
* Issue:771 Expose Fault Containment Service
flush a buffer only if it is dirtied in the current cp
set buf state clean after getting its down buf during cp flush
* UTs for simulating tombstone and GC

* Remove retry for now and leave it for future decision
1. Introduce multiple indexes so that homestore can actually have
different types of Index stores.

2. Introduce a new btree called the CopyOnWrite btree: unlike the inplace
btree, its pages are not written in place but to a different location,
with a map maintained between the two.

3. Make the public interfaces very concise (a BtreeBase, with the rest
moved into the implementation).

4. Simplified the btree APIs.

5. Used the latest sisl 13.x with REGISTER_LOG_MODS.

6. Added a cow btree crash test and updated other tests to ensure they pass.

7. Moved the existing btree implementation to the inplace btree.

8. Updated the build and dependency-build GitHub CI/CD pipelines.

9. Made replication an optional module.
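
A minimal sketch of the CopyOnWrite idea in item 2, with illustrative names; the actual btree keeps this mapping durably:

```cpp
#include <cstdint>
#include <unordered_map>

struct CowPageMap {
    std::unordered_map<uint64_t, uint64_t> logical_to_physical;

    // Instead of overwriting a page in place, write the new version elsewhere
    // and remap the logical page id; the old location stays intact until the
    // map update is durable. Returns the old location (0 if none yet).
    uint64_t write_page(uint64_t logical_id, uint64_t new_physical_loc) {
        uint64_t old = logical_to_physical[logical_id];  // 0 for a fresh page
        logical_to_physical[logical_id] = new_physical_loc;
        return old;  // caller can recycle the old location later
    }
};
```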
hkadayam force-pushed the clean-integration branch 2 times, most recently from 86fffae to a99ad16 on August 8, 2025 15:13
hkadayam force-pushed the clean-integration branch from a99ad16 to b15a6dc on August 8, 2025 15:17
hkadayam changed the title from "Clean integration" to "Merge with upstream and made replication as an optional support" on Aug 8, 2025
hkadayam merged commit aadde9f into master on Aug 8, 2025
8 checks passed
hkadayam deleted the clean-integration branch on February 4, 2026