feature: add sync by hash #799
Implemented directory synchronization using the hash strategy
When using this tool, I encountered the following problem:
There are two build machines: `b1` and `b2`. Each machine has its own local build cache, and both machines upload data to the same MinIO/S3 bucket.

1. During the build on machine `b1`, files are generated with an `mtime` of 13:00. These files are uploaded to an empty bucket. The synchronization completes successfully, and the directory and bucket are in a consistent state. We are particularly interested in the file `test1.txt`: it is 6 B in size and was created at 13:00.
2. Some time passes, and a new build is triggered on machine `b2` from a new branch. The files are new and have an `mtime` of 14:00. Some of them are identical to the step-1 files in name and size, while others differ; however, all of them are uploaded since they have a newer `mtime`. The file `test1.txt` is again 6 B but has different content. It was uploaded because it was created after 13:00.
3. More time passes, and another build is triggered on `b1`, similar to the one in step 2. Instead of generating files, we attempt to synchronize the cache. The file `test1.txt` is identical to the one from step 2, but it will not be uploaded: its size is the same as the object already in the bucket, and its `mtime` of 13:00 is not newer. As a result, we end up in an inconsistent state and fail to upload an important file.

To avoid modifying our build workflow or introducing workarounds, I decided to improve `s5cmd` by adding hash-based synchronization. Files are now compared by size and hash: if the hash or the file size of `src` and `dst` differ, the file is updated. This simple rule solves our issue. S3 supports `ETag`, which stores the MD5 hash, and computing the MD5 checksum of a local file is easy.
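As a rough illustration of that rule (a minimal sketch, not the code in this PR; `md5OfFile` and `needsSync` are hypothetical names), comparing a local file against a remote object's size and `ETag` could look like this. Note that an S3 `ETag` equals the content's MD5 only for single-part, unencrypted uploads.

```go
// Minimal sketch of the size-and-hash rule (hypothetical helper names,
// not the code from this PR).
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// md5OfFile streams the file through MD5 so large files are not read
// into memory at once.
func md5OfFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// needsSync reports whether the local file should be uploaded, given the
// remote object's size and ETag. S3 returns ETags wrapped in quotes, so
// they are stripped before comparing.
func needsSync(localPath string, remoteSize int64, remoteETag string) (bool, error) {
	st, err := os.Stat(localPath)
	if err != nil {
		return false, err
	}
	// Different size: sync without hashing anything.
	if st.Size() != remoteSize {
		return true, nil
	}
	// Same size: fall back to comparing checksums.
	sum, err := md5OfFile(localPath)
	if err != nil {
		return false, err
	}
	return sum != strings.Trim(remoteETag, `"`), nil
}

func main() {
	// Placeholder ETag (the MD5 of an empty string), purely illustrative.
	upload, err := needsSync("test1.txt", 6, `"d41d8cd98f00b204e9800998ecf8427e"`)
	if err != nil {
		panic(err)
	}
	fmt.Println("upload needed:", upload)
}
```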
What has been implemented?
- Added the `HashStrategy`. The principle is simple: if the size or the hash of `src` and `dst` differ, the file is synced.
- Added the `--hash-only` flag, which works only with the `sync` operation and enables the `HashStrategy`.
- Changed the mechanism for spawning goroutines that check whether synchronization is needed. Previously, a single goroutine was used; that approach was acceptable but degraded performance with a large number of files. Now we read the `numworkers` parameter and create as many goroutines as its value (see the sketch after this list).
- Stored the `ETag` value when reading files from remote or local storage objects.
- Added tests for hash-based synchronization.
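To make the goroutine change concrete, here is a minimal sketch of that fan-out (identifiers such as `object` and `shouldSync` are illustrative, not the PR's actual code): `numWorkers` goroutines consume candidate pairs from a channel instead of a single goroutine doing all the checks.

```go
// Illustrative fan-out for the sync check (hypothetical names, not the
// PR's actual code): numWorkers goroutines pull source/destination pairs
// from a channel and emit the objects that need uploading.
package main

import (
	"fmt"
	"sync"
)

type object struct {
	key  string
	size int64
	etag string
}

// shouldSync stands in for the strategy check (size plus hash here).
func shouldSync(src, dst object) bool {
	return src.size != dst.size || src.etag != dst.etag
}

func main() {
	const numWorkers = 4 // would come from the numworkers parameter

	pairs := make(chan [2]object)
	toUpload := make(chan object)

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range pairs {
				if shouldSync(p[0], p[1]) {
					toUpload <- p[0]
				}
			}
		}()
	}

	// Close the result channel once every worker has finished.
	go func() {
		wg.Wait()
		close(toUpload)
	}()

	// Feed the workers with a couple of example pairs.
	go func() {
		pairs <- [2]object{{"test1.txt", 6, "aaa"}, {"test1.txt", 6, "bbb"}}
		pairs <- [2]object{{"same.txt", 3, "ccc"}, {"same.txt", 3, "ccc"}}
		close(pairs)
	}()

	for o := range toUpload {
		fmt.Println("upload:", o.key)
	}
}
```

With this change, hash-based syncing would be enabled with something like `s5cmd sync --hash-only <src> s3://bucket/prefix/`, using the flag added above.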
Important Considerations
- I used the `numWorkers` parameter, but I'm unsure if this is the right approach; perhaps the `parallel` object would fit better. Another option is to restrict this behavior to `sync` (with the `--hash-only` flag) and modify the goroutine creation process to use `parallel`.
- The `HashStrategy` relies not only on hash comparison but also on file size verification, so a name like `SizeAndHashStrategy` might be more appropriate.
- The hash could be computed during the initial file read (`fs.go`); see the sketch after this list.
- I did not run performance tests using `bench.py`; I only tested by running `s5cmd` built from my fork and uploading files. `bench.py` does not include tests for `sync`, and I don't have the resources to test with a large file (300 GB).
- I ran `make check` and found no warnings or errors.
- I am not a Go developer and am unfamiliar with common Go patterns and best practices, so it would be great if someone who knows `s5cmd` could review the implementation and highlight potential issues.
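For the "compute the hash during the initial read" idea, a minimal sketch (my assumption of how it could work, not the PR's code) would pipe the bytes through the hasher with `io.TeeReader` while the file is being consumed anyway:

```go
// Sketch of computing the MD5 during the initial read (illustrative
// only): TeeReader forwards every byte read from the file into the
// hasher, so the upload path and the checksum share a single pass.
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("test1.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	h := md5.New()
	r := io.TeeReader(f, h)

	n, err := io.Copy(io.Discard, r) // stand-in for the actual upload
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %d bytes, md5=%s\n", n, hex.EncodeToString(h.Sum(nil)))
}
```

This would keep the checksum to a single pass over the file instead of reading it twice.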