
@artsiom-bondarau

Implemented directory synchronization using the hash strategy

When using this tool, I encountered the following problem:

  1. There are two build machines: b1 and b2. Each machine has its own local build cache. Both machines upload data to the same MinIO/S3 bucket.

  2. During the build on machine b1, files are generated with an mtime of 13:00. These files are uploaded to an empty bucket. The synchronization completes successfully, and the directory and bucket remain in a consistent state.

    We are particularly interested in the file test1.txt: it is 6 bytes in size and has an mtime of 13:00.

  3. Some time passes, and a new build is triggered on machine b2 from a new branch. The files are new and have an mtime of 14:00. Some files are identical in terms of size and name, while others are different. However, all files are uploaded since they have a new mtime.

    The file test1.txt is again 6 bytes but has different content. It is uploaded because its mtime (14:00) is newer than 13:00.

  4. More time passes, and another build is triggered on b1, similar to the one in step 2. Instead of regenerating the files, we attempt to synchronize from the cache.

    The file test1.txt is identical to the one from step 2, but it will not be uploaded: its size matches the object uploaded in step 3, and its mtime of 13:00 is not newer than the 14:00 mtime of that object. As a result, we end up in an inconsistent state and fail to upload an important file.

To avoid modifying our build workflow or introducing workarounds, I decided to improve s5cmd by adding hash-based synchronization. Files are now compared by size and hash: if either the size or the hash differs between src and dst, the file is updated. This simple rule solves our issue.

S3 exposes an ETag for each object, which for single-part uploads is the MD5 hash of the content, and computing an MD5 checksum for a local file is cheap.


What has been implemented?

  1. Added the HashStrategy. The principle is simple:

    • If file sizes differ, the file is copied.
    • If sizes are the same but the hash (ETag/local MD5) differs, the file is copied.
    • If both size and MD5 checksum are the same, no upload occurs.
  2. Added the --hash-only flag, which works only with the sync operation and enables the HashStrategy.

  3. Changed the mechanism for spawning the goroutines that check whether synchronization is needed. Previously a single goroutine was used; that was acceptable but degraded performance with a large number of files.

    • Now, I use the global numworkers parameter and create as many goroutines as this value.
  4. Stored the ETag value when reading objects from remote or local storage.

  5. Added tests for hash-based synchronization.


Important Considerations

  1. I used the numWorkers parameter, but I’m unsure if this is the right approach.

    • This parameter is global and, based on its description, is used for other operations.
    • Additionally, it is used with the parallel object.
    • Perhaps it would be better to introduce a separate parameter for sync (with the --hash-only flag) and modify the goroutine creation process to use parallel.
  2. The HashStrategy relies not only on hash comparison but also on file size verification.

    • This might be misleading, so renaming it to SizeAndHashStrategy might be more appropriate.
  3. The hash could be computed during the initial file read (fs.go).

    • However, this might increase execution time for other commands without any practical benefit.
  4. I did not run performance tests using bench.py.

    • I tested performance by building s5cmd from my fork and uploading files.
    • bench.py does not include tests for sync, and I don’t have the resources to test with a large file (300GB).
    • If performance testing is critical, I can try modifying the benchmark and testing with a limited file set (10GB).
  5. I ran make check and found no warnings or errors.

  6. I am not a Go developer and am unfamiliar with common Go patterns and best practices.

    • I focused on solving my problem and then sharing my solution with others.
    • I would appreciate it if someone with more Go experience and knowledge of s5cmd could review the implementation and highlight potential issues.

@artsiom-bondarau artsiom-bondarau requested a review from a team as a code owner March 25, 2025 14:38
@artsiom-bondarau artsiom-bondarau requested review from denizsurmeli and igungor and removed request for a team March 25, 2025 14:38
@kulmam92

kulmam92 commented Apr 8, 2025

When will this be merged?

@artsiom-bondarau
Author

artsiom-bondarau commented Apr 16, 2025

When will this be merged?

Hi!

I'm not sure I can influence how quickly this feature is merged; the PR needs to be reviewed by the maintainers. I hope the feature will land in the main repository as soon as possible. For now, we are using our own build from the fork.

@SergeyLVG

@igungor @denizsurmeli
Hey folks — just a gentle ping on PR #799. Looks like a solid improvement for sync behavior, might be worth a look when you get a chance. Cheers!

@rmanus

rmanus commented Sep 9, 2025

Your proposal will most likely not work for large files transferred via multipart upload, because the ETag of a multipart object is not a plain MD5 of the content. As discussed here:
aws/aws-sdk-java-v2#4705
