feature: add sync by hash #799
Implemented directory synchronization using the hash strategy
When using this tool, I encountered the following problem:
There are two build machines: `b1` and `b2`. Each machine has its own local build cache, and both machines upload data to the same MinIO/S3 bucket.

1. During the build on machine `b1`, files are generated with an `mtime` of 13:00. These files are uploaded to an empty bucket. The synchronization completes successfully, and the directory and bucket are in a consistent state. We are particularly interested in the file `test1.txt`: it is 6 B in size and was created at 13:00.
2. Some time passes, and a new build is triggered on machine `b2` from a new branch. The files are new and have an `mtime` of 14:00. Some of them are identical to the step-1 files in name and size, while others differ; however, all of them are uploaded since they have a newer `mtime`. The file `test1.txt` is again 6 B but has different content. It was uploaded because it was created after 13:00.
3. More time passes, and another build is triggered on `b1`, similar to the one in step 2. Instead of generating files, we attempt to synchronize the cache. The file `test1.txt` is identical to the one from step 2, but it will not be uploaded: its size is the same as the object already in the bucket, and its `mtime` of 13:00 is not newer. As a result, we end up in an inconsistent state and fail to upload an important file.

To avoid modifying our build workflow or introducing workarounds, I decided to improve `s5cmd` by adding hash-based synchronization. Files are now compared by size and hash: if the hash or the file size of `src` and `dst` differ, the file is updated. This simple rule solves our issue. S3 supports `ETag`, which stores the MD5 hash, and computing the MD5 checksum of a local file is easy.
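As a rough illustration of that rule (a minimal sketch, not the code in this PR; `md5OfFile` and `needsSync` are hypothetical names), comparing a local file against a remote object's size and `ETag` could look like this. Note that an S3 `ETag` equals the content's MD5 only for single-part, unencrypted uploads.

```go
// Minimal sketch of the size-and-hash rule (hypothetical helper names,
// not the code from this PR).
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// md5OfFile streams the file through MD5 so large files are not read
// into memory at once.
func md5OfFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// needsSync reports whether the local file should be uploaded, given the
// remote object's size and ETag. S3 returns ETags wrapped in quotes, so
// they are stripped before comparing.
func needsSync(localPath string, remoteSize int64, remoteETag string) (bool, error) {
	st, err := os.Stat(localPath)
	if err != nil {
		return false, err
	}
	// Different size: sync without hashing anything.
	if st.Size() != remoteSize {
		return true, nil
	}
	// Same size: fall back to comparing checksums.
	sum, err := md5OfFile(localPath)
	if err != nil {
		return false, err
	}
	return sum != strings.Trim(remoteETag, `"`), nil
}

func main() {
	// Placeholder ETag (the MD5 of an empty string), purely illustrative.
	upload, err := needsSync("test1.txt", 6, `"d41d8cd98f00b204e9800998ecf8427e"`)
	if err != nil {
		panic(err)
	}
	fmt.Println("upload needed:", upload)
}
```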
What has been implemented?
- Added the `HashStrategy`. The principle is simple: if the size or the hash of `src` and `dst` differ, the file is synced.
- Added the `--hash-only` flag, which works only with the `sync` operation and enables the `HashStrategy`.
- Changed the mechanism for spawning goroutines that check whether synchronization is needed. Previously, a single goroutine was used; that approach was acceptable but degraded performance with a large number of files. Now we read the `numworkers` parameter and create as many goroutines as its value (see the sketch after this list).
- Stored the `ETag` value when reading files from remote or local storage objects.
- Added tests for hash-based synchronization.
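To make the goroutine change concrete, here is a minimal sketch of that fan-out (identifiers such as `object` and `shouldSync` are illustrative, not the PR's actual code): `numWorkers` goroutines consume candidate pairs from a channel instead of a single goroutine doing all the checks.

```go
// Illustrative fan-out for the sync check (hypothetical names, not the
// PR's actual code): numWorkers goroutines pull source/destination pairs
// from a channel and emit the objects that need uploading.
package main

import (
	"fmt"
	"sync"
)

type object struct {
	key  string
	size int64
	etag string
}

// shouldSync stands in for the strategy check (size plus hash here).
func shouldSync(src, dst object) bool {
	return src.size != dst.size || src.etag != dst.etag
}

func main() {
	const numWorkers = 4 // would come from the numworkers parameter

	pairs := make(chan [2]object)
	toUpload := make(chan object)

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range pairs {
				if shouldSync(p[0], p[1]) {
					toUpload <- p[0]
				}
			}
		}()
	}

	// Close the result channel once every worker has finished.
	go func() {
		wg.Wait()
		close(toUpload)
	}()

	// Feed the workers with a couple of example pairs.
	go func() {
		pairs <- [2]object{{"test1.txt", 6, "aaa"}, {"test1.txt", 6, "bbb"}}
		pairs <- [2]object{{"same.txt", 3, "ccc"}, {"same.txt", 3, "ccc"}}
		close(pairs)
	}()

	for o := range toUpload {
		fmt.Println("upload:", o.key)
	}
}
```

With this change, hash-based syncing would be enabled with something like `s5cmd sync --hash-only <src> s3://bucket/prefix/`, using the flag added above.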
Important Considerations
- I used the `numWorkers` parameter, but I'm unsure if this is the right approach; perhaps the `parallel` object would fit better. Another option is to restrict this behavior to `sync` (with the `--hash-only` flag) and modify the goroutine creation process to use `parallel`.
- The `HashStrategy` relies not only on hash comparison but also on file size verification, so a name like `SizeAndHashStrategy` might be more appropriate.
- The hash could be computed during the initial file read (`fs.go`); see the sketch after this list.
- I did not run performance tests using `bench.py`; I only tested by running `s5cmd` built from my fork and uploading files. `bench.py` does not include tests for `sync`, and I don't have the resources to test with a large file (300 GB).
- I ran `make check` and found no warnings or errors.
- I am not a Go developer and am unfamiliar with common Go patterns and best practices, so it would be great if someone who knows `s5cmd` could review the implementation and highlight potential issues.
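For the "compute the hash during the initial read" idea, a minimal sketch (my assumption of how it could work, not the PR's code) would pipe the bytes through the hasher with `io.TeeReader` while the file is being consumed anyway:

```go
// Sketch of computing the MD5 during the initial read (illustrative
// only): TeeReader forwards every byte read from the file into the
// hasher, so the upload path and the checksum share a single pass.
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("test1.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	h := md5.New()
	r := io.TeeReader(f, h)

	n, err := io.Copy(io.Discard, r) // stand-in for the actual upload
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %d bytes, md5=%s\n", n, hex.EncodeToString(h.Sum(nil)))
}
```

This would keep the checksum to a single pass over the file instead of reading it twice.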