-
Notifications
You must be signed in to change notification settings - Fork 8
Use varint length field for last_path encoding to support longer GCP object names #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use varint length field for last_path encoding to support longer GCP object names #72
Conversation
|
Hello, @r-uehara0219 Thank you for proposing this PR. We will check this PR. Please wait for a while. By the way, We are Looking for long-term maintainers around the Embulk eco-system. If you (and your team) are willing to raise your hand for the community on the total work of GCS plugins, you're welcome. |
dmikurube
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution, @r-uehara0219 !
This part seems to come from #47, which was to support Japanese characters, but the 128 limitation was a kind of tentative/ad-hoc one.
This pull-request fixes the ad-hoc one. It basically looks good to me! Then approved, but left a couple of minor comments. Could you address them?
|
|
||
| int utf8EncodeLength = utf8.length; | ||
| if (utf8EncodeLength >= 128) { | ||
| if (utf8EncodeLength >= 65_535) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's understandable that this base64Encode method can handle 65536 bytes at most by itself (regardless of the limitation for task.getLastPath()), but it is a little bit confusing for code readers that it has a different limitation from task.getLastPath().
Can you align this with task.getLastPath() and leave a comment to explain why limited by that number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aligned byte limit to 1024 bytes and added a comment.
db7e7b9
| // see: https://protobuf.dev/programming-guides/encoding/#varints | ||
| private static byte[] encodeVarint(int value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that this method is indirectly tested via base64Encode, but for sure and for readability, can you add some more tests directly for this encodeVarint method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests were added. I use reflect to test private methods, but please point out if this should not be used as repository policy.
9f691e6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your consideration! I think that the method can be just package-private (no visibility modifier) so that the method is visible from test classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed to package-private. Thanks for the comment!
dea553b
Co-authored-by: Dai MIKURUBE <dmikurube@users.noreply.github.com>
…/embulk-input-gcs into feature/varint-length-field
|
Hmm, the tests are failing on GitHub Actions. Some of them might be for different reasons (ex. GCP credentials?), but at least |
|
Sorry, only some tests were run in my local environment and the test results were not properly verified. Please wait while I fix the test. |
The test EXPECTED was incorrect and has been corrected. |
|
Thanks for working on the fix! Confirmed that |
|
Most of the errors seem to occur when reading the P12 key in AuthUtils.java:98. @dmikurube |
|
They're updated 3 years ago, but #74 passed just a couple of minutes ago... Hmm, could be a kind of permission issues by external contributors? Let me see. |
|
Yeah, that's the cause. We should have used the GitHub https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/ We've already set this repo not to run any Workflow against a fork pull-req automatically without approval, then using Okay, let me merge this pull-req now, and let's see the Thanks for your contribution! |

fix #71
Overview
This PR updates the last_path encoding mechanism in embulk-input-gcs. Instead of using a fixed 1-byte length field (limiting paths to 127 bytes), the new implementation uses a varint length field, which can span 1–2 bytes, to represent the length of the UTF-8 encoded string.
Why Is It Necessary?
GCP object names can be up to 1024 bytes in length. The previous implementation’s 1-byte length field restricted last_path to 128 characters, causing errors when handling longer object names. Switching to a varint length field removes this limitation and ensures that the plugin can support all valid GCP object names.