Skip to content

Conversation

@r-uehara0219
Copy link
Contributor

fix #71

Overview

This PR updates the last_path encoding mechanism in embulk-input-gcs. Instead of using a fixed 1-byte length field (limiting paths to 127 bytes), the new implementation uses a varint length field, which can span 1–2 bytes, to represent the length of the UTF-8 encoded string.

Why Is It Necessary?

GCP object names can be up to 1024 bytes in length. The previous implementation’s 1-byte length field restricted last_path to 128 characters, causing errors when handling longer object names. Switching to a varint length field removes this limitation and ensures that the plugin can support all valid GCP object names.

@r-uehara0219 r-uehara0219 requested a review from a team as a code owner March 7, 2025 08:21
@hiroyuki-sato
Copy link
Member

Hello, @r-uehara0219 Thank you for proposing this PR. We will check this PR. Please wait for a while.

By the way, We are Looking for long-term maintainers around the Embulk eco-system.

If you (and your team) are willing to raise your hand for the community on the total work of GCS plugins, you're welcome.

@dmikurube dmikurube added this to the v0.5.1 milestone Mar 13, 2025
Copy link
Member

@dmikurube dmikurube left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution, @r-uehara0219 !

This part seems to come from #47, which was to support Japanese characters, but the 128 limitation was a kind of tentative/ad-hoc one.

This pull-request fixes the ad-hoc one. It basically looks good to me! Then approved, but left a couple of minor comments. Could you address them?


int utf8EncodeLength = utf8.length;
if (utf8EncodeLength >= 128) {
if (utf8EncodeLength >= 65_535) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's understandable that this base64Encode method can handle 65536 bytes at most by itself (regardless of the limitation for task.getLastPath()), but it is a little bit confusing for code readers that it has a different limitation from task.getLastPath().

Can you align this with task.getLastPath() and leave a comment to explain why limited by that number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligned byte limit to 1024 bytes and added a comment.
db7e7b9

Comment on lines 116 to 117
// see: https://protobuf.dev/programming-guides/encoding/#varints
private static byte[] encodeVarint(int value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that this method is indirectly tested via base64Encode, but for sure and for readability, can you add some more tests directly for this encodeVarint method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests were added. I use reflect to test private methods, but please point out if this should not be used as repository policy.
9f691e6

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your consideration! I think that the method can be just package-private (no visibility modifier) so that the method is visible from test classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to package-private. Thanks for the comment!
dea553b

@dmikurube
Copy link
Member

Hmm, the tests are failing on GitHub Actions. Some of them might be for different reasons (ex. GCP credentials?), but at least testBase64 is failing for the new case. Can you take a look?

@r-uehara0219
Copy link
Contributor Author

r-uehara0219 commented Mar 14, 2025

Sorry, only some tests were run in my local environment and the test results were not properly verified. Please wait while I fix the test.
(readme requires GCP_P12_KEYFILE to be specified, but what is actually needed was GCP_PRIVATE_KEYFILE😕 I will revise the readme and commit later.)

@r-uehara0219
Copy link
Contributor Author

r-uehara0219 commented Mar 14, 2025

Hmm, the tests are failing on GitHub Actions. Some of them might be for different reasons (ex. GCP credentials?), but at least testBase64 is failing for the new case. Can you take a look?

The test EXPECTED was incorrect and has been corrected.
df160ff

@dmikurube
Copy link
Member

Thanks for working on the fix!

Confirmed that testBase64 is fixed, but other failures are still there. I at first wondered if it could be a kind of expiration in access tokens (to GCP), but #73 was working good just yesterday. Then I guess that the cause of the failure is in this pull-request... Could you take a deeper look into it?

@r-uehara0219
Copy link
Contributor Author

Most of the errors seem to occur when reading the P12 key in AuthUtils.java:98.

@dmikurube
Can you tell me when the GHA environment variable: GCP_PRIVATE_KEYFILE was last updated?

@dmikurube
Copy link
Member

They're updated 3 years ago, but #74 passed just a couple of minutes ago... Hmm, could be a kind of permission issues by external contributors? Let me see.

Screenshot from 2025-03-14 17-23-27

@dmikurube
Copy link
Member

Yeah, that's the cause. We should have used the GitHub pull_request_target event, instead of pull_request event, but it also had a security concern.

https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/

We've already set this repo not to run any Workflow against a fork pull-req automatically without approval, then using pull_request_target should be fine.

Okay, let me merge this pull-req now, and let's see the master branch. On the other hand, we'll change the GitHub Actions event in another pull-req.

Thanks for your contribution!

@dmikurube dmikurube merged commit 888f9f4 into embulk:master Mar 14, 2025
0 of 2 checks passed
@dmikurube
Copy link
Member

Passed. :) https://github.com/embulk/embulk-input-gcs/actions/runs/13852924626/job/38763460455

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Support longer GCS object names for last_path encoding

3 participants