
Conversation

@LuisSanchez25 (Collaborator) commented Oct 2, 2024

What is the problem / what does the code in this PR do

Adds S3 functionality to strax

Can you briefly describe how it works?

We can now use an S3 storage system to store and load data.

Can you give a minimal working example (or illustrate with a figure)?

import strax
st = strax.Context()
s3_storage = strax.S3Frontend()
st.storage = [s3_storage]
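For context, the new frontend could sit alongside the usual local storage; a minimal sketch, assuming S3Frontend takes no required arguments (its bucket/credential configuration is not shown in this PR):

import strax

st = strax.Context()
# Hypothetical combination: local directory first, S3 as an additional frontend
st.storage = [strax.DataDirectory("./strax_data"), strax.S3Frontend()]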

Please include the following if applicable:

  • Update the docstring(s)
  • Update the documentation
  • Tests to check the (new) code is working as desired.
  • Does it solve one of the open issues on github?

Please make sure that all automated tests have passed before asking for a review (you can save the PR as a draft otherwise).

@dachengx (Collaborator) commented Oct 2, 2024

Thanks, @LuisSanchez25. Would it be more appropriate to put these new StorageFrontend and StorageBackend classes in straxen, like RucioRemoteFrontend (https://github.com/XENONnT/straxen/blob/a412b4382eed3277fd599c105d69855df38bc7f8/straxen/storage/rucio_remote.py#L29)?

@dachengx (Collaborator) commented Oct 2, 2024

And I generally think this is a super good idea! We should benefit from the market resources!

@LuisSanchez25 (Collaborator, Author) commented:

Hey @dachengx, I am a bit unfamiliar with why the decision was made to put the Rucio frontend in straxen rather than in strax; that feels like a tool that would be beneficial for others outside of XENON, right? So to me it might make more sense to move the Rucio storage to strax, but maybe I am just missing something.

Thanks! I am still in the process of fully testing this; I think it still needs some tweaks, but I should have a working prototype soon!

@dachengx (Collaborator) commented Oct 2, 2024

@LuisSanchez25 I think that, by design, strax holds the prototypes of all classes, like plugins and storage, while straxen inherits those classes and makes the functionality more specific. I think this is why RucioRemoteFrontend was put in straxen.
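A rough sketch of that split, purely for illustration (the class names, the bucket_name argument, and its default below are hypothetical and not part of this PR):

import strax

class S3Frontend(strax.StorageFrontend):
    """Generic S3 prototype that could live in strax."""
    ...

class XenonS3Frontend(S3Frontend):
    """Experiment-specific subclass that would live in straxen, carrying site defaults."""
    def __init__(self, *args, bucket_name="xenonnt-data", **kwargs):  # hypothetical default
        self.bucket_name = bucket_name
        super().__init__(*args, **kwargs)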

@dachengx (Collaborator) left a review comment

Thanks @LuisSanchez25. I did not look into s3.py deeply because I see there is commented-out code, so maybe it is not finished?

I would still insist on moving the S3 functionality to straxen; strax should contain just the basic usage, the processor, etc.

You should not change save_file; instead, do what needs to be done only in the _save_chunk function (see the sketch below).
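A minimal sketch of that pattern, assuming strax's Saver._save_chunk hook with the (data, chunk_info, executor) signature; the class name S3Saver, the key scheme, and the compressor choice are illustrative, not the PR's actual implementation (only the chunk-saving path is sketched):

import bz2
import io

import boto3
import strax

class S3Saver(strax.Saver):
    def __init__(self, metadata, bucket):
        super().__init__(metadata)
        self.s3 = boto3.client("s3")  # credentials come from the usual boto3 configuration
        self.bucket = bucket

    def _save_chunk(self, data, chunk_info, executor=None):
        # Keep save_file untouched: compress the chunk in memory here and upload it under its own key
        key = f"{self.md['run_id']}-{chunk_info['chunk_i']:06d}"  # hypothetical key scheme
        payload = io.BytesIO(bz2.compress(data.tobytes()))
        self.s3.upload_fileobj(payload, self.bucket, key)
        return dict(filename=key), None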

@coveralls commented May 6, 2025

Coverage Status

coverage: 87.217% (-1.7%) from 88.958% when pulling 4bb5ddb on s3_protocol into cd79adc on master.

LuisSanchez25 requested a review from dachengx on May 7, 2025.
LuisSanchez25 requested a review from Copilot on June 4, 2025.
Copilot AI left a comment

Pull Request Overview

Adds S3 storage support to Strax, enabling reading and writing data from/to S3 buckets.

  • Introduces new S3-based load_file_from_s3 and _save_file_to_s3 helpers integrated into load_file/save_file
  • Adds an stx_file_parser utility for parsing Strax file names
  • Updates tests and dependencies to include boto3 and a basic S3 write test

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File                    Description
tests/test_storage.py   Adds test_write_data_s3 stub to verify S3 writes
strax/utils.py          New stx_file_parser function
strax/storage/files.py  Annotated run_metadata signature
strax/io.py             Extended load_file/save_file for S3, new helpers
strax/__init__.py       Exposed S3 frontend in package exports
pyproject.toml          Added boto3 dependency
Comments suppressed due to low confidence (5)

strax/io.py:84

  • [nitpick] Adding positional parameters may break existing callers. Consider making bucket_name and is_s3_path keyword-only (e.g. *, bucket_name=None) and updating docstrings to reflect their usage.
def load_file(f, compressor, dtype, bucket_name=None, is_s3_path=False):
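One possible shape for the keyword-only version (a sketch; the docstring wording is illustrative):

def load_file(f, compressor, dtype, *, bucket_name=None, is_s3_path=False):
    """Read and return data from file f.

    :param bucket_name: S3 bucket to read from; only used when is_s3_path is True.
    :param is_s3_path: If True, treat f as a key inside bucket_name rather than a local path.
    """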

tests/test_storage.py:35

  • There’s no test coverage for the new _save_file_to_s3 or load_file_from_s3 paths. Add tests that simulate S3 interactions to ensure S3 upload/download works as expected.
def test_write_data_s3(self):
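A possible shape for such a test, using a mocked boto3 client so no real bucket is needed (a sketch; it assumes _save_file_to_s3 lives in strax.io with the call signature shown later in this review, and that the upload goes through put_object or upload_fileobj):

from unittest import mock

import numpy as np
import strax

def test_save_file_to_s3_uses_the_client():
    data = np.zeros(10, dtype=[("time", np.int64), ("length", np.int32)])
    fake_s3 = mock.MagicMock()
    # Call the helper directly with the stubbed client instead of a real S3 connection
    strax.io._save_file_to_s3(fake_s3, "temp_key", data, "test-bucket", "bz2")
    # The helper should have handed the payload to the client in one of the usual ways
    assert fake_s3.put_object.called or fake_s3.upload_fileobj.called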

strax/utils.py:836

  • [nitpick] The docstring for stx_file_parser is brief and unclear about accepted formats and return structure. Expand it with examples of valid inputs and outputs.
def stx_file_parser(path: str):

tests/test_storage.py:39

  • The test checks is_configed against an empty string, but it likely returns a boolean or object. Make the condition explicit (e.g. if self.st.storage[0].is_configed:) or stub is_configed properly for the test.
if self.st.storage[0].is_configed != "":

strax/utils.py:835

  • stx_file_parser uses re.split without importing re, and if a ValueError is caught, file_data is undefined. Add import re and explicitly raise or return a default when parsing fails.
@export
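A sketch of the defensive pattern being asked for; the real parser's separator and returned fields are not visible in this diff, so the regex, the field names, and the runid-datatype-hash naming assumption below are placeholders:

import re

def stx_file_parser(path: str):
    """Split a strax file/directory name of the assumed form 'runid-datatype-lineagehash'."""
    basename = re.split(r"[/\\]", path)[-1]
    parts = basename.split("-")
    if len(parts) < 3:
        # Fail loudly instead of leaving variables undefined when parsing fails
        raise ValueError(f"Cannot parse a strax file name from {path!r}")
    run_id, data_type, lineage_hash = parts[0], parts[1], "-".join(parts[2:])
    return run_id, data_type, lineage_hash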

strax/io.py Outdated
return np.frombuffer(data, dtype=dtype)
except ValueError as e:
raise ValueError(f"ValueError while loading data with dtype =\n\t{dtype}") from e
except Exception as e:
Copilot AI commented Jun 4, 2025

The first except Exception block will catch all exceptions, so the subsequent except Exception block is never reached. Consider combining handlers or narrowing exception types to preserve the DataCorrupted path.

Suggested change
except Exception as e:
except (ValueError, KeyError) as e:

strax/io.py Outdated
Comment on lines 168 to 178
s3_interface = s3_client
# Copy temp file to final file
result = _save_file_to_s3(s3_interface, temp_fn, data, Bucket, compressor)
s3_interface.copy_object(
Bucket=Bucket,
Key=final_fn,
CopySource={"Bucket": Bucket, "Key": temp_fn},
)

# Delete the temporary file
s3_interface.delete_object(Bucket=Bucket, Key=temp_fn)
Copilot AI commented Jun 4, 2025

[nitpick] Inconsistent variable naming between s3_client and s3_interface can be confusing; pick one name for clarity.

Suggested change
s3_interface = s3_client
# Copy temp file to final file
result = _save_file_to_s3(s3_interface, temp_fn, data, Bucket, compressor)
s3_interface.copy_object(
Bucket=Bucket,
Key=final_fn,
CopySource={"Bucket": Bucket, "Key": temp_fn},
)
# Delete the temporary file
s3_interface.delete_object(Bucket=Bucket, Key=temp_fn)
# Copy temp file to final file
result = _save_file_to_s3(s3_client, temp_fn, data, Bucket, compressor)
s3_client.copy_object(
Bucket=Bucket,
Key=final_fn,
CopySource={"Bucket": Bucket, "Key": temp_fn},
)
# Delete the temporary file
s3_client.delete_object(Bucket=Bucket, Key=temp_fn)

strax/io.py Outdated
Comment on lines 121 to 124
file_data = response["Body"].read() # Read the content of the file from S3

# Create a file-like object from the binary data
file_buffer = BytesIO(file_data)
Copilot AI commented Jun 4, 2025

[nitpick] Reading the entire S3 object into memory with read() may not scale for large files. Consider streaming decompression or chunked reads to reduce memory usage.

Suggested change
file_data = response["Body"].read() # Read the content of the file from S3
# Create a file-like object from the binary data
file_buffer = BytesIO(file_data)
file_buffer = BytesIO() # Create a file-like object to store the data
for chunk in response["Body"].iter_chunks(chunk_size=DECOMPRESS_BUFFER_SIZE):
file_buffer.write(chunk)
file_buffer.seek(0) # Reset the buffer to the beginning
