
Add reddit dump reader ingest tool with GCS support #5

Open
borisolver wants to merge 1 commit into main from codex/add-reddit-dump-ingestion-tool-in-rust

Conversation

@borisolver
Owner

Summary

  • add a new reddit_dump_reader tool for streaming Reddit dump NDJSON files or URLs into bulk_ingest
  • set the tool to Rust 2024 edition and document that screening for CleanApp-style reports occurs downstream
  • support gs:// inputs via gsutil and document preferring signed HTTPS URLs
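
The streaming-and-batching idea in the first bullet could look roughly like the following. This is a minimal sketch using only the standard library, not the PR's code: `for_each_batch` and its `sink` callback are hypothetical names, and the real tool presumably parses each line as JSON and posts batches to bulk_ingest over HTTP.

```rust
use std::io::{self, BufRead};

// Hypothetical helper (not from the PR): reads NDJSON one line at a time and
// hands the lines to `sink` in groups of at most `batch_size`, so the whole
// dump never has to sit in memory. Returns the number of lines delivered.
fn for_each_batch<R, F>(reader: R, batch_size: usize, mut sink: F) -> io::Result<usize>
where
    R: BufRead,
    F: FnMut(&[String]) -> io::Result<()>,
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut batch: Vec<String> = Vec::with_capacity(batch_size);
    let mut total = 0;
    for line in reader.lines() {
        let line = line?;
        // Skip blank lines rather than forwarding empty records.
        if line.trim().is_empty() {
            continue;
        }
        batch.push(line);
        if batch.len() == batch_size {
            sink(&batch)?;
            total += batch.len();
            batch.clear();
        }
    }
    // Flush the final, possibly short, batch.
    if !batch.is_empty() {
        sink(&batch)?;
        total += batch.len();
    }
    Ok(total)
}
```

Any `BufRead` source works here, so a local file, a decompressor, or an HTTP body adapter could share the same batching loop.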

Testing

  • cargo test --manifest-path tools/reddit_dump_reader/Cargo.toml (fails: crates.io index blocked by network restrictions in the environment)

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines +155 to +158
&remaining,
batch_size,
&backend_url,
&fetcher_token,

P1: Spawned tasks borrow backend_url/fetcher_token

The tokio::spawn closure passes &backend_url and &fetcher_token into process_input, but spawned futures must be 'static; these references point to stack-local Strings in main and make the future non‑'static, so the crate does not compile. Clone or move owned strings into the task before spawning.

Owner Author

Good catch. I will clone backend_url and fetcher_token into owned copies before tokio::spawn so the spawned future satisfies 'static.
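
The clone-before-spawn fix can be sketched as follows. To stay dependency-free, this sketch uses std::thread::spawn as a stand-in for tokio::spawn; both require the spawned work to be 'static, so the same ownership pattern applies. `process_input` here is a hypothetical placeholder, since the PR's real signature is not shown, and `run_all` is an illustrative name.

```rust
use std::thread;

// Hypothetical stand-in for the PR's process_input; it only formats a summary.
fn process_input(input: &str, batch_size: usize, backend_url: &str, fetcher_token: &str) -> String {
    format!(
        "{input} -> {backend_url} (batch {batch_size}, token {} chars)",
        fetcher_token.len()
    )
}

// Spawns one worker per input. Passing `backend_url` or `fetcher_token` into
// the closure by reference would be rejected because the borrow is not 'static.
fn run_all(inputs: Vec<String>, batch_size: usize, backend_url: &str, fetcher_token: &str) -> Vec<String> {
    let mut handles = Vec::new();
    for input in inputs {
        // Make owned copies per task so the `move` closure captures no borrows.
        let backend_url = backend_url.to_owned();
        let fetcher_token = fetcher_token.to_owned();
        handles.push(thread::spawn(move || {
            // Borrowing from the owned copies inside the task is fine.
            process_input(&input, batch_size, &backend_url, &fetcher_token)
        }));
    }
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

With tokio the shape would be the same, with `tokio::spawn(async move { ... })` in place of the thread closure. If the strings are large or there are many tasks, wrapping them in `Arc<str>` and cloning the `Arc` is a cheaper alternative.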

Comment on lines +401 to +405
let resp = client.get(input).send().await?.error_for_status()?;
let stream = resp
    .bytes_stream()
    .map_err(|e| io::Error::new(io::ErrorKind::Other, e));
let reader = StreamReader::new(stream);

P1: HTTP reader uses map_err without StreamExt in scope

The HTTP branch builds resp.bytes_stream().map_err(...), but no StreamExt/TryStreamExt trait is imported, so .map_err is not available on the stream and the new crate fails to compile for HTTP inputs. Bring the appropriate extension trait into scope or avoid using map_err here.
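
In the futures ecosystem, `map_err` on a stream of `Result` items is provided by the `TryStreamExt` extension trait, so the likely fix is a single import such as `use futures_util::TryStreamExt;` (assuming the crate depends on futures-util or futures; the PR's Cargo.toml is not shown here). The underlying rule, that an extension-trait method only resolves when the trait is in scope, can be illustrated with a std-only analog. `TryIterExt`, `map_err_each`, and `to_io_errors` below are hypothetical names invented for this sketch.

```rust
use std::io;

mod ext {
    // Hypothetical extension trait, analogous in spirit to TryStreamExt:
    // it adds an error-mapping method to any iterator of Results.
    pub trait TryIterExt<T, E>: Iterator<Item = Result<T, E>> + Sized {
        fn map_err_each<E2, F>(self, mut f: F) -> Vec<Result<T, E2>>
        where
            F: FnMut(E) -> E2,
        {
            self.map(|item| item.map_err(&mut f)).collect()
        }
    }

    // Blanket impl: every matching iterator gets the method.
    impl<I, T, E> TryIterExt<T, E> for I where I: Iterator<Item = Result<T, E>> {}
}

// Without this import the call below does not compile: the compiler reports
// that no method named `map_err_each` was found and suggests importing the trait.
use ext::TryIterExt;

// Converts chunk errors into io::Error, mirroring what the HTTP branch does
// before handing its stream to a reader.
fn to_io_errors(chunks: Vec<Result<Vec<u8>, String>>) -> Vec<io::Result<Vec<u8>>> {
    chunks
        .into_iter()
        .map_err_each(|e| io::Error::new(io::ErrorKind::Other, e))
}
```

Commenting out the `use ext::TryIterExt;` line reproduces the same class of error the review describes.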
