Raja-89 commented Dec 17, 2025

Docs: Clarify usage instructions for English pipeline

Description

This PR updates the README to clarify execution details for the English YouTube
caption processing pipeline. It documents several assumptions that are currently
implicit in the workflow, including script execution location, output handling,
and expected directory structure.

Motivation and Context

While testing the English pipeline as a new contributor, I noticed a few execution
details that are not currently explicit in the documentation. In particular:

  • Scripts in the english/ directory are expected to be run from that directory,
    rather than from the repository root.
  • Some Python scripts write output to STDOUT by default and rely on shell
    redirection to create output files.
  • The pipeline assumes the presence of a specific sibling directory structure
    (e.g. test_corpus/vtt, test_corpus/json, test_corpus/conll_input) that must
    exist prior to running the scripts.

Documenting these assumptions should make initial setup and onboarding clearer
for new contributors, including those working on Windows who may need to create
directories manually.
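
As a rough illustration of the directory setup described above, here is a minimal, platform-independent sketch. The subdirectory names come from this description; the helper name `ensure_corpus_dirs` and the idea of creating the directories from Python (rather than by hand or with `mkdir -p`) are assumptions for illustration and are not part of the repository.

```python
from pathlib import Path

# Subdirectory names taken from the PR description. Where test_corpus
# lives relative to english/ is an assumption the caller must supply.
REQUIRED_SUBDIRS = ("vtt", "json", "conll_input")


def ensure_corpus_dirs(corpus_root):
    """Create corpus_root and its expected subdirectories if they are missing.

    Hypothetical helper: returns the list of directory paths that now exist.
    """
    root = Path(corpus_root)
    ensured = []
    for name in REQUIRED_SUBDIRS:
        path = root / name
        # parents=True creates corpus_root itself when needed;
        # exist_ok=True makes repeated calls harmless.
        path.mkdir(parents=True, exist_ok=True)
        ensured.append(path)
    return ensured
```

Calling `ensure_corpus_dirs("test_corpus")` from the directory that should contain the corpus behaves the same on Windows and Unix-like systems, which sidesteps the differences between shell `mkdir` variants.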

🛠️ Changes

  • Added a Contributor Guidance section to the README.
  • Clarified where English pipeline scripts are expected to be executed from.
  • Added an example showing STDOUT redirection for direct script usage.
  • Documented the required directory hierarchy used by the pipeline.
  • Added brief platform notes regarding Windows and Unix-like environments.
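
For readers who prefer not to depend on shell syntax, the working-directory and STDOUT-redirection behavior can also be reproduced from Python. This is only a sketch under stated assumptions: `run_with_redirect` is a hypothetical helper, and the script name passed to it stands in for whichever pipeline script is being run; neither is taken from the repository.

```python
import subprocess
import sys
from pathlib import Path


def run_with_redirect(script_name, output_path, workdir, args=()):
    """Run a script from workdir and write its STDOUT to output_path.

    Hypothetical helper mirroring `cd workdir && python script > output`.
    script_name is resolved relative to workdir; output_path is resolved
    relative to the caller's current directory unless it is absolute.
    """
    output_path = Path(output_path)
    # The redirection target's directory must exist before writing.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as out:
        completed = subprocess.run(
            [sys.executable, script_name, *args],
            cwd=workdir,   # run from the script's own directory
            stdout=out,    # equivalent of `> output_path` in a shell
            check=True,    # raise CalledProcessError on a non-zero exit
        )
    return completed.returncode
```

STDERR is left untouched, so error messages from the script still appear in the terminal, just as they would with plain `>` redirection.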

📂 Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

✅ Checklist

  • I have read the CONTRIBUTING guidelines.
  • This PR changes documentation only; no code is modified.
  • The instructions were verified by running the commands locally.

Raja-89 commented Dec 17, 2025

PR opened - awaiting review.

Raja-89 commented Dec 21, 2025

Hi! Thanks for maintaining this repository.

While testing the English YouTube pipeline end-to-end as a new contributor, I ran into a few undocumented execution assumptions (working directory, required output folders, STDOUT behavior, platform differences). I’ve documented these here to make onboarding smoother.

I also drafted a minimal reproducible run and a short list of common failure modes locally while testing the pipeline further. Happy to share or extend this if it would be useful for contributor onboarding.

Thanks again for your time and feedback!
