From b3003969a9afdcbedf7f7a1cb9d5b762d18f2913 Mon Sep 17 00:00:00 2001
From: Lune Bellec
Date: Sat, 8 Nov 2025 11:50:07 -0500
Subject: [PATCH 1/2] Create ssh_repo_elm.md

I tried to build my first reproducible analysis with datalad repos for input / output, using data hosted on elm via ssh. And I struggled :) so I put together this tutorial with the help of chatgpt. I think I got it to work, but it's very possible I made mistakes. And I really struggled with datalad's docs. So hopefully more knowledgeable people can review and confirm this is in order, and the tutorial can save time for others (and myself) in the future.
---
 datalad/ssh_repo_elm.md | 98 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 datalad/ssh_repo_elm.md

diff --git a/datalad/ssh_repo_elm.md b/datalad/ssh_repo_elm.md
new file mode 100644
index 0000000..8ce5ed8
--- /dev/null
+++ b/datalad/ssh_repo_elm.md
@@ -0,0 +1,98 @@
+## 🧠 Datalad for Students: Minimal Reproducible Workflow
+
+### 📦 1. Create a Datalad Dataset for Data (on `elm`)
+
+**On your local machine:**
+
+```bash
+datalad create -c text2git image10k-zooniverse
+cd image10k-zooniverse
+```
+
+**Annex the big files (e.g. CSVs):**
+
+```bash
+echo "*.csv annex.largefiles=anything" >> .gitattributes
+datalad save -m "Set annex rules for CSVs"
+```
+
+**Push data to `elm`:**
+
+```bash
+datalad create-sibling \
+  --name elm \
+  --site datalad \
+  --sshurl ssh://elm/data/simexp/pbellec/image10k-zooniverse \
+  --shared all
+
+datalad push --to elm --data anything
+```
+
+**Push Git-only metadata to GitHub (optional):**
+
+```bash
+datalad create-sibling-github courtois-neuromod image10k-zooniverse \
+  --github-organization courtois-neuromod \
+  --access-protocol ssh
+
+datalad push --to origin
+```
+
+---
+
+### 👩‍💻 2. For Students: Install and Use
+
+**Clone the dataset from GitHub or `elm`:**
+
+```bash
+# Option A: from GitHub (metadata only)
+datalad install git@github.com:courtois-neuromod/image10k-zooniverse.git
+
+# Option B: from elm (knows about the data)
+datalad install ssh://elm/data/simexp/pbellec/image10k-zooniverse.git
+```
+
+**Navigate and get data:**
+
+```bash
+cd image10k-zooniverse
+datalad get Zooniverse_Results_2022_01_28.csv
+```
+
+---
+
+### 🖼 3. Managing Outputs (Optional)
+
+**Create a separate dataset for outputs:**
+
+```bash
+datalad create image10k-zooniverse.plots
+cd image10k-zooniverse.plots
+
+echo "*.png annex.largefiles=anything" >> .gitattributes
+datalad save -m "Track plots in annex"
+```
+
+**Link it back into the analysis repo:**
+
+```bash
+cd image10k-zooniverse
+datalad install -d . -s ../image10k-zooniverse.plots plots
+```
+
+---
+
+### ⚠️ Tips & Troubleshooting
+
+* If `datalad get` fails with `annex-ignore`, you likely cloned from GitHub only. Clone once from `elm` to propagate sibling config.
+* To inspect siblings:
+
+```bash
+datalad siblings
+```
+
+* To pull subdataset updates:
+
+```bash
+datalad update --merge
+```

From 0848cf559d9469a0cc1eefbd29f7a73a4da367bd Mon Sep 17 00:00:00 2001
From: Luna Bellec
Date: Sun, 9 Nov 2025 16:59:35 -0500
Subject: [PATCH 2/2] Some revisions based on feedback. Still not completely fixed.

---
 datalad/ssh_repo_elm.md | 44 +++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/datalad/ssh_repo_elm.md b/datalad/ssh_repo_elm.md
index 8ce5ed8..d4062b7 100644
--- a/datalad/ssh_repo_elm.md
+++ b/datalad/ssh_repo_elm.md
@@ -5,38 +5,49 @@
 **On your local machine:**
 
 ```bash
-datalad create -c text2git image10k-zooniverse
-cd image10k-zooniverse
+datalad create -c text2git REPONAME
+cd REPONAME
 ```
 
 **Annex the big files (e.g. CSVs):**
+It is important to configure `.gitattributes` properly so that the right files get annexed. The `text2git` configuration typically configures text files to be stored in `git` instead of being annexed. More info in the [datalad documentation](https://handbook.datalad.org/en/latest/basics/101-124-procedures.html). But you may want to set rules manually to ensure that the content you want annexed actually is. For example, if you plan to store your data in `csv` files:
 
 ```bash
 echo "*.csv annex.largefiles=anything" >> .gitattributes
 datalad save -m "Set annex rules for CSVs"
 ```
 
-**Push data to `elm`:**
-
+**Add data to the repository:**
+You can just add files to the repository and save its current state with the following command:
+```
+datalad save -m "Adding some data"
+```
+**Create a new sibling of the repository on `elm`:**
+(update the path to a location under your own USERNAME):
 ```bash
 datalad create-sibling \
   --name elm \
-  --site datalad \
-  --sshurl ssh://elm/data/simexp/pbellec/image10k-zooniverse \
-  --shared all
-
-datalad push --to elm --data anything
+  ssh://elm/data/simexp/USERNAME/REPONAME \
+  --existing=skip
+```
+**Push data to `elm`:**
+You can now easily maintain a versioned backup of your data on elm.
+```
+datalad push --to elm
 ```
 
-**Push Git-only metadata to GitHub (optional):**
-
+**Create a GitHub record of metadata:**
+First, create a repo called REPONAME on GitHub, under some organization ORGNAME (for example `courtois-neuromod`). Keep it blank, no README or LICENSE. Then, add this repo as a sibling of the dataset:
 ```bash
-datalad create-sibling-github courtois-neuromod image10k-zooniverse \
-  --github-organization courtois-neuromod \
-  --access-protocol ssh
+datalad siblings add -s origin --url git@github.com:ORGNAME/REPONAME.git
+```
 
+**Push Git-only metadata to GitHub (optional):**
+It is now easy to push metadata to GitHub:
+```
 datalad push --to origin
 ```
+Note that if you misconfigured datalad you may push sensitive data to GitHub. First, check using `ls -alsh` that the sensitive data appears as links pointing to git-annex rather than actual files. Second, keep the repo private until you're sure no sensitive data was pushed by mistake. If you pushed sensitive data by mistake, just delete the repository and start fresh if you can. Otherwise you'll need to edit the git+git-annex history of the repository, good luck :/
 
 ---
 
@@ -48,15 +59,14 @@ datalad push --to origin
 # Option A: from GitHub (metadata only)
 datalad install git@github.com:courtois-neuromod/image10k-zooniverse.git
 
-# Option B: from elm (knows about the data)
+# Option B: from elm (with the actual data)
 datalad install ssh://elm/data/simexp/pbellec/image10k-zooniverse.git
 ```
 
 **Navigate and get data:**
 
 ```bash
-cd image10k-zooniverse
-datalad get Zooniverse_Results_2022_01_28.csv
+datalad get EXAMPLEFILE.csv
 ```
 
 ---
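
Review note: the `ls -alsh` check recommended before pushing to GitHub could be scripted roughly as follows. This is a minimal sketch, assuming the dataset is in git-annex's default locked mode, where annexed files are symlinks whose targets pass through `.git/annex/objects`. `is_annexed` is a hypothetical helper name, not a datalad or git-annex command. In unlocked or adjusted-branch datasets, annexed files look like regular files, so the warning would fire spuriously there.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of datalad or git-annex): succeed if PATH is
# a symlink whose target goes through .git/annex/objects, which is what
# `ls -alsh` shows for annexed files in locked mode (an assumption).
is_annexed() {
  local path="$1"
  # A regular file means the content itself is tracked by git.
  [ -L "$path" ] || return 1
  case "$(readlink "$path")" in
    *.git/annex/objects/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Flag every CSV in the current directory that is NOT annexed.
for f in *.csv; do
  # Skip the literal pattern when the glob matches nothing; keep dangling
  # symlinks, which are annexed files whose content is not present locally.
  [ -e "$f" ] || [ -L "$f" ] || continue
  is_annexed "$f" || echo "WARNING: $f is stored in git, not in the annex"
done
```

Running this from the dataset root before `datalad push --to origin` prints one warning per CSV that would be pushed to GitHub with its full content. The glob can be swapped for whatever file types hold the sensitive data.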