Add Streaming Dataset Loader Support #55
Conversation
Great point you are raising! We just pushed a big update to the …
Hi @lusxvr, let me know what you think! 😊
When I try this, I get a bunch of errors in the image processing:
Hi @lusxvr, thanks a lot for your feedback! 🙌 Quick question: what data_cutoff_idx are you currently using in your config file? I ran a test using a small cutoff of just 1000 images, and everything trained without any issues on my side. Here's a quick summary from my run: I'd love to compare settings and see if anything on your end might be affecting the training. Let me know!
I used 2000. Some of the datasets are a bit different, so the error might only show up with later samples.
Hi @lusxvr, I've pulled the latest changes and ran two tests, one with data_cutoff_idx=2000 and another with 10000. Here are the results. However, I wasn't able to reproduce the Error processing image at index .... The only difference on my side is the batch size, which I set lower since I don't have enough RAM to run with 256. All other configurations remain the same.
data_cutoff_idx: int = 2000
data_cutoff_idx: int = 10000
I also tested it directly on Google Colab, and it runs fine there as well. Here's the code I used. I haven't added it to the Colab example in the repo, but if you think it would be helpful, I'd be happy to update it! Let me know what you think.
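For context, a minimal sketch of what a two-cutoff smoke test along these lines could look like with the Hugging Face datasets library; the subset name is an assumption, and this is not the actual code from the Colab run above:

```python
import time

from datasets import load_dataset

# Stream a Cauldron subset twice, stopping at each cutoff, and time both runs.
# "ai2d" is only an assumed example subset; any config of the collection works.
for cutoff in (2000, 10000):
    start = time.time()
    stream = load_dataset(
        "HuggingFaceM4/the_cauldron",
        "ai2d",
        split="train",
        streaming=True,  # returns an IterableDataset; nothing is downloaded up front
    )
    count = sum(1 for _ in stream.take(cutoff))  # take() caps the stream at the cutoff
    print(f"cutoff={cutoff}: streamed {count} samples in {time.time() - start:.1f}s")
```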
Hey @lusxvr! Just checking in: let me know if there's anything else you'd like me to add or adjust in this PR. I think this feature could really help folks with limited RAM setups, so I'd love to help push it forward.
Taking a look right now :)
I don't know if it is just because I am running it for the first time and it has to fill the cache anew due to the new download, but for the moment it seems very slow to me (and I don't know why). After my comment (10 min ago), I pulled the PR and started a run with 15000 samples, and it is still at "Resolving Data Files".
Hm, sorry to report that I still have the same error. It comes from …; I don't know why, though, at the moment.
Unfortunately, it does not seem to work correctly for me. There is still the error with processing the image files, and in addition, even when I relaunch a training run with the same parameters and streaming support, it takes a substantial amount of time to load the data. It seems to me that since we are still taking samples from every individual dataset until we reach the cutoff index, the streaming does not work correctly yet / does not fall back efficiently to the cache. I believe that for low-resource scenarios it is viable to just select one or two small datasets individually from the whole Cauldron; I tried this in Colab and it works quite well (see notebook in the repo). While I still really like the idea of enabling streaming for the datasets, I unfortunately cannot reproduce it correctly on my machine at the moment, so I cannot merge it just yet :(
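For reference, a rough sketch of the low-resource alternative described in the comment above, loading just one or two small Cauldron subsets without streaming via the Hugging Face datasets library; the subset names below are assumptions, not necessarily the ones used in the repo's notebook:

```python
from datasets import concatenate_datasets, load_dataset

# Assumed example subsets; pick whichever small configs fit the available disk/RAM.
small_subsets = ["ai2d", "tqa"]

parts = [
    load_dataset("HuggingFaceM4/the_cauldron", name, split="train")
    for name in small_subsets
]

# Cauldron subsets are expected to share an (images, texts) schema,
# so the parts can be concatenated into one training set.
train_ds = concatenate_datasets(parts)
print(len(train_ds))
```

Since each subset is downloaded and cached in full, this avoids the per-dataset streaming and file-resolution overhead while keeping the total footprint small.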
Congrats again for this amazing repo! 🎉
This PR adds support for streaming datasets when using large-scale datasets with data_cutoff_idx. By enabling streaming, only the required number of samples is loaded on the fly, significantly reducing disk usage. This addresses #54.
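As a rough illustration of the streaming idea (not the PR's actual implementation), stopping a streamed Hugging Face dataset at the cutoff could look like the sketch below; the subset name and the preprocessing hand-off are assumptions:

```python
from datasets import load_dataset

data_cutoff_idx = 1000  # mirrors the config field referenced above

# streaming=True returns an IterableDataset: samples are fetched lazily,
# so the full dataset never has to be downloaded to disk first.
stream = load_dataset(
    "HuggingFaceM4/the_cauldron",
    "ai2d",  # assumed example subset
    split="train",
    streaming=True,
)

# take() stops the stream after data_cutoff_idx samples instead of
# materializing the whole split.
for sample in stream.take(data_cutoff_idx):
    pass  # hand the sample to the usual image/text preprocessing here
```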