Convert the ReadingBank dataset (from LayoutReader: Pre-training of Text and Layout for Reading Order Detection) into a Hugging Face Dataset with train/dev/test splits.
- Original Dataset: https://mail2sysueducn-my.sharepoint.com/:u:/g/personal/huangyp28_mail2_sysu_edu_cn/Efh3ZWjsA-xFrH2FSjyhSVoBMak6ypmbABWmJEmPwtKhhw?e=tbthMD
- Processed Dataset: https://huggingface.co/datasets/albertklorer/readingbank
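To use the processed dataset directly, something like the following may work. It is a minimal sketch that assumes the `datasets` library is installed (`pip install datasets`) and that the repository is publicly readable. The helper name `load_readingbank` is hypothetical and not part of this project.

```python
# Repository ID taken from the "Processed Dataset" link above.
REPO_ID = "albertklorer/readingbank"


def load_readingbank(repo_id: str = REPO_ID):
    """Download the dataset from the Hugging Face Hub and return all of its splits.

    Hypothetical helper for illustration; not part of main.py.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id)
```

Calling `load_readingbank()` downloads the data on first use. Printing the returned object lists the available splits and their columns, which is the safest way to check the exact split names before indexing into them.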
- macOS/Linux:
  python -m venv .venv
  source .venv/bin/activate
- Windows (PowerShell):
  python -m venv .venv
  .venv\Scripts\activate
pip install -r requirements.txt
python main.py --dataset_directory /path/to/ReadingBank --hub_repo your-username/readingbank --hf_token hf_your_token_here
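For orientation, the three flags in the command above could be parsed along these lines. This is a sketch only, not the actual contents of `main.py`: the function name `build_parser`, the help texts, and the choice of which flags are required are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the command above (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Convert ReadingBank into a Hugging Face Dataset."
    )
    # Assumption: the local path to the downloaded ReadingBank data is mandatory.
    parser.add_argument("--dataset_directory", required=True,
                        help="Path to the extracted ReadingBank dataset.")
    # Assumption: the target Hub repository is mandatory.
    parser.add_argument("--hub_repo", required=True,
                        help="Target repository, e.g. your-username/readingbank.")
    # Assumption: the token is optional here; the real script may require it.
    parser.add_argument("--hf_token", default=None,
                        help="Hugging Face access token with write permission.")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args([
    "--dataset_directory", "/path/to/ReadingBank",
    "--hub_repo", "your-username/readingbank",
    "--hf_token", "hf_your_token_here",
])
print(args.hub_repo)
```

Passing the token on the command line can leave it in shell history, so reading it from an environment variable may be preferable if the script supports that.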