Bug report
Hi,
I am getting a reproducible error when using Grain with Array Records during training:
I1223 02:12:43.059057 135058149082688 grain_pool.py:547] Shutting down multiprocessing system.
I1223 02:12:44.768291 135058149082688 grain_pool.py:542] Grain pool is exiting.
I1223 02:12:44.768418 135058149082688 grain_pool.py:547] Shutting down multiprocessing system.
I1223 02:12:44.768492 135058149082688 grain_pool.py:547] Shutting down multiprocessing system.
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/util.py", line 227, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/util.py", line 227, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/util.py", line 227, in __call__
FileNotFoundError: [Errno 2] No such file or directory
/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 27 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:292: UserWarning: resource_tracker: '/mp-juwtjh9_': [Errno 2] No such file or directory
  warnings.warn('resource_tracker: %r: %s' % (name, e))
It looks very strange. I copied all Array Records to my local v6e-8 TPU instance and started training with:
python3 -m MaxText.train src/MaxText/configs/base.yml \
run_name=$RUN_NAME \
base_output_directory=$DATASET_PATH/$RUN_NAME \
dataset_type=grain \
grain_file_type=arrayrecord \
grain_train_files="/home/stefan/pretraining_corpus_ablation_2_1/*/*.array_record" \
grain_worker_count=1 \
train_split=train \
async_checkpointing=false \
model_name=brotchen-lm-1b \
learning_rate=6e-06 \
per_device_batch_size=32 \
gradient_accumulation_steps=4 \
num_epoch=2 \
steps=10500 \
max_target_length=2048 \
packing=false \
checkpoint_period=250 \
tokenizer_type=huggingface tokenizer_path=/home/stefan/brotchen-lm-ablation-2-1
The configuration file is here.
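As a quick sanity check on the grain_train_files pattern from the command above, here is a minimal sketch that counts the files the pattern matches and sums their sizes. The helper name summarize_matches is hypothetical, and it is only an assumption that MaxText expands the pattern the same way Python's glob module does.

```python
import glob
import os


def summarize_matches(pattern: str) -> dict:
    """Count the files matching a glob pattern and sum their sizes in bytes."""
    paths = sorted(glob.glob(pattern))
    total_bytes = sum(os.path.getsize(p) for p in paths)
    # "first" holds up to three matched paths, handy for eyeballing the layout.
    return {"num_files": len(paths), "total_bytes": total_bytes, "first": paths[:3]}


if __name__ == "__main__":
    # Pattern copied from the training command; adjust for your machine.
    pattern = "/home/stefan/pretraining_corpus_ablation_2_1/*/*.array_record"
    print(summarize_matches(pattern))
```

If num_files is 0 or much smaller than expected, the input pattern is worth checking before looking at the multiprocessing shutdown messages.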
I could reproduce the error in at least two runs: in the first run it occurred after 500 steps, and in the second after 4500 steps.
I used commit bc53aaa of MaxText.
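Since the traceback comes from Python's multiprocessing cleanup, the interpreter version, the multiprocessing start method, and the installed package versions are likely relevant. The following is a minimal sketch for collecting them; the function name collect_environment is hypothetical, and the distribution names in PACKAGES are assumptions that may not match what is installed.

```python
import multiprocessing
import platform
import sys
from importlib import metadata

# Assumed distribution names; adjust if your install uses different ones.
PACKAGES = ("grain", "array_record", "jax", "jaxlib")


def collect_environment(packages=PACKAGES) -> dict:
    """Gather interpreter, platform, and package version details."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # allow_none=True reports None when no start method has been fixed yet,
        # instead of fixing the default as a side effect.
        "mp_start_method": multiprocessing.get_start_method(allow_none=True),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Run it inside the same virtual environment that launches MaxText.train so the reported versions match the failing run.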
Logs/Output
No response
Environment Information
No response
Additional Context
No response