Grain pool is exiting #2878

@stefan-it

Bug report

Hi,

I am getting a reproducible error when using Grain with Array Records during training:

I1223 02:12:43.059057 135058149082688 grain_pool.py:547] Shutting down multiprocessing system.
I1223 02:12:44.768291 135058149082688 grain_pool.py:542] Grain pool is exiting.
I1223 02:12:44.768418 135058149082688 grain_pool.py:547] Shutting down multiprocessing system.
I1223 02:12:44.768492 135058149082688 grain_pool.py:547] Shutting down multiprocessing system.
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/util.py", line 227, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/util.py", line 227, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/util.py", line 227, in __call__


FileNotFoundError: [Errno 2] No such file or directory
/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 27 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:292: UserWarning: resource_tracker: '/mp-juwtjh9_': [Errno 2] No such file or directory
  warnings.warn('resource_tracker: %r: %s' % (name, e))

It looks very strange. I copied all Array Records to my local v6e-8 TPU instance and started training with:

python3 -m MaxText.train src/MaxText/configs/base.yml \
  run_name=$RUN_NAME \
  base_output_directory=$DATASET_PATH/$RUN_NAME \
  dataset_type=grain \
  grain_file_type=arrayrecord \
  grain_train_files="/home/stefan/pretraining_corpus_ablation_2_1/*/*.array_record" \
  grain_worker_count=1 \
  train_split=train \
  async_checkpointing=false \
  model_name=brotchen-lm-1b \
  learning_rate=6e-06 \
  per_device_batch_size=32 \
  gradient_accumulation_steps=4 \
  num_epoch=2 \
  steps=10500 \
  max_target_length=2048 \
  packing=false \
  checkpoint_period=250 \
  tokenizer_type=huggingface \
  tokenizer_path=/home/stefan/brotchen-lm-ablation-2-1

The configuration file is here.

I could reproduce that error in at least two runs:

In the first run it occurred after 500 steps; in the second, after 4500 steps.
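
To pin down when the pool shuts down in each run, the `grain_pool.py` lines can be pulled out of a saved training log. This is only a minimal sketch, assuming the absl-style prefix seen in the log above (level letter, month and day, clock time, thread id, `file:line]`). The helper names `parse_absl_line` and `grain_pool_events` are hypothetical and not part of MaxText or Grain.

```python
import re

# absl-style log prefix, as it appears in the log above, e.g.:
#   I1223 02:12:44.768291 135058149082688 grain_pool.py:542] Grain pool is exiting.
ABSL_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<clock>\d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)


def parse_absl_line(line):
    """Return the fields of an absl-style log line as a dict, or None."""
    match = ABSL_LINE.match(line.rstrip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    return fields


def grain_pool_events(lines):
    """Yield (clock, message) for every line logged from grain_pool.py."""
    for raw in lines:
        fields = parse_absl_line(raw)
        if fields is not None and fields["file"] == "grain_pool.py":
            yield fields["clock"], fields["message"]


# Sample input copied from the log above (trailing padding included on purpose).
sample = [
    "I1223 02:12:43.059057 135058149082688 grain_pool.py:547] Shutting down multiprocessing system.   ",
    "I1223 02:12:44.768291 135058149082688 grain_pool.py:542] Grain pool is exiting.",
    "FileNotFoundError: [Errno 2] No such file or directory",
]

for clock, message in grain_pool_events(sample):
    print(clock, message)
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines. Traceback and warning lines do not match the prefix and are skipped.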

I used commit bc53aaa of MaxText.

Logs/Output

No response

Environment Information

No response

Additional Context

No response

Labels

bug (Something isn't working)
