Robustness against network errors #2861

@stefan-it

Feature or Model Request

Hi,

I'm currently training some ablation models on a v6e8 TPU and I sometimes get network errors.

The TPU is in europe-west4-a and the GCS bucket is in europe-west4. I access the bucket through Cloud Storage FUSE, mounted with:

# Mount the dataset bucket read-only; gcsfuse logs go to a timestamped file.
export TIMESTAMP=$(date +%Y%m%d-%H%M)
export DATASET_GCS_BUCKET=german-maxtext
export MOUNT_PATH=/tmp/gcsfuse
mkdir -p "$MOUNT_PATH"
gcsfuse -o ro --implicit-dirs --log-file="$HOME/gcsfuse_$TIMESTAMP.json" "$DATASET_GCS_BUCKET" "$MOUNT_PATH"

During my training runs I got two crashes, with:

Training stopped: `load_next_batch()` failed with <class 'RuntimeError'> exception: (Grain worker 0 failed with the following error:

Traceback (most recent call last):                                                                                                                                                     
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/grain_pool.py", line 262, in _worker_loop                                                   
    next_element = next(element_producer)                                                                                                                                              
                   ^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                              
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 582, in __call__                                 
    for element in it:                                                                                                                                                                 
                   ^^                                                                                                                                                                  
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper                                                     
    result = next_fn(iterator)                                                                                                                                                         
             ^^^^^^^^^^^^^^^^^                                                                                                                                                         
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__                                      
    element = next(self._parent)                                                                                                                                                       
              ^^^^^^^^^^^^^^^^^^                                                                                                                                                       
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper                                                     
    result = next_fn(iterator)                                                                                                                                                         
             ^^^^^^^^^^^^^^^^^                                                                                                                                                         
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/batch.py", line 230, in __next__                                    
    values.append(next(self._parent))                                                                                                                                                  
                  ^^^^^^^^^^^^^^^^^^                                                                                                                                                   
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper                                                     
    result = next_fn(iterator)                                                                                                                                                         
             ^^^^^^^^^^^^^^^^^                                                                                                                                                         
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__                                      
    element = next(self._parent)                                                                                                                                                       
              ^^^^^^^^^^^^^^^^^^                                                                                                                                                       
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper                                                     
    result = next_fn(iterator)                                                                                                                                                         
             ^^^^^^^^^^^^^^^^^                                                                                                                                                         
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__                                      
    element = next(self._parent)                                                                                                                                                       
              ^^^^^^^^^^^^^^^^^^                                                                                                                                                       
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper                                                     
    result = next_fn(iterator)                                                                                                                                                         
             ^^^^^^^^^^^^^^^^^                                                                                                                                                         
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__
    element = next(self._parent)
              ^^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
    result = next_fn(iterator)
             ^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__
    element = next(self._parent)
              ^^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
    result = next_fn(iterator)
             ^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__
    element = next(self._parent)
              ^^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
    result = next_fn(iterator)
             ^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 211, in __next__
    element = element.result()
              ^^^^^^^^^^^^^^^^
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 401, in wrapped_get_item
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 87, in _getitem
    return stats.record_bytes_consumed(parent[index])
                                       ~~~~~~^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/slice.py", line 51, in __getitem__
    return self._parent[parent_index]
           ~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/repeat.py", line 76, in __getitem__
    return self._stats.record_output_spec(self._parent[index])
                                          ~~~~~~~~~~~~^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/shuffle.py", line 78, in __getitem__
    return self._parent[shuffled_index]
           ~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 374, in wrapped
    result = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/source.py", line 86, in __getitem__
    result = self._stats.record_output_spec(self._source[index % len(self)])
                                            ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 374, in wrapped
    result = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/data_sources.py", line 121, in __getitem__
    data = super().__getitem__(record_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: pread() failed: Network is unreachable; reading /tmp/gcsfuse/brotchen-lm/pretraining_corpus_ablation_2_1/llammlein_german/train_000006.array_record; at byte 1431733143; 
Could not read from the underlying reader; at byte 0

I am using ArrayRecord files, and the relevant part of the pretraining configuration is:

  dataset_type=grain \
  grain_file_type=arrayrecord \
  grain_train_files="/tmp/gcsfuse/brotchen-lm/pretraining_corpus_ablation_2_1/*/*.array_record" \
  grain_worker_count=1 \

My feature request is to make pretraining more robust against such network errors. Perhaps this can be configured on the Cloud Storage FUSE mount side, but some kind of retry mechanism in MaxText itself would be very helpful.
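To make the idea concrete, here is a minimal sketch of what such a retry mechanism could look like. It is not MaxText or Grain code: `RetryingSource`, `is_transient`, and `TRANSIENT_MARKERS` are hypothetical names, the wrapper only assumes a random-access source exposing `__len__` and `__getitem__`, and the list of "transient" error strings is a guess based on the message in the traceback above. How it would be wired into the actual input pipeline is left open.

```python
import errno
import random
import time

# Assumption: message fragments that indicate a transient network problem.
# The first one is taken from the RuntimeError in the traceback above.
TRANSIENT_MARKERS = ("Network is unreachable", "Connection reset", "timed out")

# Assumption: OS-level error codes that are worth retrying.
TRANSIENT_ERRNOS = frozenset(
    {errno.ENETUNREACH, errno.EHOSTUNREACH, errno.ECONNRESET, errno.ETIMEDOUT}
)


def is_transient(exc):
    """Heuristically decide whether an exception looks like a network hiccup."""
    if isinstance(exc, OSError):
        return exc.errno in TRANSIENT_ERRNOS
    if isinstance(exc, RuntimeError):
        return any(marker in str(exc) for marker in TRANSIENT_MARKERS)
    return False


class RetryingSource:
    """Hypothetical wrapper adding retries with exponential backoff to a
    random-access source (anything with __len__ and __getitem__)."""

    def __init__(self, source, max_attempts=5, base_delay=1.0,
                 max_delay=60.0, sleep=time.sleep):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._source = source
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._sleep = sleep  # injectable so the wrapper can be tested quickly

    def __len__(self):
        return len(self._source)

    def __getitem__(self, index):
        for attempt in range(1, self._max_attempts + 1):
            try:
                return self._source[index]
            except Exception as exc:
                # Give up on the last attempt or on errors that do not look transient.
                if attempt == self._max_attempts or not is_transient(exc):
                    raise
                # Exponential backoff, capped at max_delay, with jitter.
                delay = min(self._max_delay, self._base_delay * 2 ** (attempt - 1))
                self._sleep(delay * random.uniform(0.5, 1.0))
```

With something like this, a read that fails with "Network is unreachable" would be retried a few times before the worker gives up, while errors that do not look transient (for example a missing file) would still surface immediately.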

Many thanks!

Additional Context

No response
