Feature or Model Request
Hi,
I'm currently training some ablation models on a v6e8 TPU and I sometimes get network errors.
The TPU is in europe-west4-a and the GCS bucket is in europe-west4. I access the bucket through Cloud Storage FUSE, mounted with:
export TIMESTAMP=$(date +%Y%m%d-%H%M)
export DATASET_GCS_BUCKET=german-maxtext
export MOUNT_PATH=/tmp/gcsfuse
mkdir -p /tmp/gcsfuse
gcsfuse -o ro --implicit-dirs --log-file=$HOME/gcsfuse_$TIMESTAMP.json "$DATASET_GCS_BUCKET" "$MOUNT_PATH"
During training I got two crashes, with:
Training stopped: `load_next_batch()` failed with <class 'RuntimeError'> exception: (Grain worker 0 failed with the following error:
Traceback (most recent call last):
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/grain_pool.py", line 262, in _worker_loop
next_element = next(element_producer)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 582, in __call__
for element in it:
^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
result = next_fn(iterator)
^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__
element = next(self._parent)
^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
result = next_fn(iterator)
^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/batch.py", line 230, in __next__
values.append(next(self._parent))
^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
result = next_fn(iterator)
^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__
element = next(self._parent)
^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
result = next_fn(iterator)
^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__
element = next(self._parent)
^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
result = next_fn(iterator)
^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__
element = next(self._parent)
^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
result = next_fn(iterator)
^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__
element = next(self._parent)
^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
result = next_fn(iterator)
^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/map.py", line 263, in __next__
element = next(self._parent)
^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 325, in wrapper
result = next_fn(iterator)
^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 211, in __next__
element = element.result()
^^^^^^^^^^^^^^^^
File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/stefan/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 401, in wrapped_get_item
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 87, in _getitem
return stats.record_bytes_consumed(parent[index])
~~~~~~^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/slice.py", line 51, in __getitem__
return self._parent[parent_index]
~~~~~~~~~~~~^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/repeat.py", line 76, in __getitem__
return self._stats.record_output_spec(self._parent[index])
~~~~~~~~~~~~^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/shuffle.py", line 78, in __getitem__
return self._parent[shuffled_index]
~~~~~~~~~~~~^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 374, in wrapped
result = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/transformations/source.py", line 86, in __getitem__
result = self._stats.record_output_spec(self._source[index % len(self)])
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/dataset/stats.py", line 374, in wrapped
result = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/maxtext/maxtext_venv_2/lib/python3.12/site-packages/grain/_src/python/data_sources.py", line 121, in __getitem__
data = super().__getitem__(record_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: pread() failed: Network is unreachable; reading /tmp/gcsfuse/brotchen-lm/pretraining_corpus_ablation_2_1/llammlein_german/train_000006.array_record; at byte 1431733143;
Could not read from the underlying reader; at byte 0
I am using ArrayRecord files, and the pretraining configuration is:
dataset_type=grain \
grain_file_type=arrayrecord \
grain_train_files="/tmp/gcsfuse/brotchen-lm/pretraining_corpus_ablation_2_1/*/*.array_record" \
grain_worker_count=1 \
My feature request is to make pretraining more robust against these network errors. Perhaps this can be configured on the Cloud Storage FUSE mount side, but some kind of retry mechanism in MaxText would be very helpful.
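To illustrate the kind of retry mechanism I mean, here is a minimal sketch, not existing MaxText or Grain code. `RetryingSource` and its parameters (`max_attempts`, `base_delay`, `max_delay`, `sleep`) are hypothetical names. It assumes the data source is random-access (`__len__` and `__getitem__`) and that a transient read failure surfaces as `RuntimeError` or `OSError`, as in the traceback above; a real implementation would probably want to match the error more narrowly.

```python
import time


class RetryingSource:
    """Hypothetical wrapper that retries reads from a random-access source.

    Assumption: transient gcsfuse failures surface as RuntimeError or OSError
    from the wrapped source's __getitem__.
    """

    def __init__(self, source, max_attempts=5, base_delay=1.0,
                 max_delay=60.0, sleep=time.sleep):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._source = source
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._sleep = sleep  # injectable so the backoff can be tested

    def __len__(self):
        return len(self._source)

    def __getitem__(self, index):
        delay = self._base_delay
        for attempt in range(1, self._max_attempts + 1):
            try:
                return self._source[index]
            except (RuntimeError, OSError):
                # Out of attempts: re-raise the original error unchanged.
                if attempt == self._max_attempts:
                    raise
                # Exponential backoff, capped at max_delay seconds.
                self._sleep(delay)
                delay = min(delay * 2, self._max_delay)
```

Something like this would presumably wrap the ArrayRecord data source before it is handed to Grain, so that a short network outage delays a batch instead of stopping the training run.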
Many thanks!
Additional Context
No response