The status of worker-0 is Error, but the status of the MXJob is Succeeded #38
Description
kubeflow version: 0.5.0
mxnet-operator version: v1beta1
worker-0 log:
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, data_dir='/admin/public/model/mxnet_distributed/data', disp_batches=10, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='1, 28, 28', initializer='default', kv_store='dist_device_sync', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=6000, num_layers=2, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
Traceback (most recent call last):
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 99, in <module>
    fit.fit(args, sym, get_mnist_iter)
  File "/admin/public/model/mxnet_model/mxnet_distributed/common/fit.py", line 180, in fit
    (train, val) = data_loader(args, kv)
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 57, in get_mnist_iter
    'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 37, in read_data
    with gzip.open(os.path.join(args.data_dir,label)) as flbl:
  File "/opt/conda/lib/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/opt/conda/lib/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/admin/public/model/mxnet_distributed/data/train-labels-idx1-ubyte.gz'
mxjob status:
{
  "status": {
    "completionTime": "2019-05-21T08:37:24Z",
    "conditions": [
      {
        "lastTransitionTime": "2019-05-21T08:36:41Z",
        "lastUpdateTime": "2019-05-21T08:36:41Z",
        "message": "MXJob mxnet-8d1f211e is created.",
        "reason": "MXJobCreated",
        "status": "True",
        "type": "Created"
      },
      {
        "lastTransitionTime": "2019-05-21T08:36:41Z",
        "lastUpdateTime": "2019-05-21T08:36:46Z",
        "message": "MXJob mxnet-8d1f211e is running.",
        "reason": "MXJobRunning",
        "status": "False",
        "type": "Running"
      },
      {
        "lastTransitionTime": "2019-05-21T08:36:41Z",
        "lastUpdateTime": "2019-05-21T08:37:24Z",
        "message": "MXJob mxnet-8d1f211e is successfully completed.",
        "reason": "MXJobSucceeded",
        "status": "True",
        "type": "Succeeded"
      }
    ],
    "mxReplicaStatuses": {
      "Scheduler": {},
      "Server": {},
      "Worker": {}
    },
    "startTime": "2019-05-21T08:36:44Z"
  }
}
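For anyone trying to spot the same mismatch in their own jobs, here is a minimal sketch that flags the combination seen in the status above: a `Succeeded` condition set to `"True"` while every entry in `mxReplicaStatuses` is empty. The function name `succeeded_without_replica_counts` is made up for this example, and the check is only a heuristic based on the fields in this report. It does not explain why the operator marked the job as succeeded.

```python
import json


def succeeded_without_replica_counts(status):
    """Return True if the job reports Succeeded but no replica has any recorded counts.

    `status` is the object stored under the top-level "status" key.
    Hypothetical helper for illustration; not part of mxnet-operator.
    """
    conditions = status.get("conditions", [])
    succeeded = any(
        c.get("type") == "Succeeded" and c.get("status") == "True"
        for c in conditions
    )
    replica_statuses = status.get("mxReplicaStatuses", {})
    # An empty dict for a replica type means no counts were recorded for it.
    all_empty = all(not counts for counts in replica_statuses.values())
    return succeeded and all_empty


# Trimmed copy of the status from this report (only the fields the check reads).
reported = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Created", "status": "True"},
      {"type": "Running", "status": "False"},
      {"type": "Succeeded", "status": "True"}
    ],
    "mxReplicaStatuses": {"Scheduler": {}, "Server": {}, "Worker": {}}
  }
}
""")

print(succeeded_without_replica_counts(reported["status"]))
```

A `True` result here only means the status deserves a closer look, for example by comparing it against the pod phase and the worker logs.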