This repository was archived by the owner on Feb 1, 2022. It is now read-only.

The status of worker-0 is Error, but the status of the MXJob is Succeeded #38

@magicmopper


kubeflow version: 0.5.0
mxnet-operator version: v1beta1

Kubernetes dashboard display: (screenshot not preserved)

worker-0 log:
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, data_dir='/admin/public/model/mxnet_distributed/data', disp_batches=10, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='1, 28, 28', initializer='default', kv_store='dist_device_sync', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=6000, num_layers=2, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
Traceback (most recent call last):
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 99, in <module>
    fit.fit(args, sym, get_mnist_iter)
  File "/admin/public/model/mxnet_model/mxnet_distributed/common/fit.py", line 180, in fit
    (train, val) = data_loader(args, kv)
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 57, in get_mnist_iter
    'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 37, in read_data
    with gzip.open(os.path.join(args.data_dir,label)) as flbl:
  File "/opt/conda/lib/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/opt/conda/lib/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/admin/public/model/mxnet_distributed/data/train-labels-idx1-ubyte.gz'
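
The worker fails because a data file is missing under data_dir. A minimal sketch of a pre-flight check that lists the missing files before training starts is shown below. It is not part of the original train_mnist.py: the helper name missing_data_files and the REQUIRED_FILES constant are hypothetical, and only the two file names are taken from the traceback.

```python
import os

# File names copied from the traceback above; everything else here is
# a hypothetical helper, not code from the original training script.
REQUIRED_FILES = ("train-labels-idx1-ubyte.gz", "train-images-idx3-ubyte.gz")


def missing_data_files(data_dir, names=REQUIRED_FILES):
    """Return the full paths under data_dir that are not existing files."""
    paths = [os.path.join(data_dir, name) for name in names]
    return [path for path in paths if not os.path.isfile(path)]
```

Calling missing_data_files(args.data_dir) at the start of the script and exiting with a clear message when the list is non-empty would make this kind of path problem easier to spot in the worker log. It does not change how the operator reports the job status.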

mxjob status:

{
  "status": {
        "completionTime": "2019-05-21T08:37:24Z",
        "conditions": [
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:36:41Z",
                "message": "MXJob mxnet-8d1f211e is created.",
                "reason": "MXJobCreated",
                "status": "True",
                "type": "Created"
            },
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:36:46Z",
                "message": "MXJob mxnet-8d1f211e is running.",
                "reason": "MXJobRunning",
                "status": "False",
                "type": "Running"
            },
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:37:24Z",
                "message": "MXJob mxnet-8d1f211e is successfully completed.",
                "reason": "MXJobSucceeded",
                "status": "True",
                "type": "Succeeded"
            }
        ],
        "mxReplicaStatuses": {
            "Scheduler": {},
            "Server": {},
            "Worker": {}
        },
        "startTime": "2019-05-21T08:36:44Z"
  }
}
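
To make the mismatch easier to see, here is a small sketch that summarizes a status object like the one above. It assumes the status is already available as a Python dict; the function name summarize_status is hypothetical, and the embedded JSON is a trimmed copy of the reported status that keeps only the fields the function reads.

```python
import json


def summarize_status(status):
    """Return (condition types whose status is "True", replica types with no counters)."""
    active = [
        cond["type"]
        for cond in status.get("conditions", [])
        if cond.get("status") == "True"
    ]
    empty = sorted(
        name
        for name, counts in status.get("mxReplicaStatuses", {}).items()
        if not counts
    )
    return active, empty


# Trimmed copy of the status reported above (timestamps and messages omitted).
reported = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Created", "status": "True"},
      {"type": "Running", "status": "False"},
      {"type": "Succeeded", "status": "True"}
    ],
    "mxReplicaStatuses": {"Scheduler": {}, "Server": {}, "Worker": {}}
  }
}
""")

active, empty = summarize_status(reported["status"])
print(active)  # ['Created', 'Succeeded']
print(empty)   # ['Scheduler', 'Server', 'Worker']
```

For the reported job, the Succeeded condition is "True" while none of the replica types carries any counter, so the status itself contains no record of the failed worker.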
