Failed to start a parallel pool #2

@soichih

We are seeing a lot of failed jobs due to "Failed to start a parallel pool".

A couple of things we could try:

Right now, this App uses tempname() to generate the temp path for JobStorageLocation. I believe that puts it under /tmp.

I think we should create it under the current working directory instead, in case the use of /tmp is somehow causing the issue.

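% use a separate profile directory per job so that concurrent jobs don't share one and crash
profile_dir = './profile';
[ok, msg] = mkdir(profile_dir);
if ~ok
    % fail loudly instead of silently falling back to the default location
    error('Could not create %s: %s', profile_dir, msg);
end
c = parcluster();
c.JobStorageLocation = profile_dir;
pool = parpool(c, config.workers);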

Right now, this App skips setting JobStorageLocation if mkdir(tmpdir) fails.

% check and set cachedir location
if OK
    % set local storage for parpool
    clust.JobStorageLocation = tmpdir;
end

I suggest removing the "if OK" check and letting the App fail if it cannot create tmpdir (or at least adding a log message inside the block so we know when JobStorageLocation is actually being set).

I have seen a similar parpool startup failure / random MATLAB crash before. I worked around it by simply rerunning the code a few times when it fails.

https://github.com/brain-life/app-dp-modelfit/blob/master/fit_model.sh#L39

It's ugly, but it's a very simple thing to try, and for the DP App it has cured the occasional hiccups.
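
The rerun idea could look roughly like the sketch below. This is not the contents of the linked fit_model.sh; the helper name retry, the RETRY_DELAY variable, and the ./run_app.sh command in the usage comment are all hypothetical names made up for illustration.

```shell
# Hypothetical helper: run a command up to N times, stopping at the first success.
# Usage (run_app.sh is an assumed name for whatever launches MATLAB):
#   retry 3 ./run_app.sh
retry() {
    retry_max=$1
    shift
    retry_n=1
    while [ "$retry_n" -le "$retry_max" ]; do
        # Run the command exactly as given; exit status 0 means we are done.
        if "$@"; then
            return 0
        fi
        echo "attempt $retry_n of $retry_max failed: $*" >&2
        # Pause before the next attempt, but not after the last one.
        if [ "$retry_n" -lt "$retry_max" ]; then
            sleep "${RETRY_DELAY:-5}"
        fi
        retry_n=$((retry_n + 1))
    done
    # Every attempt failed; report failure to the caller.
    return 1
}
```

This only helps if a failed parpool start makes the wrapped command exit with a non-zero status, so the MATLAB side would have to turn the error into an exit code rather than swallow it.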
