Description
We are seeing a lot of failed jobs due to "Failed to start a parallel pool".
A couple of things we could try:
Right now, this App uses tempname() to generate the temp path for JobStorageLocation. I believe it uses /tmp as the parent directory.
Instead, I think we should create it under the current working directory, in case use of /tmp is somehow causing the issue.
% Use a per-job profile directory so that multiple jobs don't share the same directory and crash.
profile_dir = './profile';
[ok, msg] = mkdir(profile_dir);
% Fail loudly if the directory cannot be created.
assert(ok, 'Failed to create %s: %s', profile_dir, msg);
c = parcluster();
c.JobStorageLocation = profile_dir;
pool = parpool(c, config.workers);
Right now, this App skips setting JobStorageLocation if mkdir(tmpdir) fails.
% check and set cachedir location
if OK
% set local storage for parpool
clust.JobStorageLocation = tmpdir;
end
I suggest removing this check and letting the App fail if it cannot create the tmpdir (or at least adding a log message inside the block so we know that JobStorageLocation is actually being set).
I have seen a similar parpool startup failure / random MATLAB crash before. I worked around it by simply rerunning the code a few times when it starts to fail.
https://github.com/brain-life/app-dp-modelfit/blob/master/fit_model.sh#L39
It's ugly, but it's a very simple thing to try, and for the DP App it has cured the occasional hiccups.
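For reference, a minimal sketch of that kind of retry wrapper in bash. This is not the code from the linked fit_model.sh; the function name `retry`, the `RETRY_DELAY` variable, and the `./run_matlab_job.sh` command in the usage comment are all hypothetical names used for illustration.

```shell
#!/bin/bash
# Hypothetical helper: run a command, retrying up to max_attempts times.
# Usage: retry <max_attempts> <command> [args...]
retry() {
    local max_attempts=$1
    shift
    local attempt=1
    while true; do
        # Run the command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying" >&2
        attempt=$((attempt + 1))
        # RETRY_DELAY (seconds) is an assumed knob; defaults to 5.
        sleep "${RETRY_DELAY:-5}"
    done
}

# Example (hypothetical command name):
#   retry 3 ./run_matlab_job.sh
```

Retrying hides the underlying cause, so it is probably worth keeping the stderr messages in the job log to see how often the first attempt fails.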