Closed
Description
bart-logger uses job_id as part of the file name for its record files, and the slurm plugin takes job_id from the sacct JobID field. If a job array is cancelled while many of its array tasks are still pending, sacct reports a single record for all of those pending tasks, and its JobID contains the full list of task indices:
sacct -j 3460472 -o start,state,end,jobid%400
Start State End JobID
------------------- ---------- ------------------- -----------------------------------------------------
2021-09-09T12:06:50 CANCELLED+ 2021-09-09T12:06:50 3460472_[1,11,21,31,41,51,61,71,81,91,101,111,121,131,141,151,161,171,181,191,201,211,221,231,241,251,261,271,281,291,301,311,321,331,341,351,361,371,381,391,401,411,421,431,441,451,461,471,481,491,501,511,521,531,541,551,561,571,581,591,601,611,621,631,641,651,661,671,681,691,701,711,721,731,741,751,761,771,781,791,801,811,821,831,841,851,861,871,881,891,901,911,921,931,941,951,961,971,981,991]
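To give a sense of scale, here is a minimal Python sketch that rebuilds the JobID above and measures it. It assumes the record file name embeds job_id verbatim; the 255-byte limit is the usual per-component limit on Linux file systems and is an assumption about the cluster, not something taken from bart-logger.

```python
# Rebuild the JobID reported above: array tasks 1, 11, ..., 991 of job 3460472.
task_ids = range(1, 1000, 10)
job_id = "3460472_[" + ",".join(str(t) for t in task_ids) + "]"

# Assumption: 255 bytes is the file name length limit (NAME_MAX) on the
# file system that holds the record files.
NAME_MAX = 255

print(len(job_id))             # 398
print(len(job_id) > NAME_MAX)  # True
```

Under that assumption, the JobID alone is already longer than a single file name may be, before any prefix or extension is added.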
The problem would have been avoided with #26, since the start time and end time of pending jobs are both set to the cancellation time when they are cancelled, giving them zero walltime.
But perhaps a cleaner fix is to use sacct -o JobIDRaw,... instead of sacct -o JobID,.... We have tested this on one of our clusters, and I'll create a pull request for it:
sacct -j 3460472 -o start,state,end,jobidraw
Start State End JobIDRaw
------------------- ---------- ------------------- ------------
2021-09-09T12:06:50 CANCELLED+ 2021-09-09T12:06:50 3460472
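A rough sketch of how a plugin could request and parse JobIDRaw follows. It is not the actual bart-logger code: the function names `sacct_command` and `parse_records` are hypothetical, and the sample line is hand-written to mimic pipe-delimited `--parsable2` output (the uid in the state field is made up).

```python
def sacct_command(job_id: str) -> list[str]:
    """Build a sacct call that asks for JobIDRaw instead of JobID."""
    return [
        "sacct", "-j", job_id,
        "-o", "start,state,end,jobidraw",
        "--parsable2",   # pipe-delimited output, no trailing delimiter
        "--noheader",    # omit the header line
    ]


def parse_records(output: str) -> list[dict[str, str]]:
    """Turn pipe-delimited sacct output into one dict per record."""
    fields = ("start", "state", "end", "job_id")
    records = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        records.append(dict(zip(fields, line.split("|"))))
    return records


# Hand-written sample line, not captured from a real cluster.
sample = "2021-09-09T12:06:50|CANCELLED by 1000|2021-09-09T12:06:50|3460472\n"

records = parse_records(sample)
print(records[0]["job_id"])  # 3460472
```

With JobIDRaw, the job_id that ends up in the file name stays a short numeric identifier, even for a cancelled array with many pending tasks.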