Slurm JobID cannot always be used as file name #42

@bhmevik

bart-logger uses job_id as part of the file name for the record files, and the Slurm plugin uses the sacct JobID field as job_id. If a job array is cancelled while many of its array tasks are still pending, sacct reports a single record for those pending tasks, and the JobID of that record is the full list of task IDs:

sacct -j 3460472 -o start,state,end,jobid%400
              Start      State                 End JobID
------------------- ---------- ------------------- -----
2021-09-09T12:06:50 CANCELLED+ 2021-09-09T12:06:50 3460472_[1,11,21,31,41,51,61,71,81,91,101,111,121,131,141,151,161,171,181,191,201,211,221,231,241,251,261,271,281,291,301,311,321,331,341,351,361,371,381,391,401,411,421,431,441,451,461,471,481,491,501,511,521,531,541,551,561,571,581,591,601,611,621,631,641,651,661,671,681,691,701,711,721,731,741,751,761,771,781,791,801,811,821,831,841,851,861,871,881,891,901,911,921,931,941,951,961,971,981,991]
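
The report does not say how the file name fails. One plausible cause, shown here as a minimal Python sketch, is plain length: the JobID above is rebuilt from its task IDs and compared against a 255-byte limit, which is the usual NAME_MAX on Linux file systems. The helper names and the choice of limit are assumptions made for illustration, not bart-logger code.

```python
# Assumption: 255 bytes, the usual NAME_MAX on common Linux file systems.
NAME_MAX = 255


def pending_array_job_id(job_id, task_ids):
    """Rebuild the JobID string shown above for cancelled pending array tasks."""
    return f"{job_id}_[{','.join(str(t) for t in task_ids)}]"


def fits_in_file_name(name, limit=NAME_MAX):
    """Return True if the name fits in a single file name component."""
    return len(name.encode("utf-8")) <= limit


# Task IDs 1, 11, 21, ..., 991, as in the sacct output above.
job_id = pending_array_job_id(3460472, range(1, 992, 10))
print(len(job_id), fits_in_file_name(job_id))  # prints: 398 False
```

Even where the length would fit, the brackets and commas in such a JobID make it an awkward file name.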

The problem would have been avoided with #26, since the start time and end time are both set to the cancellation time when pending jobs are cancelled, leading to zero walltime.
But perhaps a cleaner fix is to use sacct -o JobIDRaw,... instead of sacct -o JobID,.... We have tested this on one of our clusters, and I'll create a pull request for it:

sacct -j 3460472 -o start,state,end,jobidraw
              Start      State                 End     JobIDRaw 
------------------- ---------- ------------------- ------------ 
2021-09-09T12:06:50 CANCELLED+ 2021-09-09T12:06:50 3460472 
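
As a rough sketch of the proposed change, the Python below builds a sacct call that requests JobIDRaw, parses its `|`-separated output, and derives a file name from the result. All function names, the field list, and the `.json` suffix are hypothetical and not taken from bart-logger. The pattern for acceptable IDs (digits, optionally followed by a step suffix such as `.batch`) is likewise an assumption.

```python
import re

# Hypothetical names for illustration; none of this is bart-logger's actual code.
SACCT_FIELDS = ("JobIDRaw", "Start", "End", "State")

# Assumption: JobIDRaw looks like "3460472" or, for job steps, "3460472.batch".
SAFE_JOB_ID = re.compile(r"[0-9]+(\.[A-Za-z0-9_]+)?")


def sacct_command(job_id):
    """Build a sacct call that prints one '|'-separated line per record."""
    return [
        "sacct", "-j", str(job_id),
        "--noheader", "--parsable2",
        "-o", ",".join(SACCT_FIELDS),
    ]


def parse_sacct(text):
    """Turn --parsable2 output into a list of dicts keyed by field name."""
    records = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            records.append(dict(zip(SACCT_FIELDS, line.split("|"))))
    return records


def record_file_name(record, suffix=".json"):
    """Derive a record file name, refusing IDs that are not plain job IDs."""
    job_id = record["JobIDRaw"]
    if not SAFE_JOB_ID.fullmatch(job_id):
        raise ValueError(f"unexpected job id: {job_id!r}")
    return job_id + suffix
```

On a machine with Slurm installed, the output of `subprocess.run(sacct_command(3460472), capture_output=True, text=True, check=True).stdout` could be passed to `parse_sacct`. The explicit check in `record_file_name` means that a list-style ID raises an error instead of silently becoming a file name.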
