@tat-ohmura
Contributor

This pull request adds an executor that supports NQSV (https://www.nec.com/en/global/solutions/hpc/articles/tech08.html). This is a proprietary job scheduler developed by NEC and is used in multiple large-scale HPC systems in Japan.
To wait for the completion of a job, PSI/J currently polls the status of the job. Since NQSV offers a dedicated command (qwait) to wait for a job, we have implemented job completion detection using qwait instead of polling.
We hope to discuss and refine the implementation based on your feedback. Once the PR is reviewed and approved, we will update the documentation accordingly.
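
For readers unfamiliar with the approach, here is a minimal sketch of waiting on a blocking command instead of polling. It is not the code from this PR. The helper name `wait_for_job`, the callback signature, and the assumption that `qwait <job_id>` blocks until the job finishes and then exits are all illustrative; the `wait_cmd` parameter exists only so the sketch can be tried on a machine without NQSV.

```python
import subprocess
import threading
from typing import Callable, List, Optional


def wait_for_job(job_id: str,
                 on_done: Callable[[str, int], None],
                 wait_cmd: Optional[List[str]] = None) -> threading.Thread:
    """Run a blocking wait command in a background thread.

    Assumption: the default command is ['qwait', job_id]. Its exact
    arguments and the meaning of its exit code are not taken from the PR.
    """
    cmd = wait_cmd if wait_cmd is not None else ['qwait', job_id]

    def _run() -> None:
        # subprocess.run returns only when the wait command exits, so no
        # periodic status query is issued while the job is running.
        result = subprocess.run(cmd,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        on_done(job_id, result.returncode)

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread
```

The callback runs on the background thread, so a real executor would need to hand the result back to its own state machine in a thread-safe way.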

@hategan requested review from andre-merzky and hategan on May 29, 2025, 16:50
@hategan
Collaborator

hategan commented May 29, 2025

Thank you again @tat-ohmura.

This seems like a pretty straightforward addition. We will review the code in the following days. I have a couple of general comments:

  • We use polling because it limits the interaction with the scheduler independently of the number of active jobs being managed. This prevents PSI/J from overwhelming schedulers when many jobs are being submitted and managed. If NQSV's qwait is an alternative that you believe scales well, it is perfectly reasonable to use that.
  • It would be nice if you could set up the daily PSI/J tests on one or more NQSV machines so that we can keep track of the executor on https://testing.psij.io. Please let us know if you have any questions about that or need help doing so.
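
The first point, that polling keeps scheduler interaction independent of the number of managed jobs, can be illustrated with a rough sketch. This is not PSI/J's actual implementation: `BatchPoller`, `query_all`, and the state names are hypothetical, chosen only to show one status query covering all active jobs per cycle.

```python
from typing import Callable, Dict, List, Set

# Placeholder final-state names, used only for this illustration.
FINAL_STATES = ('COMPLETED', 'FAILED', 'CANCELED')


class BatchPoller:
    """Track many jobs but issue one status query per polling cycle."""

    def __init__(self, query_all: Callable[[List[str]], Dict[str, str]]) -> None:
        # query_all is a hypothetical function that asks the scheduler for
        # the state of every given job ID in a single call.
        self._query_all = query_all
        self._active: Set[str] = set()
        self.query_count = 0

    def add(self, job_id: str) -> None:
        self._active.add(job_id)

    def active_count(self) -> int:
        return len(self._active)

    def poll_once(self) -> Dict[str, str]:
        """Query all active jobs at once and drop the finished ones."""
        if not self._active:
            return {}
        self.query_count += 1
        states = self._query_all(sorted(self._active))
        for job_id, state in states.items():
            if state in FINAL_STATES:
                self._active.discard(job_id)
        return states
```

Whether 1 job or 1,000 jobs are active, each call to `poll_once` costs the scheduler one query, which is the property described above.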

@tat-ohmura
Contributor Author

Thank you for your prompt reply.

We are considering using PSI/J as a workflow framework for Urgent Computing, where time constraints are particularly strict. Since we want to detect job completion as quickly as possible, but shortening the polling interval would put a heavy load on the job scheduler, we have decided to use the qwait command.
The qwait command detects job completion in an event-driven manner. Therefore, the load is not excessively high, and I believe it has good scalability.
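
The trade-off described here can be made concrete with a small back-of-the-envelope sketch. It assumes one batched status query per polling interval and a job that finishes at a uniformly random point within an interval; the function name and the returned keys are illustrative only.

```python
from typing import Dict


def polling_tradeoff(interval_s: float) -> Dict[str, float]:
    """Estimate detection delay and query rate for a polling interval.

    Assumptions: one status query per interval, and the completion time
    is uniformly distributed within the interval.
    """
    if interval_s <= 0:
        raise ValueError('interval_s must be positive')
    return {
        # On average the job finishes halfway between two polls.
        'mean_delay_s': interval_s / 2.0,
        # Worst case: the job finishes just after a poll.
        'worst_delay_s': interval_s,
        # Shorter intervals mean proportionally more scheduler queries.
        'queries_per_hour': 3600.0 / interval_s,
    }
```

Under these assumptions, cutting the interval from 30 seconds to 1 second reduces the mean detection delay from 15 seconds to 0.5 seconds but raises the query rate from 120 to 3600 per hour, which is the load concern that motivates an event-driven wait.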

We would like to explore the possibility of preparing an NQSV machine. If there are any documents outlining the steps for setting up a machine in the https://testing.psij.io/ environment, could you please share them with us?

@hategan
Collaborator

hategan commented May 30, 2025

[...]

> We are considering using PSI/J as a workflow framework for Urgent Computing, where time constraints are particularly strict. Since we want to detect job completion as quickly as possible, but shortening the polling interval would put a heavy load on the job scheduler, we have decided to use the qwait command. The qwait command detects job completion in an event-driven manner. Therefore, the load is not excessively high, and I believe it has good scalability.

That is fine. I am glad that a good solution exists for NQSV.

> We would like to explore the possibility of preparing an NQSV machine. If there are any documents outlining the steps for setting up a machine in the https://testing.psij.io/ environment, could you please share them with us?

This should get you started: https://github.com/ExaWorks/psij-python/blob/main/README-testing.md

Since this is a new scheduler, we may want to add some code around

    env['has_slurm'] = shutil.which('sbatch') is not None

to detect NQSV. I can do that if you know of a simple command (e.g., qsub -v | grep xyz) that can be used to tell if NQSV is installed.
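
One possible shape for such a check, following the same pattern as the `has_slurm` line, is sketched below. The `has_nqsv` key and the `detect_schedulers` helper are hypothetical, and using the presence of `qwait` on the PATH as the signal is an untested assumption: `qsub` is shared with other schedulers, and whether `qwait` is unique to NQSV is not confirmed in this thread.

```python
import shutil
from typing import Callable, Dict, Optional


def detect_schedulers(
        which: Callable[[str], Optional[str]] = shutil.which) -> Dict[str, bool]:
    """Report which schedulers appear to be installed.

    The `which` parameter defaults to shutil.which and can be replaced
    with a stub for testing.
    """
    env: Dict[str, bool] = {}
    env['has_slurm'] = which('sbatch') is not None
    # Assumption: a `qwait` executable on the PATH suggests NQSV.
    # 'has_nqsv' is a hypothetical key, not one defined by PSI/J here.
    env['has_nqsv'] = which('qwait') is not None
    return env
```

A stricter check could inspect the output of a version command, but the exact command and output format would need to come from someone with access to an NQSV system.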

If you run into difficulties, it may help to chat on Slack. You can use the link at the top-right on https://psij.io, or I can send you an invite if that doesn't work well.

Collaborator

@hategan left a comment

This looks good.

Codespell flagged a misspelled word in a comment. I won't merge right away in case you want to fix that, but it's not important.

@hategan merged commit dc1b82e into ExaWorks:main on Jun 5, 2025
11 of 13 checks passed