slurm_status.py calls to scontrol time out for busy Slurm queues

Description

JLab is having issues with their new CE because slurm_status.py times out on the calls to scontrol [1] that are used to fill the BLAHP's Slurm job cache.

This is because scontrol doesn't (and can't) limit its query to a particular user, so it grabs all jobs, which in JLab's case is 20-30k at any given time, taking upwards of 4 minutes. We should consider replacing the scontrol calls throughout with sacct or squeue, both of which can restrict the query to a single user.
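A minimal sketch of the proposed direction (not the actual slurm_status.py code): squeue accepts -u to restrict the query to one user's jobs, so the cache refresh never pays for the full 20-30k job listing. The function names and the '%i %T' output format are illustrative choices, not taken from the BLAHP.

```python
import subprocess

def squeue_command(user):
    """Command line for a per-user, machine-readable squeue query.

    -u limits the query to a single user's jobs; -h suppresses the
    header line; -o '%i %T' prints only the job ID and job state.
    """
    return ["squeue", "-u", user, "-h", "-o", "%i %T"]

def parse_squeue_output(output):
    """Turn 'jobid state' lines into a {jobid: state} cache."""
    cache = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            jobid, state = fields
            cache[jobid] = state
    return cache

def refresh_job_cache(user):
    """Run squeue for one user and return their job-state cache."""
    result = subprocess.run(squeue_command(user), capture_output=True,
                            text=True, check=True)
    return parse_squeue_output(result.stdout)
```

sacct offers the same per-user filter (-u) and would additionally cover jobs that have already left the queue, at the cost of hitting the accounting database instead of the controller.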

[1] https://support.opensciencegrid.org/public/tickets/272d294ac8221ae93b9c11101e397b9b31765b2dc8a7fde0a78a634f34e7ea4e

Activity

Mat Selmeci
March 30, 2021, 9:33 PM

Simple fix. Review passed.

Brian Lin
March 30, 2021, 9:22 PM

I caught an issue while reviewing HTCONDOR-333. Adding Mat as a reviewer.

Jaime Frey
February 26, 2021, 6:59 PM

The only changelog I see is for the debian packaging, which has a single “Initial release” item.

Tim Theisen
January 7, 2021, 4:18 PM

You could add a one-line entry in the changelog(s).

Jaime Frey
January 6, 2021, 8:27 PM

That is a good question.

Time remaining

0m

Assignee

Brian Lin