Schedd crash on delayed logging of release or force-remove

Description

If a job leaves the job queue while a release or force-remove action for it is waiting in act_on_job_myself_queue, the schedd will crash when that action is later pulled from act_on_job_myself_queue for processing. Scheduler::actOnJobMyselfHandler() calls functions that assume the job is still in the queue. We need to add a check that returns early if the job is gone.

We saw this happen on a CHTC submit machine on Sunday, Feb 28. A user rapidly held, released, and removed a DAG with idle node jobs. The release events for all of the jobs ended up in the long act_on_job_myself_queue (10k entries). The removal event for the DAG job ended up in the short stop_job_queue. When the DAG job’s removal was processed, the idle node jobs were taken out of the queue immediately (as dependents of the DAG job). Later, the release events were processed (after the jobs were gone).
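The sequence above can be sketched in a small standalone model. This is not the schedd code: JobQueue, DeferredAction, handleDeferredAction() and drainDeferredActions() are hypothetical stand-ins, and the real check would go in Scheduler::actOnJobMyselfHandler(). The sketch only shows the shape of the proposed early-out, where a deferred action whose job has already left the queue is logged and skipped instead of being acted on.

```cpp
#include <deque>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for illustration; these are not HTCondor types.
enum class Action { Release, ForceRemove };

struct DeferredAction {
    int job_id;
    Action action;
};

struct Job {
    std::string status;
};

using JobQueue = std::unordered_map<int, Job>;

// Process one deferred action. Returns true if the action was applied,
// false if it was skipped because the job had already left the queue.
bool handleDeferredAction(JobQueue &jobs, const DeferredAction &item) {
    auto it = jobs.find(item.job_id);
    if (it == jobs.end()) {
        // Early-out: the job was removed while this action was waiting.
        std::cerr << "Job " << item.job_id
                  << " is no longer in the queue; skipping deferred action\n";
        return false;
    }
    if (item.action == Action::Release) {
        it->second.status = "Idle";
    } else {
        jobs.erase(it);
    }
    return true;
}

// Drain the deferred-action queue; returns the number of actions applied.
int drainDeferredActions(JobQueue &jobs, std::deque<DeferredAction> &pending) {
    int applied = 0;
    while (!pending.empty()) {
        DeferredAction item = pending.front();
        pending.pop_front();
        if (handleDeferredAction(jobs, item)) {
            ++applied;
        }
    }
    return applied;
}
```

In this model, queuing release actions for two held jobs, erasing one job from the map, and then draining the queue applies one action and skips the other, rather than dereferencing a job that no longer exists.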

Activity

Jaime Frey
March 2, 2021, 7:35 PM

I waffled on the debug level. I’ll change it to D_ALWAYS. Also, I’ll tweak the version history to say this should be rare.

Greg Thain
March 2, 2021, 7:25 PM

CODE REVIEW: Looks good. Question – given that we want a better fix in the fullness of time, should the dprintf be D_ALWAYS? Also, should the version history stress that this is a longstanding and, we hope, rare bug?

Jaime Frey
March 2, 2021, 5:01 PM

For now, add an early-out in Scheduler::actOnJobMyselfHandler() if the job is not in the queue.

Time remaining

0m

Assignee

Jaime Frey