Schedd restart reconnect counts can be off

Description

When the schedd restarts, it keeps a summary of its attempts to reconnect to running jobs. The summary appears in a restart report and in statistics in the schedd's ad. In the summary, the count of reconnects still being attempted can be wrong. We've seen cases in CHTC where the schedd still reports one reconnect in progress a week after the restart.

Currently, the count of reconnects being attempted is incremented when the job queue is initialized and decremented each time a shadow indicates that a reconnect succeeded or failed. If we miss a case where a decrement should happen, the count will never reach zero. Ideally, we’d recompute the count based on active records. But that gets a little complicated, as reconnects progress through several queues (jobsToReconnect, RunnableJobQueue, aboutToSpawnJobHandler worker thread, shadowsByProcID), and the dedicated scheduler has its own separate path to traverse. So for now, we will just shore up the code paths where the increments and decrements can get out of sync.
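
For illustration only, here is a minimal sketch of the "recompute from active records" idea that this ticket defers. It is not the schedd's code: ReconnectTracker, JobId, beginReconnect(), and finishReconnect() are hypothetical names, and the sketch assumes every pending reconnect can be keyed by a (cluster, proc) pair in one place, which is exactly what the multiple queues make hard in practice. The pending count is derived from the set of outstanding jobs, so a duplicate or unexpected report cannot drive it out of step.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

// Hypothetical stand-in for a job's (cluster, proc) identifier.
using JobId = std::pair<int, int>;

// Hypothetical tracker: the pending count is the size of the set of
// outstanding jobs, not a separately maintained integer.
class ReconnectTracker {
public:
    // Called once per running job when the job queue is initialized.
    void beginReconnect(const JobId &id) { pending_.insert(id); }

    // Called when a shadow reports that a reconnect succeeded or failed.
    // Returns true only the first time a given job is resolved; a repeat
    // or unknown report is ignored and leaves all counts unchanged.
    bool finishReconnect(const JobId &id, bool succeeded) {
        if (pending_.erase(id) == 0) {
            return false;
        }
        if (succeeded) {
            ++succeeded_;
        } else {
            ++failed_;
        }
        // Invariant: every job ever started is pending, succeeded, or failed.
        assert(pending_.size() + succeeded_ + failed_ >= 1);
        return true;
    }

    std::size_t pending() const { return pending_.size(); }
    std::size_t succeeded() const { return succeeded_; }
    std::size_t failed() const { return failed_; }

private:
    std::set<JobId> pending_;
    std::size_t succeeded_ = 0;
    std::size_t failed_ = 0;
};
```

With this shape, a missed report still leaves a stale entry in the set, but the entry names the specific job, which makes a stuck reconnect easier to find than a bare counter that never reaches zero.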

Activity

Jaime Frey
March 4, 2021, 1:38 AM

I added a comment in the code. I don’t think we need a dprintf().

Todd L Miller
March 3, 2021, 11:27 PM
Edited

Code Review

Everything – after explanation – looks good to me except in jobExitCode()'s handling of JOB_RECONNECT_FAILED, where maybe we should tell someone if we think the shadow is buggy (because the inner conditional failed); but this is not a blocker.

Due date

None

Time remaining

0m

Assignee

Jaime Frey

Is PATh development

None

Fix versions

Priority

Minor

HTCondorCustomerGroup

CHTC

Components

Reporter

Jaime Frey