Schedd restart reconnect counts can be off
When the schedd restarts, it keeps a summary of its attempts to reconnect to running jobs. The summary ends up in a restart report and in statistics in the schedd's ad. In the summary, the count of reconnects still being attempted can be wrong. We've seen cases in CHTC where the schedd still reports one reconnect in progress a week later.
Currently, the count of reconnects being attempted is incremented when the job queue is initialized and decremented each time a shadow indicates that a reconnect succeeded or failed. If we miss a case where a decrement should happen, the count will never reach zero. Ideally, we’d recompute the count from the active records. But that gets a little complicated, as reconnects progress through several queues (jobsToReconnect, RunnableJobQueue, aboutToSpawnJobHandler worker thread, shadowsByProcID), and the dedicated scheduler has its own separate path to traverse. So for now, we will just shore up the code paths where the increments and decrements can get out of sync.
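As a rough illustration of the "recompute from active records" idea, here is a minimal sketch. It is not the schedd's code: ReconnectTracker, JobId, begin(), finish(), and inProgress() are hypothetical names, and the several real queues are collapsed into a single set for simplicity. The point is only that a count derived from the set of pending jobs cannot go negative on a duplicate report, and a stuck entry can be traced back to a specific job.

```cpp
#include <cstddef>
#include <set>
#include <utility>

// Hypothetical stand-in for a job's (cluster, proc) id.
using JobId = std::pair<int, int>;

// Hypothetical tracker; an assumption for illustration, not the
// schedd's actual data structure.
class ReconnectTracker {
public:
    // Record a reconnect attempt, e.g. when the job queue is initialized.
    void begin(const JobId& job) { pending_.insert(job); }

    // Record the end of an attempt (success or failure).
    // Returns false if the job was not pending, so a duplicate or
    // unexpected report is detectable and cannot skew the count.
    bool finish(const JobId& job) { return pending_.erase(job) > 0; }

    // The count is derived from the records rather than stored separately.
    std::size_t inProgress() const { return pending_.size(); }

    // Jobs still pending; useful for finding which reconnect is stuck.
    const std::set<JobId>& pending() const { return pending_; }

private:
    std::set<JobId> pending_;
};
```

With a bare integer counter, a missed decrement leaves only a nonzero number behind; with something like the above, the leftover entry at least names the job whose code path never reported back.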
I added a comment in the code. I don’t think we need a dprintf().
After the explanation, everything looks good to me except jobExitCode()'s handling of JOB_RECONNECT_FAILED: if the inner conditional fails, that suggests the shadow is buggy, and maybe we should tell someone. This is not a blocker.