Generate APEL blah/batch records using PER_JOB_HISTORY_DIR
HTCondor-CE users often have to upload accounting records to the grid that they belong to: OSG sites use Gratia to upload records to the GRACC, and EGI/WLCG sites upload APEL records. These grid records are generally constructed from a combination of the CE job ad and information from the local batch system. HTCondor-CE includes the htcondor-ce-apel package, which provides the scripts and systemd files used to generate and upload the APEL records. APEL records are created once per day through https://github.com/htcondor/htcondor-ce/blob/V5-branch/contrib/apelscripts/condor_ce_apel.sh#L7-L8
The current script setup results in a 5-10% loss of jobs reported to APEL.
This is because the scripts use condor_history with time-based constraints to construct the “blah” and “batch” records.
However, time-based history constraints are not guaranteed to pull ads for all jobs, depending on when jobs are added to the history file.
Base your work on V5-branch and submit a PR when it’s ready for review.
Similar to the OSG’s method with Gratia, we should set PER_JOB_HISTORY_DIR = /var/lib/condor/history in the HTCondor config laid down by htcondor-ce-apel and change condor_batch.sh and condor_blah.sh as follows:
Merge them into a single script (they both use condor_history, so they’re grabbing all their info from the local job ad)
Parse each file in PER_JOB_HISTORY_DIR and, upon success, remove the history file. If there are errors parsing the file, move it into /var/lib/condor/history/quarantine
Update the systemd timer to run more frequently: say, once an hour instead of once a day
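The parse-then-remove-or-quarantine flow described above could be sketched roughly as follows. This is only an illustration, not the script in the repository: the function name process_history_dir and the idea of passing a handler command are assumptions made for the example, and the handler stands in for whatever code ends up writing the batch/blah records.

```bash
#!/bin/bash
# Sketch only: function and variable names here are illustrative assumptions.

# process_history_dir HISTORY_DIR HANDLER
# Runs HANDLER on every regular file directly inside HISTORY_DIR.
# Files the handler accepts (exit 0) are removed; files it rejects are
# moved into HISTORY_DIR/quarantine for later inspection.
process_history_dir() {
    local history_dir=$1 handler=$2
    local quarantine_dir="$history_dir/quarantine"
    local ad

    mkdir -p "$quarantine_dir" || return 1

    for ad in "$history_dir"/*; do
        # Skips the quarantine directory itself and an unmatched glob.
        [ -f "$ad" ] || continue
        if "$handler" "$ad"; then
            rm -f -- "$ad"
        else
            mv -- "$ad" "$quarantine_dir/"
        fi
    done
}
```

A run might look like `process_history_dir /var/lib/condor/history write_apel_records`, where write_apel_records is a hypothetical function that appends the records for one job ad and returns non-zero if it cannot. Because a file is removed only after its handler succeeds, a crash mid-run leaves unprocessed ads in place for the next timer invocation.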
Comments addressed. LGTM
I’d quarantine a history file if you fail to get any of the attributes required to create the batch/blah records
Question about “If there are errors parsing the file, move it into /var/lib/condor/history/quarantine”. How exhaustive does the error checking need to be? Right now I’m awking for a ClusterId value, and if it can’t find one I’m putting the file into quarantine. Does that seem sufficient?
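One way to extend the ClusterId check to every attribute the records need is sketched below. It assumes the per-job history files contain one `Attribute = value` pair per line; the function name has_required_attrs is made up for this example, and the attribute list is left to the caller since the exact set the batch/blah records require is not spelled out here.

```bash
# Sketch only: has_required_attrs is a hypothetical helper name.

# has_required_attrs FILE ATTR...
# Succeeds only if every ATTR appears in FILE as "ATTR = value" with a
# non-empty value. Attribute names are compared case-insensitively,
# as ClassAd attribute names are case-insensitive.
has_required_attrs() {
    local file=$1; shift
    local attr
    for attr in "$@"; do
        awk -v name="$attr" '
            index($0, " = ") > 0 {
                key = substr($0, 1, index($0, " = ") - 1)
                val = substr($0, index($0, " = ") + 3)
                if (tolower(key) == tolower(name) && val != "") found = 1
            }
            END { exit !found }
        ' "$file" || return 1
    done
}
```

A caller could then quarantine on the first missing attribute, for example `has_required_attrs "$ad" ClusterId Owner || mv "$ad" "$quarantine_dir/"`, with the attribute names replaced by whatever the record format actually needs.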
Hey I believe that the apelclient expects two files so we’ll want to keep that behavior. I don’t know the details of the file names it expects – could you ask Max in the original GitHub issue?
Important question: what do we want the output from this new script to look like? Previously we generated two separate output files (one for batch records, another for BLAH records); now that everything is unified in a single script, do we only want a single output file per script run? Also, now that we’ll be gathering data on an hourly basis, I’m assuming we should include the hour in the filename timestamp? Should output files be named something like apel-20210311-1300-lhcb-ce?
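If the two-file behavior is kept, hour-stamped names in the style suggested above could be built as in this sketch. The batch- and blah- prefixes are assumptions for illustration only (the names the APEL client actually expects are still an open question in this thread), apel_output_names is a hypothetical helper, and `date -d @N` relies on GNU date.

```bash
# Sketch only: the batch-/blah- prefixes and the function name are assumptions.

# apel_output_names OUTPUT_DIR [HOST] [EPOCH_SECONDS]
# Prints two paths, one per line: the batch file, then the blah file.
# The timestamp is truncated to the hour (UTC), e.g. 20210311-1300.
apel_output_names() {
    local output_dir=$1
    local host=${2:-$(hostname -s)}
    local when=${3:-$(date -u +%s)}
    local stamp
    # GNU date: "-d @N" interprets N as seconds since the epoch.
    stamp=$(date -u -d "@$when" +%Y%m%d-%H00) || return 1
    printf '%s/batch-%s-%s\n' "$output_dir" "$stamp" "$host"
    printf '%s/blah-%s-%s\n' "$output_dir" "$stamp" "$host"
}
```

Truncating to the hour means two runs within the same hour resolve to the same pair of files, so the script would need to append rather than overwrite if the timer can fire more than once per hour.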