Generate APEL blah/batch records using PER_JOB_HISTORY_DIR

Description

Background

HTCondor-CE users often have to upload accounting records to the grid that they belong to: OSG sites use Gratia to upload records to the GRACC, and EGI/WLCG sites upload APEL records. These grid records are generally constructed from a combination of the CE job ad and information from the local batch system. HTCondor-CE provides the htcondor-ce-apel package, which contains the scripts and systemd files used to generate and upload the APEL records. APEL records are currently created once per day by https://github.com/htcondor/htcondor-ce/blob/V5-branch/contrib/apelscripts/condor_ce_apel.sh#L7-L8

Problem

With the current script setup, 5-10% of jobs are never reported to APEL, as described in:

This is because the scripts use condor_history with time-based constraints to construct the “blah” and “batch” records:

However, time-based history constraints are not guaranteed to pull ads for all jobs, depending on when jobs are added to the history file.

Proposed Fix

Base your work on V5-branch and submit a PR when it’s ready for review.

Similar to the OSG’s method with Gratia, we should set PER_JOB_HISTORY_DIR = /var/lib/condor/history in the HTCondor config laid down by htcondor-ce-apel and change condor_batch.sh and condor_blah.sh so that:

  1. They are merged into a single script (both use condor_history, so they get all of their information from the local job ad)

  2. The merged script parses each file in PER_JOB_HISTORY_DIR and, upon success, removes the history file. If there are errors parsing a file, it moves the file into /var/lib/condor/history/quarantine

  3. The systemd timer runs more frequently: say, once an hour instead of once a day
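
The parse / remove / quarantine loop in item 2 could look roughly like the sketch below. This is only an illustration, not the script that ships with htcondor-ce-apel: the function names, the "Attr = value" line parsing, and the REQUIRED_ATTRS list are all assumptions made for the example, and the real batch/blah records will need more attributes than the three shown here.

```python
import shutil
from pathlib import Path

# Hypothetical minimum set of attributes; the real records need more.
REQUIRED_ATTRS = ("ClusterId", "ProcId", "Owner")


def parse_job_ad(text):
    """Parse 'Attr = value' lines from a per-job history file into a dict."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:
            # Drop surrounding double quotes from simple string values
            ad[name.strip()] = value.strip().strip('"')
    return ad


def process_history_dir(history_dir):
    """Return the parsed ads; remove good files, quarantine bad ones."""
    history_dir = Path(history_dir)
    quarantine = history_dir / "quarantine"
    records = []
    for path in sorted(history_dir.iterdir()):
        if not path.is_file():
            continue  # skips the quarantine directory itself
        try:
            ad = parse_job_ad(path.read_text())
            missing = [attr for attr in REQUIRED_ATTRS if attr not in ad]
            if missing:
                raise ValueError(f"{path.name} is missing {missing}")
        except (OSError, ValueError):
            # Unreadable or incomplete ad: keep the file for inspection
            quarantine.mkdir(exist_ok=True)
            shutil.move(str(path), str(quarantine / path.name))
            continue
        records.append(ad)
        # A real script should only remove the file once the
        # corresponding batch/blah records have been written out.
        path.unlink()
    return records
```

Because every completed job gets its own file and a file is only removed after it has been handled, a job that shows up late is picked up on the next run instead of falling outside a time window.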

Activity

Brian Lin
March 29, 2021, 8:46 PM

Code Review

Comments addressed in and . LGTM

Brian Lin
March 18, 2021, 8:25 PM

I’d quarantine a history file if you fail to get any of the attributes required to create the batch/blah records

Mark Coatsworth
March 18, 2021, 8:15 PM

Question about “If there are errors parsing the file, move it into /var/lib/condor/history/quarantine”. How exhaustive does the error checking need to be? Right now I’m awking for a ClusterId value, and if it can’t find one I’m putting the file into quarantine. Does that seem sufficient?

Brian Lin
March 11, 2021, 7:36 PM

Hey I believe that the apelclient expects two files so we’ll want to keep that behavior. I don’t know the details of the file names it expects – could you ask Max in the original GitHub issue?

Mark Coatsworth
March 11, 2021, 7:34 PM

Important question: what do we want the output from this new script to look like? Previously we generated two separate output files (one for batch records, another for BLAH records); now that everything is unified in a single script, do we only want a single output file per script run? Also, now that we’ll be gathering data on an hourly basis, I’m assuming we should include the hour in the filename timestamp? Output files should be named something like apel-20210311-1300-lhcb-ce?

Time remaining

0m

Assignee

Brian Lin

Is PATh development

Yes