Generate pilot records for OSG pilot containers

Description

Syracuse has been using a Docker container that they spin up on their worker nodes in lieu of accepting pilot jobs through the GlideinWMS factory -> CE -> batch system workflow. These containers spin up and act like pilot jobs, reporting back to the respective VO pool, with one catch: we're not producing pilot records for the contributions of these containers!

Unfortunately, these containers can be stopped by the admin at any time, so it's tough to capture all of the usage, but we can try to capture at least some of it by having the container upload a "pilot" (BatchPilot?) record every ~4 hrs.
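
A rough sketch of the periodic-upload idea, assuming a fixed 4-hour window: usage is split into completed windows and one record is built per window, so a container that is stopped abruptly loses at most the last partial window. The record fields, the "BatchPilot" value, and both function names are illustrative assumptions, not the actual gratia-probe schema or API.

```python
REPORT_INTERVAL = 4 * 60 * 60  # seconds; the "~4 hrs" proposed above


def completed_windows(container_start, now, interval=REPORT_INTERVAL):
    """Return the (start, end) reporting windows fully elapsed as of `now`."""
    windows = []
    start = container_start
    while start + interval <= now:
        windows.append((start, start + interval))
        start += interval
    return windows


def make_pilot_record(start, end, cores, site_name):
    """Build a plain dict for one window (hypothetical field names)."""
    return {
        "ResourceType": "BatchPilot",  # assumption: the ticket leaves this open
        "SiteName": site_name,
        "StartTime": start,
        "EndTime": end,
        "WallDuration": end - start,  # seconds of wall time in this window
        "Processors": cores,
    }
```

For a container that has been up for 9 hours, two full windows would be reported and the trailing hour would be picked up only if the container survives to the next upload.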

Design doc:

Freshdesk Tickets

None

Activity

Brian Lin
April 14, 2021, 3:40 PM

Pushing this through to RFR along with the other tickets

Carl Edquist
March 11, 2021, 7:52 PM

PR merged.

Promotions
Promoted gratia-probe-1.23.1-1 to osg-3.5-el*-testing

Build                              Tag
gratia-probe-1.23.1-1.osg35.el7    osg-3.5-el7-testing
gratia-probe-1.23.1-1.osg35.el8    osg-3.5-el8-testing

Carl Edquist
November 20, 2020, 7:11 PM

For now I made a small fix to report SiteName properly (previously it had been broken and commented out, since it was trying to send "Site" instead of "SiteName"). Distinct SiteNames are starting to show up now (e.g. "ISI" instead of the generic "OSG Pilot Container Probe").

https://gracc.opensciencegrid.org/kibana/goto/86d60cb2ddcdbb01af5d605c62187fa4

It's not clear to me what we want for the fqdn part of the ProbeName, or if we even care.

Carl Edquist
November 20, 2020, 6:28 PM

I've worked through a number of issues and finally got this going in the tiger osg dev instance; records have begun showing up in gracc:

https://gracc.opensciencegrid.org/kibana/goto/5d0d7a2a64451846520b96c5029fb33f

Notably, the SiteName is generic (OSG Pilot Container Probe) and the fqdn part of the ProbeName is a fixed random string (the randomly generated hostname for the container).

would like to see a multi-site variant, with the Probe/SiteName set separately for each site. He pointed to an example in the condor probe, where the probe is re-initialized in a forked process for each alternate site:

https://github.com/opensciencegrid/gratia-probe/blob/05252b57a958a4677601448b08a22241c228b03b/condor/condor_meter#L883-L946
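
For illustration only, the fork-per-site pattern could look roughly like the following. This is not the condor_meter code linked above: `send_records_for_site` is a hypothetical stand-in for re-initializing the probe with a site-specific SiteName/ProbeName and sending that site's records, and the record dicts are an assumed shape.

```python
import os


def send_records_for_site(site_name, records):
    """Hypothetical stand-in: re-initialize the probe for one site and send."""
    # A real probe would reload its configuration here with the site-specific
    # names before sending; this sketch only counts the matching records.
    return sum(1 for r in records if r["SiteName"] == site_name)


def report_per_site(records):
    """Fork one child per distinct SiteName; return {site: child exit code}."""
    results = {}
    for site in sorted({r["SiteName"] for r in records}):
        pid = os.fork()
        if pid == 0:
            # Child process: any probe state set here cannot leak into the
            # parent or into the children handling other sites.
            try:
                sent = send_records_for_site(site, records)
                os._exit(0 if sent > 0 else 1)
            except Exception:
                os._exit(2)
        # Parent: wait for this site's child before moving on to the next.
        _, status = os.waitpid(pid, 0)
        results[site] = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return results
```

Forking keeps each site's probe initialization isolated, at the cost of requiring a POSIX platform (`os.fork` is not available on Windows).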

Carl Edquist
October 22, 2020, 2:26 PM
Fixed

Assignee

Tim Theisen

Reporter

Brian Lin

Priority

Critical

Fix versions

Components