Improve utilization of 2019 CC* GPU resources

Description

2019 CC* sites that we expect to have GPU resources available for use:

Expected: AMNH, Clarkson, Montana, Notre Dame, TCNJ, Wayne State
Reporting to Miron Dashboard: Notre Dame, Wayne State
Reporting to GRACC: Notre Dame, TCNJ, Wayne State

For the following sites, we are waiting on action from the site admin:

Site     | Reason                                     | Last Contact
AMNH     | Admin needs to create separate GPU queue   | August 13, 2020
Clarkson | Admin needs to add GPU nodes. No timeline. | July 31, 2020
Montana  | Admin needs to set up new cluster.         | June 19, 2020

We have identified 3 different problems with sites missing from the Miron dashboard:

  1. Missing GPU configuration in the CEs and Factory (AMNH, Clarkson): Operations will reach out to site admins and ask about GPU resources available to the OSG.

  2. Topology and factory configuration mismatch (TCNJ, Wayne State): We tag CC* resources at the CE level in Topology, but because of the configuration mismatch, TCNJ and Wayne State (WSU) records are only associated with their site. The Miron dashboard looks for CC* hours based on the CE, so these records are not counted.

  3. Potential lack of GPU job pressure (TCNJ): We have verified that the factory and CE are configured to request GPUs and that pilots reporting to the Open Science pool are advertising their GPUs. However, there is a noticeable drop in job pressure within the OSG VO starting in May.
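The mismatch in problem 2 can be illustrated with a minimal sketch. This is not the dashboard's actual code: the record layout (`site`, `ce`, `gpu_hours`) and all names and values below are hypothetical, chosen only to show why a CE-based lookup misses records that carry only a site name.

```python
def cc_star_gpu_hours(records, cc_star_ces):
    """Sum GPU hours for records whose CE is tagged as CC*.

    Hypothetical record layout: each record is a dict with a "site",
    an optional "ce", and "gpu_hours". Records without a matching CE
    are skipped, mirroring a lookup keyed on the CE.
    """
    return sum(r["gpu_hours"] for r in records if r.get("ce") in cc_star_ces)


# Hypothetical data: the second record is associated only with its site,
# so a CE-based lookup does not count its hours.
example_records = [
    {"site": "SITE-A", "ce": "SITE-A-CE1", "gpu_hours": 10.0},
    {"site": "SITE-B", "ce": None, "gpu_hours": 25.0},
]
example_cc_star_ces = {"SITE-A-CE1", "SITE-B-CE1"}

counted = cc_star_gpu_hours(example_records, example_cc_star_ces)
```

Here only the first record's hours are counted, even though both sites have a CC*-tagged CE.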

Separately, we’d like to improve the monitoring of CC* GPU resources by advertising total running, idle, and held GPU jobs to the central collector by implementing the following:
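As a rough illustration of the totals involved, the sketch below tallies running, idle, and held GPU jobs from a list of job ads represented as plain dictionaries. It assumes each ad carries the standard HTCondor `JobStatus` code (1 = idle, 2 = running, 5 = held) and a numeric `RequestGPUs`; the function name is hypothetical, and how the totals would be advertised to the central collector is not shown.

```python
# HTCondor JobStatus codes for the three states of interest.
JOB_STATUS_NAMES = {1: "idle", 2: "running", 5: "held"}


def summarize_gpu_jobs(job_ads):
    """Count idle, running, and held jobs that request at least one GPU.

    job_ads is an iterable of dict-like job ads. Assumption: RequestGPUs
    is already a number (in a real ad it could be an expression).
    """
    totals = {name: 0 for name in JOB_STATUS_NAMES.values()}
    for ad in job_ads:
        # Skip jobs that do not ask for a GPU.
        if ad.get("RequestGPUs", 0) <= 0:
            continue
        name = JOB_STATUS_NAMES.get(ad.get("JobStatus"))
        # Ignore other states (e.g. completed or removed).
        if name is not None:
            totals[name] += 1
    return totals
```

Keeping the tally separate from the query and advertising steps means the same function could be fed ads from any source, such as a schedd query or a saved dump.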

Freshdesk Tickets: None
Assignee: Brian Lin
Reporter: Brian Lin
Priority: Major
Labels: None
Components: None
Due date: 2020/09/30
Epic Name: 2019 CC* GPUs