condor_gpu_discovery should discover multi-instance GPUs (MIGs)

Description

NVIDIA's new A100 is a multi-instance GPU (MIG). HTCondor should Just Work(TM) with an A100 in one of its homogeneous partition modes, but condor_gpu_discovery can't currently detect MIGs when the GPU is operating in a partitioned mode. The fix is relatively straightforward: use the new NVML-based API for MIG device detection, and use NVML APIs to get the information we currently get from CUDA APIs. It is complicated by how we match CUDA devices to the corresponding NVML devices when the `-dynamic` flag is passed, which is by PCI bus ID. Presumably, different instances of the same GPU share a PCI bus ID, so we'll have to switch over to the GUIDs.
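
To illustrate the matching problem, here is a minimal sketch. It assumes, as presumed above, that instances of one GPU report the same PCI bus ID. The `Device` record, the helper functions, and every value in the inventory are hypothetical placeholders for illustration; none of it is condor_gpu_discovery code or real NVML/CUDA output.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical record for one enumerated device (not an HTCondor type)."""
    name: str
    pci_bus_id: str
    guid: str


def index_by(devices, key):
    """Group devices by the given attribute name."""
    groups = defaultdict(list)
    for dev in devices:
        groups[getattr(dev, key)].append(dev)
    return dict(groups)


def ambiguous_keys(devices, key):
    """Return the key values that are shared by more than one device."""
    return sorted(k for k, devs in index_by(devices, key).items() if len(devs) > 1)


# Made-up inventory: one A100 split into two instances, plus one ordinary GPU.
# Bus IDs and identifiers are placeholders, not real device output.
devices = [
    Device("A100 instance 0", "0000:3B:00.0", "MIG-example-0"),
    Device("A100 instance 1", "0000:3B:00.0", "MIG-example-1"),
    Device("Ordinary GPU", "0000:AF:00.0", "GPU-example-2"),
]

# Under the stated assumption, the bus ID cannot tell the two instances apart,
# while the per-instance identifier can.
print(ambiguous_keys(devices, "pci_bus_id"))
print(ambiguous_keys(devices, "guid"))
```

In this toy inventory the PCI bus ID maps to two devices, so a bus-ID lookup could pair a CUDA device with the wrong instance, while the per-instance identifier stays one-to-one. Whether real devices behave this way would still need to be confirmed on an actual A100.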

Activity

John (TJ) Knoeller
January 14, 2021, 7:59 PM

Code review: looks good. Ship it.

Todd L Miller
January 14, 2021, 2:25 PM

OK, unknown error codes now printed. Resolving.

Todd L Miller
January 13, 2021, 9:02 PM

… it’s not called in as many places as I was expecting, but you’re right; it looks like it’s always called with an explicit error argument, and in a formatted string with that error argument preceding it.

John (TJ) Knoeller
January 13, 2021, 8:59 PM

I think that one is usually used in conjunction with printing the error code separately somewhere else. It’s a strerror() replacement.

Todd L Miller
January 13, 2021, 8:51 PM

OK. Not printing the error code is from the original source; should that be changed, as well?

Due date

2021/01/15

Time remaining

0m

Assignee

Todd L Miller