condor_gpu_discovery should discover multi-instance GPUs (MIGs)
NVIDIA's new A100 is a multi-instance GPU (MIG). HTCondor should Just Work(TM) with an A100 in one of its homogeneous partition modes, but condor_gpu_discovery currently can't detect MIGs when the GPU is operating in a partitioned mode. The fix is relatively straightforward – use the new NVML-based API for MIG device detection and use NVML APIs to get the information we currently get from CUDA APIs – but is complicated by how we match CUDA devices to the corresponding NVML devices when the `-dynamic` flag is passed, which is by PCI bus ID. Presumably, different instances of the same GPU share a PCI bus ID, so we'll have to switch over to the GUIDs.
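To illustrate the matching problem, here is a minimal sketch that assumes the CUDA-side and NVML-side records each carry both a PCI bus ID and a unique ID. The struct names, field names, and the functions `indexByUuid` and `findNvmlMatch` are hypothetical and are not taken from condor_gpu_discovery; no real CUDA or NVML calls are made. It shows only why keying the lookup on the unique ID keeps two instances apart when they report the same bus ID.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical record types for illustration only; the real tool's
// data structures may look quite different.
struct CudaDevice {
    int index;
    std::string pciBusId;
    std::string uuid;
};

struct NvmlDevice {
    std::string pciBusId;
    std::string uuid;
    std::string name;
};

// Build a lookup keyed on the unique ID instead of the PCI bus ID.
// If the key were pciBusId, instances that share a bus ID would
// collide and only one of them would survive in the map.
std::map<std::string, NvmlDevice>
indexByUuid(const std::vector<NvmlDevice>& devices) {
    std::map<std::string, NvmlDevice> byUuid;
    for (const auto& d : devices) {
        // emplace() keeps the first entry if a unique ID repeats.
        byUuid.emplace(d.uuid, d);
    }
    return byUuid;
}

// Return the NVML record matching a CUDA device, or nullptr if none.
const NvmlDevice*
findNvmlMatch(const CudaDevice& cuda,
              const std::map<std::string, NvmlDevice>& byUuid) {
    auto it = byUuid.find(cuda.uuid);
    return it == byUuid.end() ? nullptr : &it->second;
}
```

With two made-up instances that report the same bus ID but different unique IDs, both stay in the map and each CUDA device resolves to its own NVML record.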
Code review: looks good. Ship it.
OK, unknown error codes now printed. Resolving.
… it’s not called in as many places as I was expecting, but you’re right; it looks like it’s always called with an explicit error argument, and in a formatted string with that error argument preceding it.
I think that one is usually used in conjunction with printing the error code separately somewhere else. It’s a strerror() replacement.
OK. Not printing the error code is from the original source; should that be changed, as well?
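For reference, a strerror()-style helper of the kind discussed in this thread might look roughly like the sketch below. The enum, its values, and the name `demoErrorString` are hypothetical and are not the actual function under review. Known codes map to a message only, on the assumption that the caller prints the numeric code separately, while unknown codes include the number so it is not lost.

```cpp
#include <string>

// Hypothetical error codes for illustration; not a real API's values.
enum DemoError {
    DEMO_SUCCESS = 0,
    DEMO_NOT_FOUND = 1,
    DEMO_NO_PERMISSION = 2
};

// strerror()-style helper: known codes return a message without the
// number; unknown codes return a message that includes the number.
std::string demoErrorString(int code) {
    switch (code) {
        case DEMO_SUCCESS:       return "success";
        case DEMO_NOT_FOUND:     return "not found";
        case DEMO_NO_PERMISSION: return "insufficient permissions";
        default:                 return "unknown error " + std::to_string(code);
    }
}
```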