Fixed
Details
Time tracking
8.75h logged
Assignee
Todd L Miller
Fix versions
Priority
Minor
HTCondorCustomerGroup
Other
Components
Reporter
Todd L Miller
Created August 3, 2021 at 2:26 PM
Updated September 1, 2021 at 7:24 PM
Resolved August 19, 2021 at 5:24 PM
Igor Sfiligoi, for IceCube, took a look at the current state of MIG support and had problems trying to run glide-ins. He reports two specific issues with condor_gpu_discovery (aside from the unreleased fix for looking at MIG devices with fewer than seven partitions):

1. The device name template GPU-MIG-<UUID> is not recognized (by clinfo). I'm not sure where this construction is coming from; it's possible it's a typo somewhere of MIG-GPU-<UUID>/<gpu instance ID>/<compute instance ID>. Newer device drivers have changed the template to just MIG-<UUID>, which may be contributing to the problem.
2. Unlike all other device names, MIG device names must include the entire UUID. This is relevant because something, possibly the glide-in, is truncating names set in CUDA_VISIBLE_DEVICES even when condor_gpu_discovery is passed the -uuid option, which otherwise does produce names with the complete UUID.

Also, maybe the base MIG device is reported as existing? (Observed problem: the first slot works but none of the other ones do.) Not sure how this was missed; it may be a new driver version issue.
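To make the naming problem easier to check on a worker node, here is a minimal Python sketch that classifies each entry of a CUDA_VISIBLE_DEVICES value against the templates mentioned above. It is not how condor_gpu_discovery or the glide-in handles names; the function names and category labels are hypothetical, and the sketch assumes UUIDs use the canonical 8-4-4-4-12 hexadecimal layout.

```python
import re
from typing import List, Tuple

# Assumption: UUIDs follow the canonical 8-4-4-4-12 hexadecimal layout.
_UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"

# Older template: MIG-GPU-<UUID>/<gpu instance ID>/<compute instance ID>
_LEGACY_MIG = re.compile(rf"^MIG-GPU-{_UUID}/\d+/\d+$")
# Newer template: MIG-<UUID>
_NEW_MIG = re.compile(rf"^MIG-{_UUID}$")


def classify_device_name(name: str) -> str:
    """Return a (hypothetical) category label for one device name."""
    if _LEGACY_MIG.match(name):
        return "mig-legacy"
    if _NEW_MIG.match(name):
        return "mig"
    if name.startswith("GPU-MIG-"):
        # The unrecognized construction reported in this ticket.
        return "unrecognized-mig"
    if name.startswith("MIG-"):
        # Looks like a MIG name but lacks the complete UUID (or instance IDs).
        return "mig-truncated"
    # Non-MIG names may legitimately be shortened, so they are not checked.
    return "other"


def check_visible_devices(value: str) -> List[Tuple[str, str]]:
    """Classify each comma-separated entry of a CUDA_VISIBLE_DEVICES value."""
    return [(n, classify_device_name(n)) for n in value.split(",") if n]
```

Running check_visible_devices on the value the job actually sees (for example, os.environ.get("CUDA_VISIBLE_DEVICES", "")) and looking for "mig-truncated" or "unrecognized-mig" entries would show whether a name was shortened or malformed somewhere between discovery and the job.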