Problems with MIG and CUDA_VISIBLE_DEVICES

Description

Igor Sfiligoi, for IceCube, took a look at the current state of MIG support and had problems trying to run glide-ins. He reports two specific issues with condor_gpu_discovery (aside from the unreleased fix for looking at MIG devices with fewer than seven partitions):

  • The device name template GPU-MIG-<UUID> is not recognized (by clinfo). I’m not sure where this construction is coming from; it’s possible it’s a typo somewhere of MIG-GPU-<UUID>/<gpu instance ID>/<compute instance ID>. Newer device drivers have changed the template to just MIG-<UUID>, which may be contributing to the problem.

  • Unlike all other device names, MIG device names must include the entire UUID. This is relevant because something, possibly the glide-in, is truncating names set in CUDA_VISIBLE_DEVICES even when condor_gpu_discovery is passed the -uuid option, which otherwise does produce names with the complete UUID.

  • Also, the base MIG device may be reported as existing. (Observed problem: the first slot works but none of the others do.) It is not clear how this was missed; it may be an issue with a new driver version.
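For reference, the three name templates discussed above, and the rule that MIG names must not be truncated, can be sketched as follows. This is only an illustration, not code from condor_gpu_discovery: the helper names (classify, shorten_for_env), the 8-4-4-4-12 hex UUID layout, and the 8-character short form are assumptions.

```python
import re

# Assumed UUID layout: 8-4-4-4-12 hex digits.
_HEX = r"[0-9a-fA-F]"
_UUID = rf"{_HEX}{{8}}-{_HEX}{{4}}-{_HEX}{{4}}-{_HEX}{{4}}-{_HEX}{{12}}"

# The templates mentioned in this ticket.
_PATTERNS = {
    "gpu": re.compile(rf"GPU-{_UUID}"),
    "mig-old": re.compile(rf"MIG-GPU-{_UUID}/\d+/\d+"),
    "mig-new": re.compile(rf"MIG-{_UUID}"),
}


def classify(name):
    """Return 'gpu', 'mig-old', 'mig-new', or None if no template matches."""
    for kind, pattern in _PATTERNS.items():
        if pattern.fullmatch(name):
            return kind
    return None


def shorten_for_env(name, keep=8):
    """Shorten a plain GPU name to a UUID prefix; leave everything else alone.

    MIG names (old or new style) are returned unchanged because they
    must carry the entire UUID.
    """
    if classify(name) == "gpu":
        return name[: len("GPU-") + keep]
    return name
```

Under these assumptions, a name of the form GPU-MIG-<UUID> matches none of the templates, which is consistent with it not being recognized.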

Activity


John (TJ) Knoeller August 19, 2021 at 5:16 PM

Code review: looks good. I like having a function to get the parent GPUids for the MIG devices that are present.

Todd L Miller August 19, 2021 at 4:21 PM

Looks like the patch works fine with driver version 450, and that the old-style MIG identifiers work in CUDA_VISIBLE_DEVICES there.

Todd L Miller August 6, 2021 at 7:23 PM

It looks to me like the places we use GPU- prefix after my patches all do so correctly.

This branch needs to be tested against driver version 450 or 460 to make sure that the new code to suppress output about the MIG parent device(s) works. We should also double-check to make sure that setting CUDA_VISIBLE_DEVICES to one of the old-style MIG identifiers works (run nvidia-smi to make sure the gpu spinner is using the correct GPU).

Todd L Miller August 6, 2021 at 4:56 PM

We look for the GPU- prefix in a lot of places. Make sure they all remain appropriate, or add the MIG- prefix as another match as necessary.
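A minimal sketch of the kind of check this implies, with MIG- added as a second accepted prefix. The helper name has_device_prefix is hypothetical and the real call sites in the code base may need something different.

```python
# Prefixes that mark a UUID-style device name (assumed set, per this ticket).
DEVICE_PREFIXES = ("GPU-", "MIG-")


def has_device_prefix(name):
    """True if the name starts with GPU- or MIG-.

    Old-style MIG names (MIG-GPU-...) and new-style ones (MIG-...)
    both start with MIG-, so a GPU--only prefix test misses them.
    """
    return name.startswith(DEVICE_PREFIXES)
```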

Todd L Miller August 4, 2021 at 7:49 PM

If the base MIG device is indeed being reported, that may just be a consequence of new-style identifiers for MIG instances, since we can no longer count on the MIG instance names to tell us which MIG the instance is a part of (the code in condor_gpu_discovery to suppress reporting parent MIG device(s) depends on this).
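To illustrate the suspected failure mode: with old-style names the parent GPU can be read out of the instance name, but with new-style names it cannot, so the parent has to come from somewhere else. The sketch below is an assumption-laden illustration, not the condor_gpu_discovery code; the function names and the explicit parent_of mapping are hypothetical.

```python
def parent_from_old_style(name):
    """Return the parent GPU id embedded in an old-style MIG name, else None.

    Old style is assumed to be MIG-GPU-<UUID>/<gpu instance>/<compute instance>.
    """
    if name.startswith("MIG-GPU-") and name.count("/") == 2:
        return name[len("MIG-"):].split("/")[0]
    return None


def suppress_parents(devices, parent_of=None):
    """Drop any device that is the parent of a MIG instance in the list.

    parent_of is an optional mapping from MIG name to parent GPU id,
    needed for new-style names that do not embed the parent.
    """
    parent_of = parent_of or {}
    parents = set()
    for name in devices:
        parent = parent_of.get(name) or parent_from_old_style(name)
        if parent:
            parents.add(parent)
    return [d for d in devices if d not in parents]
```

Without the explicit mapping, a new-style name such as MIG-<UUID> leaves the parent in the output, which would match the observed report of the base device.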

Fixed

Details

Time tracking

8.75h logged

HTCondorCustomerGroup

Other

Created August 3, 2021 at 2:26 PM
Updated September 1, 2021 at 7:24 PM
Resolved August 19, 2021 at 5:24 PM