Background#
Recently, I tried to set up a server for running AI in my home data center. Since I only have one machine that can accommodate the 3090 and buying a new one would be too expensive, I chose to virtualize it on ESXi and pass the GPU through directly.
The virtual machine configuration is as follows:
- AMD EPYC 7302, 48 vCPUs
- Based on ESXi-7.0U3 platform
- NVIDIA GeForce RTX 3090
- 128GB memory
- Debian GNU/Linux 12 (bookworm) x86_64
Main Process#
First, configure the virtual machine. As usual, set the following (see the sketch after this list):
- Lock all of the VM's memory
- Set hypervisor.cpuid.v0=FALSE
- Set pciPassthru0.msiEnabled=FALSE
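For reference, this is roughly how those two keys end up looking as advanced configuration parameters in the VM's .vmx file; memory locking itself is done with the "Reserve all guest memory (All locked)" checkbox rather than a key listed here:
hypervisor.cpuid.v0 = "FALSE"
pciPassthru0.msiEnabled = "FALSE"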
I won't go into too much detail, as you can find these operations in many places. Inside the virtual machine, first configure the apt sources to include the non-free-firmware component: in /etc/apt/sources.list, add non-free-firmware so that the line looks like this:
deb https://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
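If you'd rather make the change from the shell, a one-liner along these lines should work; this is only a sketch that assumes the stock Debian 12 entries and skips lines that already contain the component:
sudo sed -i '/non-free-firmware/! s/^\(deb .*bookworm.*main.*\)$/\1 non-free-firmware/' /etc/apt/sources.list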
Then refresh the package index and search for the driver:
sudo apt update
apt search ^nvidia-driver
If everything is normal, you should be able to find
nvidia-driver/unknown 545.23.06-1 amd64
NVIDIA metapackage
Issues#
Usually, on a physical machine you can simply run sudo apt install nvidia-driver, but this won't work in a virtual machine.
Let's first see what happens if you follow the physical-machine procedure and install it directly with apt.
First, install CUDA as usual:
sudo apt-get install software-properties-common
wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda-repo-debian12-12-3-local_12.3.0-545.23.06-1_amd64.deb
sudo dpkg -i cuda-repo-debian12-12-3-local_12.3.0-545.23.06-1_amd64.deb
sudo cp /var/cuda-repo-debian12-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3
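The toolkit lands in /usr/local/cuda-12.3 by default, so you may also want to put it on your PATH, following NVIDIA's usual post-install advice (optional, and the paths assume the default install location):
echo 'export PATH=/usr/local/cuda-12.3/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version   # should report release 12.3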
Then
sudo apt install nvidia-driver
Restart and check the GPU status
nvidia-smi
If it comes back empty, try detecting the GPU:
sudo apt install nvidia-detect
nvidia-detect
Then, you will encounter a confusing error
Detected NVIDIA GPUs:
1b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
Checking card: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
Uh oh. Your card is not supported by any driver version up to 545.23.06.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.
This makes no sense: there is no reason the latest driver would not support the 3090. Just to be safe, I checked the NVIDIA driver download page, which of course confirms that the 3090 is supported.
Solution#
While I was repeatedly restarting the virtual machine with no idea why this was happening, I suddenly spotted the key to the problem. I had been connecting to the VM over SSH, but when I rebooted it from VMRC, I saw this error message at the end of the boot process:
[ 12.699654] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 530.41.03 Thu Mar 16 19:48:20 UTC 2023
[ 12.762447] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 530.41.03 Thu Mar 16 19:23:04 UTC 2023
[ 12.871331] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[ 12.972022] ACPI Warning: \_SB.PCI0.PE50.S1F0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20210730/nsarguments-61)
[ 13.732645] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x26:0x56:1474)
[ 13.732697] BUG: unable to handle page fault for address: 0000000000004628
[ 13.732784] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
(The error message above is copied from the forum and may differ slightly from what you actually see.)
I searched on Google and found the answer.
In short, you need to install the open variant of the NVIDIA kernel module instead of the default one. The forum answer suggests installing from the .run file with the -m=kernel-open option; I'm not sure whether there is a deb package that can solve this problem.
Before applying this solution, you need to clean up the previous installation.
sudo nvidia-uninstall
sudo apt purge -y '^nvidia-*' '^libnvidia-*'
sudo rm -r /var/lib/dkms/nvidia
sudo apt -y autoremove
sudo update-initramfs -c -k `uname -r`
sudo update-grub2
sudo reboot
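Before reinstalling, it's worth confirming that nothing NVIDIA-related is left behind; this quick check is my addition, not part of the forum's instructions:
dpkg -l | grep -i nvidia    # should print nothing if the purge was complete
lsmod | grep nvidia         # should also be empty after the reboot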
Then download the .run-format driver from the NVIDIA driver download site and execute:
sudo ./NVIDIA-Linux-x86_64-525.116.04.run -m=kernel-open
sudo update-initramfs -u
sudo reboot
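To see which flavor of the kernel module actually got installed after the reboot, you can look at the version string the driver exposes; my assumption is that the open variant identifies itself as an open kernel module there:
cat /proc/driver/nvidia/version   # the open variant should mention an "Open Kernel Module"
lsmod | grep '^nvidia'            # confirms the nvidia modules are loaded at all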
Unfortunately, this does not solve the problem, and nvidia-smi still can't find anything. It does have some effect, though, as the error message no longer appears during boot.
After more searching, I found the other part of the solution in the forum here.
This solution requires adding the following line to /etc/modprobe.d/nvidia.conf (create the file if it doesn't exist):
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
Restart and the problem is solved.
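To double-check after the reboot that the option was actually picked up, you can inspect the module parameter and the kernel log; the sysfs path below is my assumption of where the nvidia module exposes it:
cat /sys/module/nvidia/parameters/NVreg_OpenRmEnableUnsupportedGpus   # should print 1
sudo dmesg | grep -i RmInitAdapter                                    # should no longer show failures
nvidia-smi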
Finally, following the NVIDIA cuDNN installation documentation, I installed cuDNN and successfully ran several models.
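As a final sanity check, you can confirm that cuDNN is visible to the dynamic linker; the exact library names depend on which cuDNN version you installed:
ldconfig -p | grep -i cudnn   # should list the libcudnn libraries
nvidia-smi                    # and the GPU should still show up as healthy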