Background#
Recently, I tried to set up a server for running AI in my home data center. Since I only have one machine that can accommodate the 3090 and buying a new one would be too expensive, I chose to virtualize it on ESXi and pass the GPU through directly.
The virtual machine configuration is as follows:
- AMD EPYC 7302, 48 vCPUs
- Based on ESXi-7.0U3 platform
- NVIDIA GeForce RTX 3090
- 128GB memory
- Debian GNU/Linux 12 (bookworm) x86_64
Main Process#
First, configure the virtual machine. As usual, set the following (see the sketch after this list):
- Lock all of the VM's memory
- Set hypervisor.cpuid.v0=FALSE
- Set pciPassthru0.msiEnabled=FALSE
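For reference, this is roughly how those two keys end up looking as advanced configuration parameters in the VM's .vmx file; memory locking itself is done with the "Reserve all guest memory (All locked)" checkbox rather than a key listed here:
hypervisor.cpuid.v0 = "FALSE"
pciPassthru0.msiEnabled = "FALSE"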
I won't go into too much detail, as you can find these operations in many places. Inside the virtual machine, first configure the apt sources to include the non-free-firmware component: in /etc/apt/sources.list, add non-free-firmware so that the line looks like this:
deb https://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
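If you'd rather make the change from the shell, a one-liner along these lines should work; this is only a sketch that assumes the stock Debian 12 entries and skips lines that already contain the component:
sudo sed -i '/non-free-firmware/! s/^\(deb .*bookworm.*main.*\)$/\1 non-free-firmware/' /etc/apt/sources.list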
Then refresh the package index and search for the driver:
sudo apt update
apt search ^nvidia-driver
If everything is normal, you should be able to find
nvidia-driver/unknown 545.23.06-1 amd64
NVIDIA metapackage
Issues#
Usually, on a physical machine you can simply run sudo apt install nvidia-driver, but this won't work in a virtual machine.
Let's first see what happens if you follow the physical-machine procedure and install it directly with apt.
First, install CUDA as usual:
sudo apt-get install software-properties-common
wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda-repo-debian12-12-3-local_12.3.0-545.23.06-1_amd64.deb
sudo dpkg -i cuda-repo-debian12-12-3-local_12.3.0-545.23.06-1_amd64.deb
sudo cp /var/cuda-repo-debian12-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3
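The toolkit lands in /usr/local/cuda-12.3 by default, so you may also want to put it on your PATH, following NVIDIA's usual post-install advice (optional, and the paths assume the default install location):
echo 'export PATH=/usr/local/cuda-12.3/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version   # should report release 12.3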
Then
sudo apt install nvidia-driver
Restart and check the GPU status
nvidia-smi
If it comes back empty, try detecting the GPU:
sudo apt install nvidia-detect
nvidia-detect
Then, you will encounter a confusing error
Detected NVIDIA GPUs:
1b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
Checking card: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
Uh oh. Your card is not supported by any driver version up to 545.23.06.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.
This makes no sense: there is no reason the latest driver would not support the 3090. Just to be safe, I checked the NVIDIA driver download page, which of course confirms that the 3090 is supported.
Solution#
While I was repeatedly restarting the virtual machine with no idea why this was happening, I suddenly spotted the key to the problem. I had been connecting to the VM over SSH, but when I rebooted it from VMRC, I saw this error message at the end of the boot process:
[ 12.699654] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 530.41.03 Thu Mar 16 19:48:20 UTC 2023
[ 12.762447] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 530.41.03 Thu Mar 16 19:23:04 UTC 2023
[ 12.871331] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[ 12.972022] ACPI Warning: \_SB.PCI0.PE50.S1F0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20210730/nsarguments-61)
[ 13.732645] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x26:0x56:1474)
[ 13.732697] BUG: unable to handle page fault for address: 0000000000004628
[ 13.732784] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
(The error message above is copied from the forum and may differ slightly from what you actually see.)
I searched on Google and found the answer.
In short, you need to install the open variant of the NVIDIA kernel module instead of the default one. The forum answer suggests installing from the .run file with the -m=kernel-open option; I'm not sure whether there is a deb package that can solve this problem.
Before applying this solution, you need to clean up the previous installation.
sudo nvidia-uninstall
sudo apt purge -y '^nvidia-*' '^libnvidia-*'
sudo rm -r /var/lib/dkms/nvidia
sudo apt -y autoremove
sudo update-initramfs -c -k `uname -r`
sudo update-grub2
sudo reboot
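Before reinstalling, it's worth confirming that nothing NVIDIA-related is left behind; this quick check is my addition, not part of the forum's instructions:
dpkg -l | grep -i nvidia    # should print nothing if the purge was complete
lsmod | grep nvidia         # should also be empty after the reboot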
Then download the .run-format driver from the NVIDIA driver download site and execute:
sudo ./NVIDIA-Linux-x86_64-525.116.04.run -m=kernel-open
sudo update-initramfs -u
sudo reboot
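To see which flavor of the kernel module actually got installed after the reboot, you can look at the version string the driver exposes; my assumption is that the open variant identifies itself as an open kernel module there:
cat /proc/driver/nvidia/version   # the open variant should mention an "Open Kernel Module"
lsmod | grep '^nvidia'            # confirms the nvidia modules are loaded at all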
Unfortunately, this does not solve the problem, and nvidia-smi still can't find anything. It does have some effect, though, as the error message no longer appears during boot.
After more searching, I found the other part of the solution in the forum here.
This solution requires adding the following line to /etc/modprobe.d/nvidia.conf (create the file if it doesn't exist):
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
Restart and the problem is solved.
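To double-check after the reboot that the option was actually picked up, you can inspect the module parameter and the kernel log; the sysfs path below is my assumption of where the nvidia module exposes it:
cat /sys/module/nvidia/parameters/NVreg_OpenRmEnableUnsupportedGpus   # should print 1
sudo dmesg | grep -i RmInitAdapter                                    # should no longer show failures
nvidia-smi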
Finally, following the NVIDIA cuDNN installation documentation, I installed cuDNN and successfully ran several models.
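As a final sanity check, you can confirm that cuDNN is visible to the dynamic linker; the exact library names depend on which cuDNN version you installed:
ldconfig -p | grep -i cudnn   # should list the libcudnn libraries
nvidia-smi                    # and the GPU should still show up as healthy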