Install drivers on a cloud server with GPUs

For your information

These instructions use a cloud server created from the pre-built Ubuntu 24.04 LTS 64-bit image as an example.

A cloud server with a GPU needs NVIDIA® GPU drivers for stable operation.

If you created the cloud server from a pre-built GPU-optimized image, the drivers are already installed and no additional installation is required. GPU-optimized pre-built images:

Ubuntu 24.04 LTS 64-bit GPU driver;
Ubuntu 24.04 LTS 64-bit CUDA 11.8 Docker;
Ubuntu 24.04 LTS 64-bit CUDA 12.8 Docker;
Ubuntu 22.04 LTS 64-bit GPU driver;
Ubuntu 22.04 LTS 64-bit CUDA 11.8 Docker;
Ubuntu 22.04 LTS 64-bit CUDA 12.8 Docker;
Data Science VM (Ubuntu 22.04 LTS 64-bit);
Data Analytics VM (Ubuntu 22.04 LTS 64-bit).

Install drivers

Connect to the cloud server with the GPU.

Install the ubuntu-drivers-common package (the command below also installs alsa-utils):

```
sudo apt install -y ubuntu-drivers-common alsa-utils
```

Check the recommended driver version:

```
sudo ubuntu-drivers devices
```

The output lists the available driver versions. The recommended version is marked with the word recommended. Copy its name.

Example for NVIDIA® Tesla T4 GPU with recommended version nvidia-driver-550:

```
== /sys/devices/pci0000:00/0000:00:06.0 ==
modalias : pci:v000010DEd00001EB8sv000010DEsd000012A2bc03sc02i00
vendor   : NVIDIA Corporation
model    : TU104GL [Tesla T4]
manual_install: True
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-550 - third-party non-free recommended
driver   : nvidia-driver-418-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
```
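Copying the name by hand works, but the lookup can also be scripted. The following is a minimal sketch, assuming the output keeps the format shown in the example above. The function name recommended_driver is hypothetical and is not part of ubuntu-drivers.

```shell
# Print the package name from the line that is marked as recommended.
# Reads the output of `ubuntu-drivers devices` from standard input.
recommended_driver() {
    # Fields of a matching line: driver : <package> - <source> <license> recommended
    awk '$1 == "driver" && $NF == "recommended" { print $3; exit }'
}
```

On the cloud server it could be called as `sudo ubuntu-drivers devices | recommended_driver`.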

Optional: verify that the selected driver version is not lower than the minimum version compatible with the GPU architecture of the cloud server:
```
apt-cache search '^nvidia-driver-'
```
The output lists the driver packages available for installation. To find the GPU architecture, see the Create a Cloud Server with a GPU instructions. To check that the driver version matches the architecture, see the NVIDIA® CUDA Compatibility documentation.
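The comparison with the minimum version can be sketched in shell. This is only an illustration: driver_meets_minimum is a hypothetical helper, and the minimum branch number is a placeholder that you would take from the NVIDIA® CUDA Compatibility documentation for your GPU architecture.

```shell
# Succeed if the driver branch in a package name is at least the given minimum.
# $1: package name, for example nvidia-driver-550 or nvidia-driver-535-server
# $2: minimum branch number (placeholder; look it up for your GPU architecture)
driver_meets_minimum() {
    branch=${1#nvidia-driver-}   # drop the common prefix
    branch=${branch%%-*}         # drop a suffix such as -server
    [ "$branch" -ge "$2" ]
}
```

For example, `driver_meets_minimum nvidia-driver-550 525 && echo "version is sufficient"`, where 525 is an assumed value used only for illustration.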
If your GPU architecture is Pascal (for example, the NVIDIA® GeForce GTX 1080), add the graphics-drivers Personal Package Archive (PPA) repository to the cloud server:
```
sudo add-apt-repository ppa:graphics-drivers/ppa -y
```
Install the kernel headers:
```
sudo apt update
for kernel in $(linux-version list); do sudo apt install -y "linux-headers-${kernel}"; done
```
The loop installs the headers for every kernel version installed on the cloud server, as listed by the linux-version list command.
Install the driver:
```
sudo apt install -y <driver_version>
```
Specify <driver_version>: the name of the recommended driver that you copied from the output of the ubuntu-drivers devices command.

Example of installing the recommended version of nvidia-driver-550 for NVIDIA® Tesla T4 GPUs:
```
sudo apt install -y nvidia-driver-550
```

Check that the driver is installed and working:

```
nvidia-smi
```

The output shows the NVIDIA-SMI version, the driver version, and a CUDA version. This CUDA version is the one that is compatible with the current driver; it is not installed on the system. The CUDA Runtime API and CUDA Toolkit are installed separately and are not included in the nvidia-driver package. Example output:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:06.0 Off |                    0 |
| N/A   41C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
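For a scripted check, nvidia-smi can print the driver version alone with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The sketch below compares that value with the branch of the package you installed. The helper name driver_matches_branch is hypothetical.

```shell
# Succeed if the reported driver version belongs to the expected branch.
# $1: full driver version, for example 550.54.15
# $2: branch number from the package name, for example 550 for nvidia-driver-550
driver_matches_branch() {
    case "$1" in
        "$2".*) return 0 ;;   # version starts with "<branch>."
        *)      return 1 ;;
    esac
}
```

A possible call on the cloud server: `driver_matches_branch "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 550`. The `head -n 1` is there because the query prints one line per GPU.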

Open the configuration file of the unattended-upgrades package, which handles automatic security updates:
```
sudo nano /etc/apt/apt.conf.d/50unattended-upgrades
```
Disable automatic updates for NVIDIA® and kernel packages. To do this, add the following block to the file:
```
Unattended-Upgrade::Package-Blacklist {
    "linux-";
    "nvidia-";
};
```
Save the changes and exit the nano text editor: press Ctrl+X, then Y, then Enter.
Optional: pin (lock) the kernel version to disable kernel updates. Updating the kernel version may cause errors in GPU drivers.
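To confirm that apt picked up the blacklist entries added to 50unattended-upgrades, you can inspect the merged configuration printed by `apt-config dump`. This is a minimal sketch: the helper name blacklist_contains is hypothetical, and it assumes that the dump prints each list entry on its own line in the form `Unattended-Upgrade::Package-Blacklist:: "value";`.

```shell
# Succeed if the apt configuration dump (on standard input) contains
# the given entry in the unattended-upgrades package blacklist.
# $1: blacklist entry, for example nvidia-
blacklist_contains() {
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -F -x -q "Unattended-Upgrade::Package-Blacklist:: \"$1\";"
}
```

A possible call: `apt-config dump | blacklist_contains nvidia- && echo "nvidia- is blacklisted"`.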

Pin the kernel version

For your information

In pre-built images with preinstalled drivers, except for Data Analytics VM (Ubuntu 22.04 LTS 64-bit) and Data Science VM (Ubuntu 22.04 LTS 64-bit), the kernel version is already pinned.

During installation, the driver's kernel modules are compiled against the headers of the current kernel version. If the kernel version changes, the GPU driver can stop working. In that case, the nvidia-smi command may return the following error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

To disable kernel updates, pin the kernel version in the apt package manager settings. You can still update the kernel version later; to do this, remove the pin first.

Open the CLI.

Create the pin-linux-kernel-nvidia-dkms file in the /etc/apt/preferences.d directory to pin the versions of the linux-headers and linux-image packages:

```
sudo tee /etc/apt/preferences.d/pin-linux-kernel-nvidia-dkms > /dev/null <<EOF
Package: linux-image-*
Pin: version *
Pin-Priority: -1

Package: linux-headers-*
Pin: version *
Pin-Priority: -1
EOF
```
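After creating the file, it can be useful to check that both stanzas were written as intended. The following sketch lists the package patterns that a preferences file pins with a negative priority. The function name negative_pins is hypothetical, and the check only reads the file; it does not query apt.

```shell
# Print every package pattern that the given apt preferences file
# pins with a negative priority.
# $1: path to the preferences file
negative_pins() {
    awk -F': *' '
        tolower($1) == "package"                 { pkg = $2 }
        tolower($1) == "pin-priority" && $2 < 0  { print pkg }
    ' "$1"
}
```

A possible call: `negative_pins /etc/apt/preferences.d/pin-linux-kernel-nvidia-dkms`, which should print the linux-image-* and linux-headers-* patterns if the file matches the one above.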

Update the kernel version after pinning

Once you pin the kernel version, it is no longer updated. To get security updates, performance improvements, and new features, delete the pin file and upgrade the kernel.

Open the CLI.
Delete the file that you created to pin the kernel version:
```
sudo rm /etc/apt/preferences.d/pin-linux-kernel-nvidia-dkms
```
Update the kernel version:
```
sudo apt install linux-image-<kernel-version>
```
Specify <kernel-version>: the kernel version to install. To list the available kernel versions, run the apt-cache search linux-image command.
Reboot the cloud server.
Install the headers for the running kernel:
```
sudo apt install "linux-headers-$(uname -r)"
```
Once the kernel headers are installed, the dkms utility will run and automatically rebuild the NVIDIA modules for the new kernel version.
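Whether the rebuild succeeded can be checked with the `dkms status` command. The sketch below looks for an nvidia module that is reported as installed for a given kernel release. The helper name nvidia_module_built_for is hypothetical, and the exact layout of the `dkms status` output differs between dkms versions, so treat the matching rules as an assumption.

```shell
# Succeed if the `dkms status` output (on standard input) reports an
# nvidia module as installed for the given kernel release.
# $1: kernel release, for example the output of `uname -r`
nvidia_module_built_for() {
    awk -v kernel="$1" '
        # index() does a plain substring search, so dots in the release are literal
        /^nvidia/ && /: installed/ && index($0, ", " kernel ",") { found = 1 }
        END { exit !found }
    '
}
```

A possible call after the reboot: `dkms status | nvidia_module_built_for "$(uname -r)" && nvidia-smi`.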
