The vGPU driver version on a node must be compatible with the node's kernel version. Therefore, if a vGPU driver is already installed on the node, the driver must be updated to a version compatible with the new kernel after the kernel is upgraded or rolled back.
Preparation
Execute the command uname -a to check the current kernel version of the host, then obtain the driver installation package based on the kernel version and the required vGPU driver version (see the example after this list).
All virtual machines with vGPUs mounted on the host to be upgraded must be powered off.
If the ACOS (AVE) cluster has a file storage cluster deployed and the file controller resides on the current host, disable high availability for the file controller.
Complete the self-check according to Entering maintenance mode to ensure that all check items meet the requirements.
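As an example of the kernel check above (the hostname and kernel string are illustrative; the kernel shown matches the one embedded in the driver package release used later in this guide):
$uname -a
Linux host01 4.19.90-2307.3.0.el7.x86_64 #1 SMP ... x86_64 x86_64 x86_64 GNU/Linux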
Procedure
In the AOC host list, click the ellipsis (...) to the right of the target host and select Enter maintenance mode. After the system automatically runs the checks and returns the results, confirm that the host has entered maintenance mode.
Upload the driver installation package to the /home/acos path on the host.
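For example, the package can be copied over SSH from the machine that holds it (the file name and host address below are placeholders; substitute your actual values):
scp Nvidia-vGPU-<version>.rpm root@<host_ip>:/home/acos/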
Execute the following command to delete the existing driver.
rpm -e Nvidia-vGPU
After the deletion, execute the command rpm -qi Nvidia-vGPU to confirm that the driver has been deleted. The example output is as follows:
$rpm -qi Nvidia-vGPU
package Nvidia-vGPU is not installed
Execute the following command to install the new driver package.
cd /home/acos && rpm -ivh <new_rpm_name>
Use the command rpm -qi Nvidia-vGPU to view information about the installed driver and confirm that the installation succeeded. The example output is as follows:
$rpm -qi Nvidia-vGPU
Name : Nvidia-vGPU
Version : 14.2
Release : 4.19.90_2307.3.0.el7.v60_5
Architecture: x86_64
Install Date: Fri 03 Mar 2023 02:58:24 PM CST
Group : Unspecified
Size : 69359565
License : GPL
Signature : (none)
Source RPM : Nvidia-vGPU-14.2-4.19.90_2307.3.0.el7.v60_5.x86_64.src.rpm
Build Date : Thu 09 Feb 2023 11:19:37 AM CST
Build Host : 5e21f91597c8
Relocations : (not relocatable)
URL : https://www.nvidia.com
Summary : Nvidia vGPU driver
Description :
Dynamically create, increase and shrink
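Because the driver package is built against a specific kernel, it can be worth confirming the match before restarting. A minimal sketch; it assumes, based on the Release field in the example output above, that the package release embeds the kernel version with dashes replaced by underscores:
# Print the kernel string embedded in the installed driver package
rpm -q --qf '%{RELEASE}\n' Nvidia-vGPU
# Print the running kernel; the two should refer to the same kernel version
uname -r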
Restart the host through AOC or the IPMI management platform.
After the restart is complete, execute the command nvidia-smi. If it outputs the GPU information, the driver is functioning normally. The example output is as follows:
$nvidia-smi
Fri Apr 14 10:27:43 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.03 Driver Version: 510.85.03 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:2F:00.0 Off | 0 |
| N/A 32C P0 25W / 250W | 50MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
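If you need to record the driver version non-interactively (for example, in an upgrade log), nvidia-smi also supports a query mode. The flags below are standard nvidia-smi options; the output line is illustrative, based on the example above:
$nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
Tesla V100-PCIE-16GB, 510.85.03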
On the host overview page in AOC, check how long the current host has been in maintenance mode, and click Exit maintenance mode on the right side of the prompt.
The system displays a check dialog box to perform the exit check, and returns the results of all check items after the check is complete.
If the host contains a file controller for the AFS service, power on the file controller and re-enable high availability for it.
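For reference, the on-host portion of this procedure can be collected into a single session. This is a minimal sketch rather than product tooling; <new_rpm_name> is a placeholder, and the restart and maintenance-mode steps must still be performed through AOC or IPMI:
# Remove the old driver and confirm the removal
rpm -e Nvidia-vGPU
rpm -qi Nvidia-vGPU        # expect: package Nvidia-vGPU is not installed
# Install the new driver package uploaded to /home/acos
cd /home/acos && rpm -ivh <new_rpm_name>
rpm -qi Nvidia-vGPU        # confirm the new Version and Release fields
# After restarting the host through AOC or IPMI:
nvidia-smi                 # a GPU table in the output indicates the driver is working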