Updating the vGPU driver

The vGPU driver on a node must be compatible with the node's kernel version. If a node already has a vGPU driver installed and you upgrade or roll back its kernel, you must also update the vGPU driver to a version compatible with the new kernel.

Preparation

  • Run uname -a to check the current kernel version of the host, then obtain the driver installation package that matches both the kernel version and the required vGPU driver version.

  • Ensure that all virtual machines with a vGPU mounted on the host to be upgraded are powered off.

  • If the ACOS (AVE) cluster has a file storage cluster deployed and the file controller is located on the current host, disable high availability for the file controller.

  • Complete the self-check described in Entering maintenance mode and make sure that all check items meet the requirements.
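The first preparation item can be partly checked from the shell. The sketch below compares a package file name against the running kernel. It rests on an assumption inferred only from the example Release field shown later on this page (4.19.90_2307.3.0.el7.v60_5): that the package name embeds the kernel release with "-" replaced by "_" and without the architecture suffix. The helper names kernel_tag and package_matches_kernel are hypothetical, so confirm the naming scheme against the packages you actually receive.

```shell
#!/bin/sh
# Hypothetical helper: turn a kernel release string (as printed by `uname -r`)
# into the form assumed to appear in the driver package name.
# ASSUMPTION: "-" becomes "_" and the architecture suffix is dropped.
kernel_tag() {
    release=$1
    release=${release%.x86_64}     # drop the architecture suffix, if present
    release=${release%.aarch64}
    printf '%s\n' "$release" | sed 's/-/_/g'
}

# Hypothetical helper: succeed (exit 0) if the package file name
# contains the tag derived from the given kernel release.
package_matches_kernel() {
    pkg=$1
    tag=$(kernel_tag "$2")
    case $pkg in
        *"$tag"*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, package_matches_kernel <new_rpm_name> "$(uname -r)" || echo "package may not match this kernel" gives a quick warning before the host enters maintenance mode. A mismatch under this heuristic only means the name does not follow the assumed pattern, not that the driver is necessarily incompatible.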

Procedure

  1. In the AOC host list, click the ellipsis (...) to the right of the target host and select Enter maintenance mode. After the system runs its checks and returns the results, confirm that the host enters maintenance mode.

  2. Upload the driver installation package to the /home/acos path on the host.

  3. Run the following command to remove the existing driver.

    rpm -e Nvidia-vGPU

  4. After removal, run rpm -qi Nvidia-vGPU to confirm that the driver has been removed. Example output:

    $ rpm -qi Nvidia-vGPU
    package Nvidia-vGPU is not installed

  5. Run the following command to install the new driver package.

    cd /home/acos && rpm -ivh <new_rpm_name>

  6. Run rpm -qi Nvidia-vGPU to view the installed driver information and confirm that the installation succeeded. Example output:

    $ rpm -qi Nvidia-vGPU
    Name        : Nvidia-vGPU
    Version     : 14.2
    Release     : 4.19.90_2307.3.0.el7.v60_5
    Architecture: x86_64
    Install Date: Fri 03 Mar 2023 02:58:24 PM CST
    Group       : Unspecified
    Size        : 69359565
    License     : GPL
    Signature   : (none)
    Source RPM  : Nvidia-vGPU-14.2-4.19.90_2307.3.0.el7.v60_5.x86_64.src.rpm
    Build Date  : Thu 09 Feb 2023 11:19:37 AM CST
    Build Host  : 5e21f91597c8
    Relocations : (not relocatable)
    URL         : https://www.nvidia.com
    Summary     : Nvidia vGPU driver
    Description :
    Dynamically create, increase and shrink

  7. Restart the host through AOC or the IPMI management platform.

  8. After the restart is complete, run nvidia-smi. If the command prints GPU information, the driver is working correctly. Example output:

    $ nvidia-smi
    Fri Apr 14 10:27:43 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.85.03    Driver Version: 510.85.03    CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-PCIE...  On   | 00000000:2F:00.0 Off |                    0 |
    | N/A   32C    P0    25W / 250W |     50MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

  9. On the host overview page in AOC, check how long the host has been in maintenance mode, then click Exit maintenance mode on the right side of the prompt.

  10. The system opens a dialog box, runs the exit checks, and returns the results for all check items when the checks are complete.

    • If all check items meet the requirements, the host can exit maintenance mode. Click Exit maintenance mode.
    • If some check items do not meet the requirements, make the adjustments described in Exiting maintenance mode, then try to exit maintenance mode again.

  11. If the host contains a file controller for the AFS service, power on the file controller and enable high availability for it.
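The host-side commands in steps 3 to 6 and step 8 can be collected into a small script. This is a minimal sketch, not an official tool. It assumes you run it as root on a host that is already in maintenance mode, and the function names update_vgpu_driver and verify_vgpu_driver and the NVIDIA_SMI override variable are hypothetical. The package name Nvidia-vGPU and the rpm and nvidia-smi commands are the ones used in the steps above. The restart in step 7 still has to be done through AOC or IPMI between the two functions.

```shell
#!/bin/sh
# Sketch of steps 3-6: replace the installed vGPU driver package.
# Argument: path to the new driver package (for example under /home/acos).
update_vgpu_driver() {
    new_rpm=$1
    pkg=Nvidia-vGPU

    # Step 3: remove the existing driver, but only if it is installed.
    if rpm -q "$pkg" >/dev/null 2>&1; then
        rpm -e "$pkg" || return 1
    fi

    # Step 4: rpm -q exits non-zero once the package is gone.
    if rpm -q "$pkg" >/dev/null 2>&1; then
        echo "error: $pkg is still installed" >&2
        return 1
    fi

    # Step 5: install the new driver package.
    rpm -ivh "$new_rpm" || return 1

    # Step 6: print the installed package information for review.
    rpm -qi "$pkg"
}

# Sketch of step 8: run after the host has been restarted.
# NVIDIA_SMI is a hypothetical override so the command can be swapped out.
verify_vgpu_driver() {
    if "${NVIDIA_SMI:-nvidia-smi}" >/dev/null 2>&1; then
        echo "vGPU driver responded to nvidia-smi"
    else
        echo "error: nvidia-smi failed; the driver may not be loaded" >&2
        return 1
    fi
}
```

A typical use would be update_vgpu_driver /home/acos/<new_rpm_name> before the restart and verify_vgpu_driver after it. The script stops at the first failing rpm command, so a failed removal never leads to an install attempt.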