    ACOS 6.2.0

Physical disk failures

Failure types

Each main type is listed below, followed by its subtypes and descriptions.

Unhealthy disk

  • The physical disk is damaged.

  • Read or write operations time out or receive no response from the physical disk.

  • The physical disk experiences an I/O block.

Failing disk

  • The physical disk shows increased I/O latency, which has not yet reached a timeout state.

  • Errors occur during read or write operations on the physical disk (fewer than 100 times).

  • Errors occur during the physical disk verification (fewer than 100 times).

Disk failing the S.M.A.R.T. test

  • The S.M.A.R.T. test failed.

Software RAID failure

  • The physical disk in a software RAID group experiences read or write errors or high latency, and has been marked as a failed component by the software RAID.

Short lifespan

  • The physical disk shows no signs of read or write timeout, high I/O latency, or damage, but is determined by the system to have an insufficient lifespan and may soon pose a risk.

Failure symptoms

Any of the following alerts on the main AOC Alert page indicates a physical disk failure on a cluster node. Text in braces, such as { host/SCVM name }, is a placeholder for the actual information displayed by the system. Follow the instructions in the alert message for further actions.

The alert messages are grouped below by their default alert level.

Default alert level: Critical

  • Host/SCVM { host/SCVM name }: Disk { disk serial number + (disk label) } failure and has been isolated. Please unmount the remaining partitions on it.

  • Host/SCVM { host/SCVM name }: Disk { disk serial number + (disk label) } failure, and the last system or meta partition of the host exists on the physical disk.

  • Host/SCVM { host/SCVM name }: The physical disk { disk serial number + (disk label) } is in sub-healthy state. Please remove the disk.

  • Host/SCVM { host/SCVM name }: Physical disk { disk serial number + (disk label) } S.M.A.R.T. check failed.

  • The physical disk { disk serial number + (disk label) } on the host/SCVM { host/SCVM name } experienced I/O block and has been taken offline.

  • Hardware fault: the physical disk { disk label } on the host { host name } encountered an I/O error on the sector { sector name }. The latest error log recorded in the OS is: { log information }.

  • Hardware fault: the physical disk { disk label } on the host { host name } encountered a critical device error on the sector { sector name }. The latest error log recorded in the OS is: { log information }.

  • Hardware fault: the physical disk { disk label } on the host { host name } encountered a command timeout. The latest error log recorded in the OS is: { log information }.

Default alert level: Notice

  • Host/SCVM { host/SCVM name }: The system partition redundancy is not enough.

  • Host/SCVM { host/SCVM name }: The remaining lifetime of the physical disk { disk serial number + (disk label) } is less than { alert threshold }.

  • Host/SCVM { host/SCVM name }: The physical disk { disk serial number + (disk label) } is unhealthy. The system will automatically isolate the disk and restore the data to other healthy physical disks. Do not pull the disk.

  • Host/SCVM { host/SCVM name }: The physical disk { disk serial number + (disk label) } is in sub-health state. The system will automatically isolate the disk and restore the data to other healthy physical disks. Do not pull the disk.

Default alert level: Info

  • Host/SCVM { host/SCVM name }: The physical disk { disk serial number + (disk label) } failure and is now unmounted and can be removed safely.
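If alert text is exported from AOC and triaged by a script, the alert templates above can be matched mechanically. The following is a minimal sketch, not an ACOS or AOC API: the function names `template_to_regex` and `parse_alert` are hypothetical, only three of the templates are included, and the sample message (host `node-01`, disk `SN123 (sdb)`) is an assumption about how the placeholders are rendered.

```python
import re

# A subset of the alert templates listed above, with their default levels.
ALERT_TEMPLATES = [
    ("Host/SCVM { host/SCVM name }: Disk { disk serial number + (disk label) } "
     "failure and has been isolated. Please unmount the remaining partitions on it.",
     "Critical"),
    ("Host/SCVM { host/SCVM name }: The system partition redundancy is not enough.",
     "Notice"),
    ("Host/SCVM { host/SCVM name }: The physical disk "
     "{ disk serial number + (disk label) } failure and is now unmounted "
     "and can be removed safely.",
     "Info"),
]

PLACEHOLDER = re.compile(r"\{\s*(.*?)\s*\}")


def template_to_regex(template):
    """Return (compiled regex, placeholder names) for one alert template."""
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = PLACEHOLDER.split(template)
    names = parts[1::2]
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)   # literal text from the template
        else:
            pattern += "(.+?)"           # value rendered by the system
    return re.compile(pattern), names


def parse_alert(message):
    """Return (default level, placeholder values), or None if no template matches."""
    for template, level in ALERT_TEMPLATES:
        regex, names = template_to_regex(template)
        match = regex.fullmatch(message)
        if match:
            return level, dict(zip(names, match.groups()))
    return None


# Hypothetical rendered message; the real rendering may differ.
sample = ("Host/SCVM node-01: The physical disk SN123 (sdb) failure "
          "and is now unmounted and can be removed safely.")
result = parse_alert(sample)
```

Keeping the templates as data rather than hand-written patterns means further messages from the list can be added without touching the matching logic.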

When the ACOS cluster detects and reports one of the physical disk failures described above, the system responds according to the failure type. The recommended next steps for each type are as follows:

  • Software RAID failure

    When a software RAID failure occurs on a physical disk, the disk no longer handles read or write operations for the operating system partition or metadata partition. Another physical disk in the software RAID group handles these operations instead, which leaves the partition with insufficient redundancy. Contact technical support for troubleshooting.

  • Unhealthy disks or failing HDDs

    The system automatically isolates the problematic disk and initiates data recovery. AOC reports each stage of the process (isolation started, in progress, and completed) on the Alert page. Once an alert indicates that the isolation is complete and the disk can be safely removed, remove the disk. For details, refer to Replacing a physical disk.

  • Physical disk with I/O blocking

    The system automatically takes the physical disk with I/O blocking offline and initiates data recovery. Once offline, the disk stops handling I/O requests, and it cannot be unmounted via the AOC interface until its offline status is cleared. Contact technical support for troubleshooting.

  • Failing SSD; physical disk with short lifespan; physical disk failing the S.M.A.R.T. test

    The system does not automatically take the problematic physical disk offline. Manually unmount the physical disk via the AOC interface, then replace it with a new disk on the host. For details, refer to Replacing a physical disk.
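The recommendations above amount to a lookup from failure type to system behavior and operator action. The sketch below encodes that lookup for use in a runbook script. It is illustrative only: `FailureType`, `Handling`, and `handling_for` are hypothetical names, not part of ACOS, and the strings summarize the list above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class FailureType(Enum):
    SOFTWARE_RAID = auto()
    UNHEALTHY_DISK = auto()   # excluding I/O block, which is handled separately
    FAILING_HDD = auto()
    IO_BLOCK = auto()
    FAILING_SSD = auto()
    SHORT_LIFESPAN = auto()
    SMART_FAILED = auto()


@dataclass(frozen=True)
class Handling:
    system_action: str   # what the system does on its own
    next_step: str       # what the operator is advised to do


_AUTO_ISOLATE = Handling(
    system_action="Isolates the disk and initiates data recovery.",
    next_step="Wait for the alert saying the disk can be removed safely, then replace it.",
)
_MANUAL_UNMOUNT = Handling(
    system_action="Does not take the disk offline automatically.",
    next_step="Unmount the disk via the AOC interface, then replace it.",
)

HANDLING = {
    FailureType.SOFTWARE_RAID: Handling(
        system_action="Another disk in the software RAID group handles the I/O.",
        next_step="Contact technical support.",
    ),
    FailureType.UNHEALTHY_DISK: _AUTO_ISOLATE,
    FailureType.FAILING_HDD: _AUTO_ISOLATE,
    FailureType.IO_BLOCK: Handling(
        system_action="Takes the disk offline and initiates data recovery.",
        next_step="Contact technical support.",
    ),
    FailureType.FAILING_SSD: _MANUAL_UNMOUNT,
    FailureType.SHORT_LIFESPAN: _MANUAL_UNMOUNT,
    FailureType.SMART_FAILED: _MANUAL_UNMOUNT,
}


def handling_for(failure):
    """Return the documented handling for a failure type."""
    return HANDLING[failure]


print(handling_for(FailureType.FAILING_SSD).next_step)
```

Note that failing HDDs and failing SSDs are separate entries because the system isolates the former automatically but leaves the latter for manual unmounting.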