
Single-node physical disk failures

Failures of system disks or data disks containing metadata partitions

  • Failures on system disks

    If ACOS is installed on hardware RAID 1 and one of the two system disks fails, the operating system automatically switches read and write operations to the healthy system disk, and cluster services remain unaffected throughout.

    When both system disks fail, the OS partition and metadata partition become completely unavailable, rendering the node unusable. If the virtual machines on the failed node are configured with high availability (HA), they can be migrated to other nodes to continue running normally. If a node failure causes the number of available replicas for certain data blocks to fall below the expected replication factor, data recovery will be triggered.

  • Failures on data disks with metadata partitions

    If ACOS is installed on software RAID 1 and one of the two data disks that contain metadata partitions fails, the operating system partition and metadata partition on that disk become unavailable. However, because the operating system partitions on the two data disks form one RAID 1 array and the two metadata partitions form another, the operating system automatically switches read and write operations to the partitions on the surviving disk, and ACOS cluster services remain unaffected throughout. The journal partition and cache partition on the faulty disk do become unavailable, however, so the node's read and write performance degrades.

    When both data disks containing metadata partitions fail simultaneously, the operating system partition and metadata partition become completely unavailable, rendering the node unusable. If the virtual machines on the failed node are configured with HA, they can be migrated to other nodes to continue running normally.

    If a physical disk failure causes the number of available replicas or coded blocks for certain data blocks to fall below the expected replication factor or number of coded blocks, data recovery will be triggered.
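
The recovery-trigger condition described above — compare the surviving replica or coded-block count against the expected count — can be sketched as follows. This is a minimal illustration with hypothetical names (`BlockState`, `needs_recovery`); ACOS's internal implementation is not public.

```python
from dataclasses import dataclass

@dataclass
class BlockState:
    """Health of one logical data block (hypothetical model)."""
    healthy_replicas: int      # surviving replicas (replication mode)
    healthy_coded_blocks: int  # surviving data + parity blocks (EC mode)
    uses_erasure_coding: bool

def needs_recovery(block: BlockState, replication_factor: int,
                   expected_coded_blocks: int) -> bool:
    """Recovery is triggered when the surviving count falls below expectation."""
    if block.uses_erasure_coding:
        return block.healthy_coded_blocks < expected_coded_blocks
    return block.healthy_replicas < replication_factor
```

A block that has lost one of two replicas would satisfy this check and be queued for rebuild, while a block still at full redundancy would not.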

Data disk failures

Data disks mainly consist of the journal partition, cache partition, and data partition, which are used to record physical disk write operations, cache data, and store data, respectively. When a data disk fails, the journal partition, cache partition, and data partition on the failed disk will all become unavailable, with the impacts as follows:

  • Journal partition unavailable

    The system continues to use the journal partition on other physical disks, resulting in decreased write performance for the node where the faulty disk is located, while read performance remains unaffected.

  • Cache partition unavailable

    The system continues to use the cache partitions on the other physical disks. If the node's remaining cache partitions still provide sufficient space after the failure, read and write performance is unaffected. If the available cache space is insufficient, more aggressive data tiering and I/O throttling are triggered, degrading the node's read and write performance. Data in the cache partition's performance tier is persistently stored, so if a physical disk failure causes it to lose replicas, data recovery is triggered. The system prioritizes rebuilding that data on other nodes to restore the expected number of replicas; if available space exists only on the node with the faulty disk, the data can also be rebuilt on other healthy data partitions within the same node.

  • Data partition unavailable

    Data partitions use replication or erasure coding to keep redundant copies of each data block on data partitions across multiple nodes. If one or more data disks on a single node fail, the system continues to operate normally: enough replicas or erasure-coded blocks remain on other nodes, and data partitions on other physical disks can still be used, so cluster operations are not impacted. Meanwhile, the data disk failure triggers data recovery: data is automatically rebuilt from the healthy copies onto available space on other nodes so that the replication factor or number of erasure-coded blocks meets expectations. If available space exists only on the node with the faulty disk, the data can also be rebuilt on other healthy data partitions within the same node.

    In addition, as long as the cache partitions on the node's remaining data disks have sufficient available space, failures that make a data partition unavailable do not affect the node's storage performance.
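
The rebuild placement preference described above — restore redundancy on other nodes first, and fall back to healthy data partitions on the same node only when no other space exists — can be sketched roughly as below. Names such as `PartitionInfo` and `choose_rebuild_target` are illustrative, not ACOS APIs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartitionInfo:
    """A candidate data partition for rebuilding a lost replica (illustrative)."""
    node: str
    free_bytes: int
    healthy: bool

def choose_rebuild_target(candidates: list[PartitionInfo],
                          degraded_node: str,
                          needed_bytes: int) -> Optional[PartitionInfo]:
    """Pick where to rebuild a lost replica or coded block.

    Prefer healthy partitions with enough free space on *other* nodes;
    fall back to healthy partitions on the node with the failed disk.
    Return None if nothing fits (recovery stays pending).
    """
    usable = [p for p in candidates
              if p.healthy and p.free_bytes >= needed_bytes]
    other_nodes = [p for p in usable if p.node != degraded_node]
    if other_nodes:
        # Spread rebuild load by choosing the roomiest candidate.
        return max(other_nodes, key=lambda p: p.free_bytes)
    same_node = [p for p in usable if p.node == degraded_node]
    return max(same_node, key=lambda p: p.free_bytes) if same_node else None
```

Under this sketch, a same-node rebuild happens only when every other node is full or unhealthy, matching the fallback behavior the section describes.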