Failures on system disk
If ACOS is installed on hardware RAID 1 and either of the two system disks fails, read and write operations are transparently switched to the surviving system disk, and cluster services remain unaffected throughout the process.
When both system disks fail, the OS partition and metadata partition become completely unavailable, rendering the node unusable. If the virtual machines on the failed node are configured with high availability (HA), they can be migrated to other nodes to continue running normally. If a node failure causes the number of available replicas for certain data blocks to fall below the expected replication factor, data recovery will be triggered.
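The replica-count condition that triggers recovery can be sketched as a small check. This is an illustration only: the names, the data structure, and the replication factor of 2 are assumptions, not ACOS internals.

```python
# Illustrative sketch only; ACOS's real recovery logic is not described here.
# block_replicas maps each data block ID to the set of nodes holding a replica.

REPLICATION_FACTOR = 2  # assumed expected replica count per data block

def blocks_needing_recovery(block_replicas, failed_nodes):
    """Return the blocks whose available replica count fell below the factor."""
    needing = []
    for block, nodes in block_replicas.items():
        if len(nodes - failed_nodes) < REPLICATION_FACTOR:
            needing.append(block)
    return needing
```

For example, if block `b1` is replicated on `node1` and `node2`, block `b2` on `node2` and `node3`, and `node1` fails, only `b1` falls below two available replicas and is queued for recovery.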
Failures on data disk with metadata partition
If ACOS is installed on software RAID 1 and either of the two data disks containing a metadata partition fails, the operating system partition and metadata partition on the failed disk become unavailable. However, because the operating system partitions on the two data disks form one RAID 1 pair and the two metadata partitions form another, the operating system automatically switches read and write operations to the partitions on the surviving disk, and ACOS cluster services remain unaffected throughout the process. If the virtual machines on the failed node are configured with HA, they can be migrated to other nodes and continue running normally.
When both data disks containing metadata partitions fail, the operating system partition and metadata partition become completely unavailable, rendering the node unusable. If a node failure causes the number of available replicas for certain data blocks to fall below the expected replication factor, data recovery will be triggered.
Failures on other data disks
The other data disks mainly contain the journal partition, cache partition, and data partition, used respectively for recording physical disk write operations, caching data, and storing data. When one of these disks fails, the journal and data partitions on it become unavailable, with the following impacts:
Journal partition unavailable
The system continues to use the journal partitions on the other physical disks, so write performance on that node decreases while read performance remains unaffected.
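One way to see why only writes suffer: the journal writes that previously spread over all journal partitions now concentrate on the remaining healthy ones. A toy model (all names assumed, not ACOS internals):

```python
def per_disk_journal_load(total_writes, journal_disks, failed_disks):
    """Split a node's journal write load over its healthy journal disks.

    Toy model: fewer healthy journal partitions means more writes per
    remaining disk, hence lower single-node write performance. Reads do
    not go through the journal, so read performance is unchanged.
    """
    healthy = [d for d in journal_disks if d not in failed_disks]
    if not healthy:
        raise RuntimeError("no journal partition left on this node")
    return {disk: total_writes / len(healthy) for disk in healthy}
```

With three journal disks, each absorbs a third of the node's journal writes; after one fails, each survivor absorbs half, so per-disk write pressure rises.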
Data partition unavailable
The system can continue to use the data partitions on other physical disks. Data partitions use a replication mechanism: each data block is stored redundantly across data partitions on multiple nodes, which provides single-node fault tolerance.
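The cross-node redundancy described here can be sketched as a placement rule: every replica of a block must land on a distinct node. The round-robin spread below is an assumption for illustration; ACOS's actual placement policy is not described here.

```python
def place_replicas(block_id, nodes, replication_factor=2):
    """Choose distinct nodes to hold a block's replicas.

    Putting each replica on a different node is what gives single-node
    fault tolerance: losing any one node still leaves a healthy replica.
    """
    if len(nodes) < replication_factor:
        raise ValueError("fewer nodes than requested replicas")
    start = hash(block_id) % len(nodes)  # assumed, simplistic spread
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]
```

Because the chosen nodes are always distinct, any single node failure leaves at least one replica of every block reachable.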
If one or more physical disks fail on a single node, the resulting unavailable data partitions do not affect the node's storage performance, because sufficient data replicas remain available on other nodes.
A data disk failure triggers data recovery, which automatically rebuilds the data from healthy physical disks onto available space on other nodes, restoring the replica count to the expected value. If available space exists only on the node with the failed disk, the data can also be rebuilt on other healthy data partitions within that node.
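The target-selection order in this paragraph — prefer free space on other nodes, fall back to a healthy partition on the same node — might be sketched as follows. Function and parameter names are hypothetical, not documented ACOS behavior.

```python
def choose_rebuild_target(replica_nodes, nodes_with_space):
    """Pick a node on which to rebuild a lost replica.

    replica_nodes: nodes that already hold a copy of the block.
    nodes_with_space: nodes with free space, in preference order.
    """
    for node in nodes_with_space:
        if node not in replica_nodes:
            return node  # preferred: a fresh node keeps replicas spread out
    # Fallback: rebuild on a node that already holds a replica (e.g. the
    # node with the failed disk), using a different, healthy data partition.
    return nodes_with_space[0] if nodes_with_space else None
```

When no node has free space, the function returns `None`, modelling a recovery that must wait for capacity.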