Failures on system disk
If ACOS is installed on hardware RAID 1 and either of the two system disks fails, the operating system automatically switches to the surviving system disk for read and write operations, and cluster services remain unaffected throughout.
When both system disks fail, the OS partition and metadata partition become completely unavailable, rendering the node unusable. Virtual machines on the failed node that are configured with high availability (HA) can be migrated to other nodes and continue running normally. If the node failure causes the number of available replicas of certain data blocks to fall below the expected replication factor, data recovery is triggered.
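The recovery trigger described above can be sketched as a simple replica-count check. This is a hypothetical illustration only; the function and data-structure names are assumptions, not ACOS internals.

```python
# Hypothetical sketch: when a node failure drops a block's healthy replica
# count below the expected replication factor, queue the block for recovery.

REPLICATION_FACTOR = 2  # assumed cluster-wide setting

def blocks_needing_recovery(block_replicas, failed_nodes):
    """Return IDs of blocks whose healthy replica count is below the factor.

    block_replicas maps block_id -> list of node IDs holding a replica.
    """
    to_recover = []
    for block_id, nodes in block_replicas.items():
        healthy = [n for n in nodes if n not in failed_nodes]
        if len(healthy) < REPLICATION_FACTOR:
            to_recover.append(block_id)
    return to_recover

replicas = {
    "blk-1": ["node-a", "node-b"],
    "blk-2": ["node-b", "node-c"],
}
# Only blk-1 loses a replica when node-a fails.
print(blocks_needing_recovery(replicas, failed_nodes={"node-a"}))  # → ['blk-1']
```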
Failures on cache disk with metadata partition
If ACOS is installed on software RAID 1 and either of the two cache disks that contain metadata partitions fails, the operating system partition and metadata partition on the failed disk become unavailable. However, because the operating system partitions on the two cache disks form one RAID 1 array and the two metadata partitions form another, the operating system automatically switches to the partitions on the surviving disk for read and write operations, and ACOS cluster services remain unaffected throughout. The journal partition and cache partition on the failed disk also become unavailable, so the node's read and write performance degrades.
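On Linux, software RAID (md) health can be observed in /proc/mdstat, where a two-member mirror reports "[UU]" when healthy and "[U_]" or "[_U]" when one member has failed. The sketch below parses that format to flag degraded arrays; it is illustrative only and says nothing about how ACOS itself monitors RAID state.

```python
# Illustrative: detect degraded md arrays from /proc/mdstat output.
import re

SAMPLE_MDSTAT = """\
md0 : active raid1 sda2[0] sdb2[1]
      52395008 blocks super 1.2 [2/1] [U_]
"""

def degraded_arrays(mdstat_text):
    """Return names of md arrays with at least one failed member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+) :", line)
        if m:
            current = m.group(1)
        elif current and re.search(r"\[U*_+U*\]", line):
            # A "_" in the member-status brackets marks a failed member.
            degraded.append(current)
    return degraded

print(degraded_arrays(SAMPLE_MDSTAT))  # → ['md0']
```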
When both cache disks containing metadata partitions fail, the operating system partition and metadata partition become completely unavailable, rendering the node unusable. Virtual machines on the failed node that are configured with HA can be migrated to other nodes and continue running normally. If the node issue causes the number of available replicas of certain data blocks in the performance tier to fall below the expected replication factor, data recovery is triggered.
Other cache disks mainly contain the journal partition and cache partition, which record physical disk write operations and cache data. When one of these disks fails, the journal, cache, and data partitions on it all become unavailable, with the following impacts:
Journal partition unavailable
The system continues to use the journal partitions on the node's other physical disks, so write performance on that node decreases while read performance is unaffected.
Cache partition unavailable
The system continues to use the cache partitions on the node's other physical disks. If the total remaining cache space on the node is sufficient after the failure, read and write performance is not affected. If it is insufficient, more aggressive data tiering and I/O throttling are triggered, degrading read and write performance on the node. Performance-tier data on the cache partition is persistently stored, so if the physical disk failure reduces its replica count, data recovery is triggered. The system prioritizes rebuilding the data on other nodes to restore the expected number of replicas; if free space exists only on the failed node, the data can also be rebuilt on other healthy data partitions within the same node.
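The space-driven policy switch described above can be sketched as follows. The threshold and return fields are assumptions for illustration, not documented ACOS behavior.

```python
# Hypothetical sketch: when remaining cache capacity falls below a threshold
# after a cache partition is lost, the node tiers data down more aggressively
# and throttles incoming I/O.

LOW_SPACE_RATIO = 0.2  # assumed trigger point, not an ACOS constant

def cache_policy(total_cache_bytes, used_cache_bytes):
    """Decide tiering/throttling based on remaining cache space."""
    free_ratio = 1 - used_cache_bytes / total_cache_bytes
    if free_ratio < LOW_SPACE_RATIO:
        return {"aggressive_tiering": True, "io_throttle": True}
    return {"aggressive_tiering": False, "io_throttle": False}

# After losing a cache partition, the same workload fills a smaller cache:
print(cache_policy(total_cache_bytes=1000, used_cache_bytes=900))
# → {'aggressive_tiering': True, 'io_throttle': True}
```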
Note:
A cache disk failure does not take other cache disks or data disks offline, nor does it reduce the available storage capacity.
Data partitions use replication or erasure coding to maintain redundant copies of each data block across the data partitions of several nodes. If one or more data disks fail on a single node, the system continues operating normally because enough replicas or erasure coding blocks remain on other nodes, and the data partitions on the node's other physical disks remain usable, so cluster operations are not impacted. Meanwhile, the data disk failure triggers data recovery: data from the remaining healthy replicas is automatically rebuilt into available space on other nodes so that the replication factor or the number of erasure coding blocks meets expectations. If free space exists only on the failed node, the data can also be rebuilt on other healthy data partitions within the same node.
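The rebuild-placement preference described above (other nodes first, same-node healthy partitions only as a fallback) can be sketched like this. Field names and the selection order are illustrative assumptions.

```python
# Hypothetical sketch: choose where to rebuild a lost replica, preferring a
# partition on a different node and falling back to a healthy partition on
# the failed disk's own node only when no other node has enough free space.

def pick_rebuild_target(partitions, failed_node, needed_bytes):
    """partitions: list of dicts with 'node', 'partition', 'free_bytes'."""
    candidates = [p for p in partitions if p["free_bytes"] >= needed_bytes]
    # Prefer any healthy partition on a different node.
    for p in candidates:
        if p["node"] != failed_node:
            return p
    # Fall back to a healthy partition within the same node.
    for p in candidates:
        return p
    return None  # no space anywhere: recovery must wait

parts = [
    {"node": "node-a", "partition": "sdc1", "free_bytes": 500},
    {"node": "node-b", "partition": "sdd1", "free_bytes": 50},
]
# node-b lacks space for 100 bytes, so the rebuild lands on node-a itself:
print(pick_rebuild_target(parts, failed_node="node-a", needed_bytes=100))
```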