API Doc

Storage node failures

When a storage node fails, the data replicas or coded blocks it holds become unavailable to the ACOS cluster.

Failures on a single storage node

ACOS adopts multi-replica or erasure coding technologies to ensure that data blocks are redundantly stored across multiple nodes, providing single-node fault tolerance.

If the redundancy strategy uses multi-replica, each piece of data is written simultaneously as 2 or 3 copies on different nodes. In the event of a single-node failure, the other nodes still hold complete copies, so no data is lost and business continuity and reliability are preserved. Even if only the last valid replica remains, the data is automatically restored to other available nodes.
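The placement rule described above can be sketched as follows. This is a minimal illustration, not an ACOS API; node names and function names are hypothetical:

```python
import random

def place_replicas(block_id, nodes, factor=3):
    """Pick `factor` distinct nodes to hold copies of one data block."""
    if factor > len(nodes):
        raise ValueError("not enough nodes for the replication factor")
    return random.sample(nodes, factor)

def readable_after_failure(placement, failed_nodes):
    """A block stays readable as long as one replica is on a healthy node."""
    return any(node not in failed_nodes for node in placement)

nodes = ["node-1", "node-2", "node-3", "node-4"]
placement = place_replicas("block-42", nodes, factor=3)

# With 3 copies on distinct nodes, any single-node failure still leaves
# at least two healthy replicas, so the block remains readable.
assert all(readable_after_failure(placement, {n}) for n in nodes)
```

Because the copies land on distinct nodes, the replication factor directly bounds how many simultaneous node failures the data can survive.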

If the redundancy strategy uses erasure coding, data is encoded into multiple coded blocks (original data blocks plus parity blocks), each stored on a different node. When a single-node failure occurs, the missing data blocks can be reconstructed from the remaining coded blocks using the erasure coding algorithm, ensuring business continuity and reliability.
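A minimal illustration of the erasure-coding idea, using a simple XOR code with K = 2 data blocks and M = 1 parity block. Production systems typically use Reed-Solomon codes, which generalize this to multiple parity blocks; the XOR case just shows how a lost block is rebuilt from the survivors:

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# K = 2 data blocks of equal size, M = 1 parity block.
d0 = b"hello world!"
d1 = b"erasure code"
parity = xor_bytes(d0, d1)  # stored on a third node

# Lose d0 to a node failure; rebuild it from the two surviving blocks.
rebuilt = xor_bytes(parity, d1)
assert rebuilt == d0
```

The same principle scales up: with M parity blocks, any M missing blocks (data or parity) can be reconstructed from the remaining K blocks.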

The impact of a storage node failure can be categorized as follows:

  • Triggering host abnormality alerts

    When a node fails, host abnormality alerts are raised in the management interface.

  • Cluster performance degradation

    Because ACOS cluster performance scales linearly with node count, a node failure usually reduces cluster storage performance roughly in proportion to the failed node's share of total cluster performance.

  • Triggering data recovery

    After a node fails, the data replicas or coded blocks stored on that node become invalid. Once the system detects the invalid replicas or coded blocks, data recovery is triggered, restoring the data to available storage space on other healthy nodes. During data recovery, I/O on some nodes may experience additional degradation due to recovery traffic, but performance will stabilize once recovery is complete.
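The recovery step described above can be sketched as a planning pass that finds every block with a copy on the failed node and schedules a rebuild from a surviving replica onto a healthy node. All names here are illustrative, not ACOS internals:

```python
def plan_recovery(placements, failed_node, healthy_nodes):
    """For each block that lost a copy on `failed_node`, pick a source
    replica on a surviving node and a new target node for the rebuilt copy.

    placements: dict mapping block id -> list of nodes holding its copies.
    Returns a list of (block, source_node, target_node) recovery tasks.
    """
    tasks = []
    for block, nodes in placements.items():
        if failed_node not in nodes:
            continue  # this block lost nothing
        survivors = [n for n in nodes if n != failed_node]
        if not survivors:
            continue  # all copies lost; nothing to recover from
        # Rebuild onto a healthy node that does not already hold a copy.
        candidates = [n for n in healthy_nodes if n not in nodes]
        if candidates:
            tasks.append((block, survivors[0], candidates[0]))
    return tasks
```

Executing these tasks is what generates the recovery traffic mentioned above; once the task list is drained, redundancy is restored and I/O performance stabilizes.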

Failures on multiple storage nodes

Note:

In most cases, data in the ABS storage service follows a localized distribution policy. When the cluster has many nodes, the failure of 1 to 3 nodes does not cause loss of all logical data in the cluster; only a small portion of the data is affected.

When using a multi-replica strategy, the impact of multiple node failures on the data is shown in the table below:

| Replication factor | Failed node count | Failure impact |
| --- | --- | --- |
| 2 | 1 | Data is safe |
| 2 | ≥ 2 | May cause partial data loss |
| 3 | ≤ 2 | Data is safe |
| 3 | ≥ 3 | May cause partial data loss |

When using an erasure coding strategy, the impact of multiple node failures on the data is shown in the table below:

| Original data block count K | Parity block count M | Faulty node count | Failure impact |
| --- | --- | --- | --- |
| ≥ 2 | = 1 | 1 | Data is safe |
| ≥ 2 | = 1 | ≥ 2 | May cause partial data loss |
| ≥ 2 | = 2 | ≤ 2 | Data is safe |
| ≥ 2 | = 2 | ≥ 3 | May cause partial data loss |
| ≥ 3 | = 3 | ≤ 3 | Data is safe |
| ≥ 3 | = 3 | ≥ 4 | May cause partial data loss |
| ≥ 4 | = 4 | ≤ 3 | Data is safe |
| ≥ 4 | = 4 | = 4 | Data loss may occur with a very low probability |
| ≥ 4 | = 4 | ≥ 5 | May cause partial data loss |
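The erasure-coding rows can be encoded as a small lookup. Note the asymmetry at M = 4, where a fourth simultaneous failure is listed as a very-low-probability loss rather than safe; this sketch follows the table verbatim and is an illustration, not an ACOS API:

```python
def ec_failure_impact(parity_blocks_m, failed_nodes):
    """Impact of simultaneous node failures for a K + M erasure-coded pool,
    encoding the failure-impact table row by row."""
    if parity_blocks_m == 4:
        # Special case from the table: with M = 4, only 3 failures are
        # fully safe; a 4th has a very low probability of data loss.
        if failed_nodes <= 3:
            return "Data is safe"
        if failed_nodes == 4:
            return "Data loss may occur with a very low probability"
        return "May cause partial data loss"
    # For M = 1..3, data is safe while failures do not exceed M.
    if failed_nodes <= parity_blocks_m:
        return "Data is safe"
    return "May cause partial data loss"

assert ec_failure_impact(2, 2) == "Data is safe"
assert ec_failure_impact(2, 3) == "May cause partial data loss"
```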