ACOS 6.3.0

Rack and chassis failures

Replication mode

ACOS supports a rack-aware topology: it uses topology information to distribute data replicas as evenly as possible across racks, chassis, and hosts. As a result, even in extreme situations such as a rack losing power or being damaged, surviving replicas remain available on hosts in other racks, which improves the overall fault tolerance of the cluster.

In addition to choosing the best location when a replica is first placed, the system automatically migrates replicas between hosts while the cluster is running, based on the current replica distribution and the rack topology information. Replicas are placed preferentially on different racks, then on different chassis within the same rack, and finally on different hosts within the same chassis, so that the replica layout is as safe as the topology allows.
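The placement priority described above (racks first, then chassis, then hosts) can be illustrated with a small greedy sketch. This is not the ACOS placement algorithm, which the documentation does not specify. The `Host` type and the `place_replicas` function are hypothetical names introduced only for illustration, and the sketch assumes every host is healthy and has enough capacity.

```python
from collections import Counter
from typing import List, NamedTuple


class Host(NamedTuple):
    """Hypothetical host record: its name and the chassis and rack it sits in."""
    name: str
    chassis: str
    rack: str


def place_replicas(hosts: List[Host], replicas: int) -> List[Host]:
    """Greedily pick hosts for replicas, spreading across racks first,
    then across chassis within a rack, then across hosts.

    Illustrative only: capacity, load, and host health are ignored.
    """
    if replicas < 1 or replicas > len(hosts):
        raise ValueError("replica count must be between 1 and the number of hosts")

    chosen: List[Host] = []
    remaining = list(hosts)
    rack_load: Counter = Counter()      # replicas already placed per rack
    chassis_load: Counter = Counter()   # replicas already placed per (rack, chassis)

    for _ in range(replicas):
        # Prefer the least-used rack, then the least-used chassis in that rack;
        # the host name is only a deterministic tie-breaker.
        best = min(
            remaining,
            key=lambda h: (rack_load[h.rack], chassis_load[(h.rack, h.chassis)], h.name),
        )
        chosen.append(best)
        remaining.remove(best)          # a host holds at most one replica
        rack_load[best.rack] += 1
        chassis_load[(best.rack, best.chassis)] += 1

    return chosen
```

For example, with three hosts in one rack (two of them sharing a chassis) and one host in a second rack, a 3-replica placement under this sketch uses both racks and three different chassis rather than putting two replicas in the shared chassis.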

In general, the more racks and chassis a cluster has, the stronger its fault tolerance. For the common 2-replica and 3-replica clusters, the table below shows how the number of replicas, chassis, and racks determines the types of failure the cluster can tolerate:

Replication factor	Chassis count	Rack count	Tolerable fault type
2 replicas	1	1	Any single host failure
2 replicas	≥ 2	1	Any single chassis failure
2 replicas	≥ 2	≥ 2	Any single rack failure
3 replicas	1	1	Any two host failures
3 replicas	2	1	Any single chassis failure
3 replicas	≥ 3	1	Any two chassis failures
3 replicas	≥ 2	2	Any single rack failure
3 replicas	≥ 3	≥ 3	Any two rack failures
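The rows of the table follow one pattern: at the highest topology level that has at least two units, the cluster tolerates one fewer failure than the smaller of the replica count and the unit count. The sketch below encodes that reading of the table. The function name `tolerable_fault` is hypothetical, the rule is inferred from the rows above rather than stated by ACOS, and the host-level case assumes the cluster has at least as many hosts as replicas.

```python
from typing import Tuple


def tolerable_fault(replicas: int, chassis: int, racks: int) -> Tuple[int, str]:
    """Return (number of tolerable failures, failure level) for a cluster.

    Inferred from the table, not an official formula: with replicas spread
    across N units of a level, losing min(replicas, N) - 1 units still
    leaves at least one surviving replica.
    """
    if replicas < 1 or chassis < 1 or racks < 1:
        raise ValueError("all counts must be at least 1")

    if racks >= 2:
        # Replicas can be spread across racks, so whole racks may fail.
        return min(replicas, racks) - 1, "rack"
    if chassis >= 2:
        # Single rack, several chassis: whole chassis may fail.
        return min(replicas, chassis) - 1, "chassis"
    # Single rack and single chassis: only host failures are tolerated
    # (assumes at least `replicas` hosts in the chassis).
    return replicas - 1, "host"
```

For instance, `tolerable_fault(3, 2, 1)` yields `(1, "chassis")`, matching the row for 3 replicas, 2 chassis, and 1 rack.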

When a rack or chassis fails:

If a virtual machine has HA enabled, HA is triggered once the failure has lasted 3 minutes, and the virtual machine is automatically migrated to another available node. A virtual machine without HA configured must be manually migrated to an available host and restarted before it can be used again.
Related alerts appear in the management interface.