Active-active cluster failures

This section describes the disaster recovery mechanisms of an active-active cluster in the event of node and network failures, as well as the impact on services. For physical disk and NIC failures, the impact and handling methods are the same as those in clusters without the active-active feature enabled. For details, refer to the corresponding sections of this manual.

As an example, this section uses a typical nine-node active-active cluster deployed within the same geographical area, in which:

The primary availability zone A, secondary availability zone B, and witness node C are located in three different physical data center locations: IDC A, IDC B, and IDC C.
The virtualization platform is the AVE platform or VMware ESXi platform, with HA enabled and placement group policies configured.

In an active-active cluster, both the cluster-level primary availability zone (IDC A) and the secondary availability zone (IDC B) can run workloads. Separately from this cluster-level designation, each virtual machine treats the availability zone in which it normally runs as its own primary availability zone. The system responds to node failures in generally the same way in either availability zone. This example assumes that the virtual machines are expected to run stably in IDC A over the long term; IDC A is therefore the primary availability zone for the virtual machines.

When HA is enabled in the cluster and placement group policies are configured:

Default behavior of placement group policies: HA is performed preferentially on nodes within the virtual machine's current availability zone. If conditions permit, HA may also be performed in the other availability zone.
If the actual deployment places special restrictions on the placement group policy (for example, allowing startup only within the primary availability zone), HA behavior must comply with these restrictions.
Data-level failback behavior is not directly affected by placement group policies.
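
The default and restricted behaviors above can be illustrated with a minimal sketch. It is not the platform's actual scheduler: the `Node` type, the `ha_candidates` function, and the `allowed_zones` parameter are hypothetical names introduced only for illustration, and the witness node is left out on the assumption that it does not host virtual machines.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a cluster node (not a platform API)."""
    name: str
    zone: str      # availability zone, e.g. "IDC A" or "IDC B"
    healthy: bool


def ha_candidates(nodes: List[Node], current_zone: str,
                  allowed_zones: Optional[Set[str]] = None) -> List[Node]:
    """Return healthy nodes in the order HA would try them.

    Nodes in the VM's current availability zone come first; nodes in the
    other zone follow. If allowed_zones is given (assumed to model a
    placement group restriction), nodes outside it are excluded.
    """
    usable = [n for n in nodes
              if n.healthy and (allowed_zones is None or n.zone in allowed_zones)]
    same_zone = [n for n in usable if n.zone == current_zone]
    other_zone = [n for n in usable if n.zone != current_zone]
    return same_zone + other_zone


# Illustrative inventory: one node in IDC A has failed.
nodes = [
    Node("a1", "IDC A", healthy=False),
    Node("a2", "IDC A", healthy=True),
    Node("b1", "IDC B", healthy=True),
]

# Default policy: same zone first, then the other zone.
default_order = [n.name for n in ha_candidates(nodes, "IDC A")]

# Restricted policy: startup allowed only in the primary availability zone.
restricted = [n.name for n in ha_candidates(nodes, "IDC A", allowed_zones={"IDC A"})]
```

In this sketch, a restriction only narrows the set of candidate nodes; it does not change the preference for the virtual machine's current availability zone.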

Note:

Regardless of the type of failure, at most 2 replicas of the data are kept within a single availability zone; the data in that zone is not restored to 3 replicas.
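
The cap in the note above can be expressed as a small calculation. This is a sketch only: the function name and parameters are hypothetical, and the limit by healthy node count rests on the assumption that each replica is stored on a different node.

```python
# Per-zone limit taken from the note above.
PER_ZONE_REPLICA_CAP = 2


def replicas_kept_in_zone(desired_replicas: int, healthy_nodes_in_zone: int) -> int:
    """Number of replicas a single availability zone can hold.

    Assumes one replica per node, so the count is also bounded by the
    number of healthy nodes in the zone (an illustrative assumption).
    """
    return min(desired_replicas, PER_ZONE_REPLICA_CAP, healthy_nodes_in_zone)


# With 3 replicas desired and 4 healthy nodes in the surviving zone,
# the zone still holds only 2 replicas.
kept = replicas_kept_in_zone(desired_replicas=3, healthy_nodes_in_zone=4)
```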