    ACOS 6.2.0

Replacing a node

When a node fails in an ACOS (AVE) cluster, you can follow this section to replace the failed node with a new one. No data migration is required; after the replacement, the new node holds the same data as the original failed node.

Preparation

  • Record the hostname, network configuration, and physical disk usage of the failed node.

  • Migrate all virtual machines from the failed node to other nodes.

  • Ensure that the new node meets the hardware requirements. Refer to the Hardware requirements section.

  • Install the same version of the ACOS software on the new node. For details, see the Installing ACOS on a physical server section.

Procedure

  1. Identify the roles of the failed nodes in the host list on AOC.

    • If the failed nodes include master nodes, proceed to step 2.

    • If all failed nodes are storage nodes, move directly to step 3 and skip step 6.

  2. Convert each failed master node to a storage node one by one. Run the following command on any node in the cluster other than the failed nodes, ensuring that only one node is undergoing role conversion at any given time.

    zbs-cluster convert_to_storage [--is_alive <Boolean>] --ignore_recover_status [--offline_host_ips <offline_host_ips>] <data_ip>
    • [--is_alive <Boolean>]: Optional. If the failed master node is not connected to the storage network of other nodes in the cluster, you must specify this parameter and set <Boolean> to False. If connectivity exists, you may omit this parameter or set <Boolean> to True.

    • [--offline_host_ips <offline_host_ips>]: Optional. This parameter must be specified when the failed master node is unresponsive. <offline_host_ips> refers to the storage IP addresses of all unresponsive nodes in the cluster at that time. Separate multiple IP addresses with commas (,).

    • <data_ip>: Refers to the storage IP of the failed master node.
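
    As a hedged illustration, suppose the failed master node's storage IP is 192.168.20.13 (a placeholder address), the node is unresponsive, and it is the only offline node in the cluster. The invocation might then look like this:

    ```shell
    # Run on any healthy node in the cluster (not a failed node).
    # 192.168.20.13 is a placeholder for the failed master node's storage IP;
    # it also appears in --offline_host_ips because the node is unresponsive.
    zbs-cluster convert_to_storage \
        --is_alive False \
        --ignore_recover_status \
        --offline_host_ips 192.168.20.13 \
        192.168.20.13
    ```

    If the failed master node were still reachable over the storage network, --is_alive and --offline_host_ips could be omitted.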

  3. Confirm that the failed nodes are in a Not responding state. You can view the status of failed nodes on the host list or overview page in AOC.

    • If all failed nodes are in a Not responding state, proceed directly to step 4.

    • If some of the failed nodes are not in a Not responding state, click the ellipsis (...) to the right of the node in the AOC host list and select Shut down. In the dialog that opens, enter the Reason, and then click Shut down. If shutting down via AOC fails, power off the server via IPMI or at the physical machine to place the node in a Not responding state.
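
    If AOC cannot shut the node down, a standard out-of-band power-off via IPMI is one option. The BMC address and credentials below are placeholders; adapt them to your environment:

    ```shell
    # Power off the failed node out-of-band through its BMC.
    # 192.168.10.13, ADMIN, and <password> are placeholders.
    ipmitool -I lanplus -H 192.168.10.13 -U ADMIN -P '<password>' chassis power off

    # Optionally confirm the power state afterwards.
    ipmitool -I lanplus -H 192.168.10.13 -U ADMIN -P '<password>' chassis power status
    ```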

  4. Remove all failed nodes one by one. Run the following command on any node in the cluster other than the failed nodes, and make sure that only one node is being removed at a time.

    zbs-deploy-manage meta_remove_node [--offline_host_ips <offline_host_ips>] --keep_zbs_meta <data_ip>
    • [--offline_host_ips <offline_host_ips>]: Optional. This parameter must be specified when the failed node is unresponsive. <offline_host_ips> refers to the storage IP addresses of all unresponsive nodes in the cluster at that time. Separate multiple IP addresses with commas (,).

    • <data_ip>: Refers to the storage IP of the failed node.
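
    For illustration, removing an unresponsive failed node whose storage IP is 192.168.20.13 (a placeholder address, and the only offline node in this example) might look like this:

    ```shell
    # Run on any healthy node in the cluster (not a failed node).
    # 192.168.20.13 is a placeholder storage IP; it is listed in
    # --offline_host_ips because the node is unresponsive.
    zbs-deploy-manage meta_remove_node \
        --offline_host_ips 192.168.20.13 \
        --keep_zbs_meta \
        192.168.20.13
    ```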

  5. Add the new node to the cluster by following the instructions in the Adding hosts to the cluster section. When configuring the new node, use the hostname and network configuration you recorded from the failed node. Physical disk usage can follow the failed node's configuration or be adjusted as needed.

  6. If the removed failed node was a master node, convert the new node from a storage node to a master node. For details, refer to the Converting the node role section.

  7. Migrate the previously relocated virtual machines back to the new node as needed.