
Configuring nodes

The VM-based workload cluster contains a control plane node group and at least one worker node group; all nodes within a group share the same configuration.

Node group autoscaling

If the VM-based workload cluster you are creating requires a worker node group with the Autoscaling node count type, enable Node group autoscaling so that you can set the autoscaling range for that worker node group in a later step.

Note:

You can also enable or disable Node group autoscaling after the VM-based workload cluster is created.

Control plane node group

  1. Set the node group name.

  2. Specify the number of nodes for the node group. You can choose 1, 3, or 5. To ensure control plane high availability for the VM-based workload cluster, it is recommended to select 3 or 5.

  3. Configure the resources for each node within the node group.

    • CPU: The number of vCPUs allocated per node in the group. The default is 4 vCPUs; the minimum is 2 vCPUs.
    • Memory: The amount of memory allocated per node in the group. The default is 8 GiB; the minimum is 6 GiB.
    • Storage: The amount of storage allocated per node in the group, which is the disk capacity of the corresponding virtual machine. The default is 200 GiB and cannot be modified.
  4. (Optional) If the node group has more than 1 node, you can enable Faulty node auto replacement. Once enabled, as long as the proportion of faulty nodes in the node group does not exceed the configured maximum, the system automatically removes faulty nodes and creates new ones. Enabling this feature requires configuring the following parameters:

    • Faulty node detection: The conditions for identifying faulty nodes. Tick the required conditions and set a duration threshold for each.

      Note:

      • If you set the duration threshold for nodes in the Not Ready or Unknown state, set it to no less than 5 minutes. The virtual machine high availability (HA) feature needs about 130 seconds to recover a failed VM, and the node needs additional time to become available again; a shorter threshold may flag already-recovered VMs as faulty.
      • If you set the duration threshold for nodes that have not started, set it to no less than 5 minutes so that nodes are not flagged as faulty before they finish starting.

    • Maximum percentage of faulty nodes: The highest proportion of faulty nodes in the node group that can be replaced automatically. If the number of faulty nodes exceeds this percentage, automatic replacement is not performed. The value cannot exceed 40%. For example, with 5 nodes in the group and this value set to 40%, up to 2 faulty nodes are replaced automatically; 3 or more are not.
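The percentage limit above reduces to a cap on the faulty node count. A minimal sketch of that arithmetic (the function names are ours, and the round-down behavior is inferred from the documented 5-node example, not confirmed by the platform):

```python
import math

def max_auto_replaceable(total_nodes: int, max_faulty_pct: int) -> int:
    """Largest number of faulty nodes the system will still replace,
    assuming the percentage limit rounds down, as the 5-node / 40%
    example above implies."""
    return math.floor(total_nodes * max_faulty_pct / 100)

def will_replace(faulty: int, total_nodes: int, max_faulty_pct: int) -> bool:
    # Replacement happens only while the faulty count stays at or
    # below the computed cap.
    return faulty <= max_auto_replaceable(total_nodes, max_faulty_pct)

print(max_auto_replaceable(5, 40))   # 2, matching the example above
print(will_replace(3, 5, 40))        # False: 3 faulty nodes exceed the cap
```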

Worker node group

You need to create at least one worker node group.

  1. Set the node group name.

  2. Specify the number of nodes for the node group.

    • Fixed count: Enter the number of nodes, with a minimum of 1.

    • Autoscaling (only available when Node group autoscaling is enabled): Enter the minimum and maximum numbers of nodes to define the node count range. The minimum cannot be less than 3. After the cluster is created, the node group starts at the minimum count, and the count is then adjusted automatically within the range as follows:

      • Automatic addition:
        • When the VM-based workload cluster has unschedulable pods and adding nodes would make them schedulable, the system automatically creates new nodes that meet those pods' requirements.
        • If GPU devices are configured for the node group, and the number of available GPU devices in the cluster is less than the requested amount while mountable GPU devices of the same type remain on the host, the system automatically creates new nodes and mounts GPU devices to them.
      • Automatic reduction: The trigger depends on whether GPU devices are configured for the node group.
        • Without GPU devices: when the larger of a worker node's CPU and memory request ratios stays below 50% for 10 consecutive minutes and the pods on that node can be evicted, the system evicts the pods and deletes the node. CPU request ratio = sum of CPU requests / total CPU; memory request ratio = sum of memory requests / total memory.
        • With GPU devices: only when the node's GPU device request ratio stays below 50% for 10 consecutive minutes and the pods on that node can be evicted does the system evict the pods and delete the node. GPU device request ratio = sum of GPU device requests / total GPU devices.
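The scale-down check above can be sketched as a comparison of request ratios against the 50% threshold. This is a simplified illustration only: the function and parameter names are ours, and the 10-minute observation window and pod evictability checks are not modeled.

```python
def request_ratio(requests: list[float], capacity: float) -> float:
    """Sum of resource requests on a node divided by its total capacity."""
    return sum(requests) / capacity

def is_scale_down_candidate(cpu_requests: list[float], cpu_total: float,
                            mem_requests: list[float], mem_total: float,
                            threshold: float = 0.5) -> bool:
    # The node qualifies only if the LARGER of its CPU and memory request
    # ratios stays below the 50% threshold (continuously for 10 minutes,
    # which this sketch does not model).
    ratio = max(request_ratio(cpu_requests, cpu_total),
                request_ratio(mem_requests, mem_total))
    return ratio < threshold

# Node with 4 vCPUs and 8 GiB: requests of 1.0 + 0.5 vCPU and 2 + 1 GiB
print(is_scale_down_candidate([1.0, 0.5], 4, [2, 1], 8))  # True (max ratio 0.375)
print(is_scale_down_candidate([3.0], 4, [2], 8))          # False (CPU ratio 0.75)
```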
  3. Configure the resources for each node within the node group.

    • CPU: The number of vCPUs allocated per node in the group. The default is 4 vCPUs; the minimum is 4 vCPUs.
    • Memory: The amount of memory allocated per node in the group. The default is 8 GiB; the minimum is 8 GiB.
    • GPU (displayed only when the ACOS cluster hosting the VM-based workload cluster has hosts with mounted GPU devices): The GPU configuration per node in the group. Not configured by default. To configure GPU devices, choose "Passthrough" or "vGPU" based on the information planned when confirming the GPU device requirements for the workload cluster, then set the model and quantity of passthrough GPU devices or vGPUs for each node.

      Note:

      In the GPU model drop-down list, hover over the information icon "i" next to the available quantity to view the number of available GPUs on each host.

    • Storage: The amount of storage allocated per node in the group, which is the disk allocation capacity of the corresponding virtual machine. The default is 200 GiB; the minimum is 200 GiB.
  4. (Optional) Enable Faulty node auto replacement. Once enabled, when the number of faulty nodes in the node group satisfies the configured faulty node quantity limit, the system automatically deletes the faulty nodes and creates new ones. Enabling this feature requires configuring the following parameters:

    • Faulty node detection: The conditions for identifying faulty nodes. Tick the required conditions and set a duration threshold for each.

      Note:

      • If you set the duration threshold for nodes in the Not Ready or Unknown state, set it to no less than 5 minutes. The virtual machine high availability (HA) feature needs about 130 seconds to recover a failed VM, and the node needs additional time to become available again; a shorter threshold may flag already-recovered VMs as faulty.
      • If you set the duration threshold for nodes that have not started, set it to no less than 5 minutes so that nodes are not flagged as faulty before they finish starting.

    • Faulty node quantity limit: The conditions on the number of faulty nodes that trigger node replacement. Choose one of the following dimensions:
      • Faulty node percentage: The highest proportion of faulty nodes in the node group that can be replaced automatically. If the number of faulty nodes exceeds this percentage, automatic replacement is not performed. For example, with 6 nodes in the group and this value set to 40%, up to 2 faulty nodes are replaced automatically; 3 or more are not.
      • Faulty node count range: The range of faulty node counts that triggers replacement. Enter the minimum and maximum numbers of faulty nodes to define the range. For example, with 10 nodes in the group and a configured range of 6-8, faulty nodes are replaced automatically only when their count is 6, 7, or 8; otherwise they are not replaced.
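The count-range dimension above reduces to a simple containment check on the current faulty node count. A minimal sketch with illustrative names (not the platform's API):

```python
def in_replacement_range(faulty_count: int, min_faulty: int, max_faulty: int) -> bool:
    """Replacement triggers only while the faulty count lies inside the
    configured inclusive range, matching the 10-node / 6-8 example above."""
    return min_faulty <= faulty_count <= max_faulty

# 10-node group configured with a 6-8 range:
print([n for n in range(1, 11) if in_replacement_range(n, 6, 8)])  # [6, 7, 8]
```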

Node access configuration

You need to configure the default account password or SSH public key for accessing nodes in the VM-based workload cluster.

  • Default account password: Enter the password for the default account admin. Leave blank if not configuring. Note that you need to enter your password twice to confirm it.
  • SSH public key: Enter the SSH public key. You can manually input the key or extract it from a file. Leave blank if not configuring.
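If you extract the public key from a file, a quick shape check before pasting can catch accidentally supplied private keys or truncated input. This helper is our own illustration, not part of the platform, and only checks the general OpenSSH public-key format rather than performing any cryptographic validation:

```python
import base64

def looks_like_ssh_public_key(text: str) -> bool:
    """Rough shape check for an OpenSSH public key line such as
    'ssh-ed25519 AAAA... comment'. Not a cryptographic validation."""
    parts = text.strip().split()
    if len(parts) < 2:
        return False
    key_type, blob = parts[0], parts[1]
    if key_type not in ("ssh-rsa", "ssh-ed25519", "ecdsa-sha2-nistp256",
                        "ecdsa-sha2-nistp384", "ecdsa-sha2-nistp521"):
        return False
    try:
        # The second field must be valid base64.
        base64.b64decode(blob, validate=True)
    except ValueError:
        return False
    return True

# A private-key header pasted by mistake is rejected:
print(looks_like_ssh_public_key("-----BEGIN OPENSSH PRIVATE KEY-----"))  # False
```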