Configuring nodes

The VM-based workload cluster contains a control plane node group and at least one worker node group; nodes in the same group share the same configuration.

Node group autoscaling

If the VM-based workload cluster being created requires a worker node group with the Autoscaling node count type, enable Node group autoscaling so that you can set the autoscaling range for that worker node group in the subsequent steps.

Note:

You can also enable or disable Node group autoscaling after the VM-based workload cluster is created.

Control plane nodes

  1. Set the node group name.

  2. Specify the number of nodes for the node group. You can choose 1, 3, or 5. To ensure control plane high availability for the VM-based workload cluster, it is recommended to select 3 or 5.

  3. Configure the resources for each node within the node group.

    Parameter descriptions:

    • CPU: The number of vCPUs allocated per node in the group. The default is 4 vCPUs, with a minimum of 2 vCPUs.
    • Memory: The amount of memory allocated per node in the group. The default is 8 GiB, with a minimum of 6 GiB.
    • Storage: The amount of storage allocated per node in the group, which is the disk capacity of the corresponding virtual machine. The default is 200 GiB and cannot be modified.
  4. (Optional) If the number of nodes in the node group is greater than 1, you can enable Faulty node auto replacement. Once enabled, as long as the proportion of faulty nodes in the node group does not exceed the configured maximum, the system automatically removes the faulty nodes and creates new ones. Enabling this feature requires configuring the following parameters.

    Parameter descriptions:

    • Faulty node detection: Criteria for identifying faulty nodes. Tick the required conditions and set a duration threshold for each.

      Notes:

      • If you set the duration threshold for nodes in the Not Ready or Unknown state, it is recommended to set it to no less than 5 minutes. The virtual machine high availability (HA) feature needs a certain recovery time (130 seconds), and the node needs additional time to become available again; a shorter threshold could identify virtual machines that have already recovered through HA as faulty nodes.
      • If you set the duration threshold for nodes that have not started, it is recommended to set it to no less than 5 minutes to avoid nodes being identified as faulty before they finish starting.
    • Maximum percentage of faulty nodes: The system automatically replaces faulty nodes only while the percentage of faulty nodes stays within this maximum, which cannot exceed 40%. For example, if the node group has 5 nodes and this setting is 40%, faulty nodes are automatically replaced when there are at most 2 of them; if there are more than 2, they are not replaced.
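The replacement threshold above can be sketched as a quick calculation. This is a minimal illustration only; the function name and the flooring behavior are assumptions for the example, not part of the product:

```python
import math

def max_replaceable_faulty_nodes(total_nodes: int, max_faulty_percent: float) -> int:
    """Hypothetical helper: the largest faulty-node count that still
    permits automatic replacement, assuming the limit is floored."""
    return math.floor(total_nodes * max_faulty_percent / 100)

# With 5 nodes and a 40% limit, at most 2 faulty nodes are replaced.
print(max_replaceable_faulty_nodes(5, 40))  # → 2
```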

Worker node group

You need to create at least one worker node group.

  1. Set the node group name.

  2. Specify the number of nodes for the node group.

    • Fixed count: Enter the number of nodes, with a minimum of 1.
    • Autoscaling (only supported when Node group autoscaling is enabled): Enter the minimum and maximum numbers of nodes to specify the node count range. The minimum cannot be less than 3. After the cluster is created, the node group starts with the minimum number of nodes, and the node count is automatically adjusted within the set range as follows:
      • Automatic addition:
        • When the VM-based workload cluster has unschedulable pods and adding nodes would make those pods schedulable, the system automatically creates new nodes that meet the pods' requirements.
        • If the node group is configured with GPU devices, when the number of available GPU devices in the cluster is less than the requested number and the hosts still have GPU devices that can be mounted, the system automatically creates new nodes and mounts GPU devices on them.
      • Automatic reduction (the trigger depends on whether the node group is configured with GPU devices):
        • Without GPU devices: When the larger of a worker node's CPU and memory request ratios stays below 50% for 10 consecutive minutes and the pods on that node can be evicted, the system automatically evicts the pods and deletes the node. CPU request ratio = sum of CPU requests / total CPU; memory request ratio = sum of memory requests / total memory.
        • With GPU devices: Only when the node's GPU device request ratio stays below 50% for 10 consecutive minutes and the pods on that node can be evicted does the system automatically evict the pods and delete the node. GPU device request ratio = sum of GPU device requests / total GPU devices.
  3. Configure the resources for each node within the node group.

    Parameter descriptions:

    • CPU: The number of vCPUs allocated per node in the group. The default is 4 vCPUs, with a minimum of 4 vCPUs.
    • Memory: The amount of memory allocated per node in the group. The default is 8 GiB, with a minimum of 8 GiB.
    • GPU (shown only when the hosts in the ACOS cluster where the workload cluster resides have available GPU devices): GPU configuration per node in the group; not configured by default. If you need GPU devices, choose Passthrough or vGPU based on the plan made in Requirements for using GPU devices, and then set the model and quantity of passthrough GPU devices or vGPUs for each node.
    • Storage: The amount of storage allocated per node in the group, which is the disk capacity of the corresponding virtual machine. The default is 200 GiB and cannot be modified.
  4. (Optional) Enable Faulty node auto replacement. Once enabled, when the number of faulty nodes in the node group meets the configured faulty node count limit, the system automatically deletes the faulty nodes and creates new ones. Enabling this feature requires setting the following parameters:

    Parameter descriptions:

    • Faulty node detection: Conditions for determining node failure. Select the required conditions and set a duration threshold for each.

      Notes:

      • If you set the duration threshold for nodes in the Not Ready or Unknown state, it is recommended to set it to no less than 5 minutes. The virtual machine high availability (HA) feature needs a certain fault recovery time (130 seconds), and the node needs additional time to become available again; a shorter threshold could identify virtual machines that have already recovered through HA as faulty nodes.
      • If you set the duration threshold for nodes that have not started, it is recommended to set it to no less than 5 minutes to avoid nodes being identified as faulty before they finish starting up.
    • Faulty node count limit: The conditions under which the system performs automatic node replacement. Choose one of the following dimensions:
      • Faulty node percentage: The maximum percentage of faulty nodes in the node group that can be automatically replaced. If the number of faulty nodes exceeds this percentage, automatic replacement is not performed. For example, if the node group has 6 nodes and this item is set to 40%, faulty nodes are automatically replaced when there are at most 2 of them; if there are more than 2, they are not replaced.
      • Faulty node count range: The range of faulty node counts that triggers automatic node replacement. Enter the minimum and maximum numbers of faulty nodes to specify the range. For example, if the node group has 10 nodes and the range is 6 to 8, faulty nodes are automatically replaced only when there are 6, 7, or 8 of them; otherwise, they are not replaced.
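The automatic reduction condition described in step 2 can be sketched as a simple check. This is a hypothetical illustration under stated assumptions: the eviction checks and the 10-minute observation window are omitted, and the function name is invented for the example:

```python
def can_scale_down(cpu_requests, cpu_total, mem_requests, mem_total,
                   threshold: float = 0.5) -> bool:
    """Hypothetical sketch of the scale-down condition for a non-GPU node:
    the node is a removal candidate only when the LARGER of its CPU and
    memory request ratios is below the threshold."""
    cpu_ratio = sum(cpu_requests) / cpu_total      # sum of CPU requests / total CPU
    mem_ratio = sum(mem_requests) / mem_total      # sum of memory requests / total memory
    return max(cpu_ratio, mem_ratio) < threshold

# A 4-vCPU / 8 GiB node with 1 vCPU and 5 GiB requested:
# CPU ratio 0.25, memory ratio 0.625 → memory keeps the node alive.
print(can_scale_down([0.5, 0.5], 4, [2, 3], 8))  # → False
```

Note that both ratios must be low: a node that is idle on CPU but heavily requested on memory is not removed.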

Node access configuration

You need to configure the default account password or SSH public key for accessing nodes in the VM-based workload cluster.

  • Default account password: Enter the password for the default account admin, and enter it a second time to confirm. Leave it blank if you do not want to configure a password.
  • SSH public key: Enter the SSH public key, either by typing it in or by extracting it from a file. Leave it blank if you do not want to configure a key.