ACOS 6.3.0

Arcfra Cloud Operating System

What's in this release

What's new

Virtualization

Virtual machine high availability (HA)
- Supports triggering alarms on virtual machine HA success or failure.
- Supports configuring the HA operation as Hot migrate VM or Rebuild VM for virtual machine network failure scenarios.
- Supports enabling HA for virtual machines that have SR-IOV passthrough NICs or vGPUs mounted.
Supports configuring the queue count for virtual NICs.
Supports enabling IO Threads and configuring disk parameters for virtual machines via the command line.
Supports batch selecting virtual machines and upgrading their Arcfra VMTools in one click.
Supports editing the name and description of virtual machine snapshots.
Supports configuring full-copy cloning speed limit for a cluster.
Supports enabling hot migration encryption for the cluster.
Supports creating placement group policies at the availability zone level.

Block storage

Supports configuring multiple physical disk pools on a single node, with each pool corresponding to a Chunk instance to enhance the storage performance limit of a single node.
The recycle bin now supports NFS files and iSCSI LUNs.
Supports access control to iSCSI LUNs using client hosts or client host groups.
Supports performing encryption at rest for volume data using the built-in key management service.
Supports enabling the encryption acceleration feature to enhance the performance of encrypted volumes.

Networking

Supports traffic mirroring, enabling the duplication and forwarding of traffic from a specified network to a remote analysis platform via an ERSPAN tunnel. Supports flexible selection of a dedicated mirroring egress network or reusing the management network as the tunnel source.
A virtual distributed switch supports IGMP/MLD snooping. When enabled, the virtual distributed switch listens to IGMP/MLD packets and dynamically maintains multicast group members. This ensures precise forwarding of multicast traffic, reducing network load.
When RDMA is enabled for the storage network, you can select and bond multiple network ports across NICs within the associated virtual distributed switch.
When RDMA is enabled for the storage network, flow control is configured automatically after the node's Chunk service is restarted or an ifdown operation is performed on the relevant physical network port.
The storage network supports enabling or disabling RDMA online.
Improves the sensitivity and timeliness of alert triggering by adjusting the threshold parameters of certain network alerts, shortening detection intervals, and reducing trigger latency.
Supports configuring the vnet NIC queue length via the command line.
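
To illustrate the IGMP/MLD snooping item above, here is a minimal Python sketch of how a switch could keep a per-group membership table and forward multicast frames only to ports that have joined. It is not ACOS code: the class name `SnoopingTable`, its methods, and the port names are hypothetical, and real snooping also deals with queriers, timers, and router ports.

```python
from collections import defaultdict


class SnoopingTable:
    """Hypothetical model of a multicast group membership table."""

    def __init__(self):
        # group address -> set of ports with at least one listener
        self.members = defaultdict(set)

    def on_report(self, group, port):
        """A membership report (join) for `group` was seen on `port`."""
        self.members[group].add(port)

    def on_leave(self, group, port):
        """A leave/done message for `group` was seen on `port`."""
        self.members[group].discard(port)
        if not self.members[group]:
            del self.members[group]

    def egress_ports(self, group, ingress_port, all_ports):
        """Ports that a multicast frame for `group` should be sent to."""
        if group not in self.members:
            # Simplifying assumption: with no learned state, flood like a plain switch.
            return set(all_ports) - {ingress_port}
        return self.members[group] - {ingress_port}
```

In this sketch, once two of four ports have reported membership for a group, a frame for that group arriving on a third port is sent only to those two ports instead of being flooded to all of them, which is the load reduction the item describes.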

Operations and management

In tiered mode, when all nodes use the same type of SSDs (NVMe SSDs or SATA/SAS SSDs), the physical disk partitioning method can be determined by modifying the system disk purpose.
- When the system disk purpose is set to "Data disk with metadata partition": each physical disk includes both a cache partition and a data partition. This configuration is suitable for scenarios where all physical disks are of the same SSD type and have identical attributes.
- When the system disk purpose is set to "Cache disk with metadata partition": cache and data disks are deployed separately. This configuration is suitable for scenarios where all physical disks are of the same SSD type but have different attributes.
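
The choice between the two system disk purposes above can be summarized as a small decision helper. This is an illustrative Python sketch only: the function name `suggest_system_disk_purpose`, the `Disk` fields, and the use of a generic `attributes` tuple to stand in for whatever disk attributes are compared are all assumptions, not part of ACOS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    ssd_type: str      # e.g. "NVMe" or "SATA/SAS"
    attributes: tuple  # hypothetical stand-in, e.g. (model, capacity_gb)


def suggest_system_disk_purpose(disks):
    """Return the system disk purpose matching the two cases described above."""
    if not disks:
        raise ValueError("at least one disk is required")
    if len({d.ssd_type for d in disks}) != 1:
        raise ValueError("this choice applies only when all disks use the same SSD type")
    if len({d.attributes for d in disks}) == 1:
        # Same SSD type, identical attributes: cache and data partitions share each disk.
        return "Data disk with metadata partition"
    # Same SSD type, different attributes: cache and data disks are deployed separately.
    return "Cache disk with metadata partition"
```

The helper only encodes the two documented cases; it rejects mixed SSD types rather than guessing, because the notes describe this option only for clusters where all nodes use the same type of SSDs.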
Supports viewing the NUMA node ID of the physical disk and network port.
L2 ping supports maintenance mode.
Supports unified modification of the SSH service port on all hosts in a cluster.
Supports forcibly synchronizing the cluster system time to the NTP server time via the UI and provides clearer error messages for NTP server issues.
Supports modifying the cluster timezone via the command line.
Supports configuring the ACOS network service to either exclusively use CPU resources or share them with system services through the command line.
Boost mode is enabled by default during ACOS (AVE) cluster installation and deployment.
Supports GPG signature verification for ACOS RPM, enhancing the integrity and security of system software packages.
Adds an alert for the memory pressure event in a memory group.
Adds an alert for excessive load of network port ingress and egress traffic.
Adds monitoring and display of the status of member physical disks in hardware RAID.

Kernel

Kernel version has been upgraded to 5.10.0-247.0.0.
Adds an alert for host TSC calibration failure.
The MegaRAID driver has been upgraded to version 07.735.03.00.

Improvements

Virtualization

Optimizes cross-cluster cold migration, cross-cluster staged migration, and virtual machine cloning mechanisms so that the PCI addresses of virtual NICs remain unchanged after migration or cloning.
Optimizes the cross-cluster hot migration mechanism in non-Boost mode by transferring only valid data during migration, thus reducing migration time.
Optimizes the HA mechanism: in certain failure scenarios, if a virtual machine is suspended and cannot be recovered, the system now first forcibly shuts down the virtual machine and then performs local HA recovery.
Improves the ISO upload mechanism to reduce the possibility of task timeouts.
Optimizes the service leader election mechanism to no longer rely on the ZBS Meta Leader, improving virtual machine HA service handling in abnormal scenarios.

Block storage

The following improvements have been made to enhance storage performance:
- Increases the number of data channels between nodes. When using balance-tcp mode for network port bonding on a VDS associated with the storage network, multiple network ports associated with the storage network can be fully utilized.
- Configures interrupts for the network ports corresponding to the storage network and access network.
- Optimizes the I/O handling logic for LSM and Access.
- Leverages Intel DSA for hardware acceleration.
- IO Uring is enabled by default to reduce LSM I/O resource consumption.
- Reduces the instantaneous I/O drop of virtual volumes after snapshot creation.
- The virtual volume defaults to using 8 stripes.
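
One way to see why more data channels help under balance-tcp bonding: bonds in this style typically choose a member port per flow by hashing header fields such as IP addresses and TCP ports, so a single connection stays on one port and only several distinct connections can be spread across ports. The Python sketch below illustrates that idea under stated assumptions; the hash choice, function names, port names, addresses, and port numbers are hypothetical and do not reflect the actual ACOS or virtual switch implementation.

```python
import zlib


def pick_port(flow, ports):
    """Map a flow tuple to one bond member port using a stable hash."""
    key = "|".join(str(field) for field in flow).encode()
    return ports[zlib.crc32(key) % len(ports)]


def ports_used(flows, ports):
    """Set of member ports that carry at least one of the given flows."""
    return {pick_port(flow, ports) for flow in flows}


bond = ["eth0", "eth1"]  # hypothetical bonded storage ports

# One connection between two nodes: always pinned to a single member port.
single = [("10.0.0.1", "10.0.0.2", 50000, 10201, "tcp")]

# Several connections between the same nodes: each is hashed independently,
# so together they have the chance to use more than one member port.
several = [("10.0.0.1", "10.0.0.2", 50000 + i, 10201, "tcp") for i in range(8)]
```

Calling `ports_used(single, bond)` always yields exactly one port, whereas `ports_used(several, bond)` can cover both, which is the effect that increasing the number of data channels between nodes aims for.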
Optimizes data migration and recovery strategies:
- Node removal will automatically stop when the cluster's space load is high.
- Migration tasks are generated separately for the capacity tier and the performance tier, improving overall efficiency when large volumes of data need to be migrated.
- During node removal, if any Chunk instance enters an extremely high load state, capacity balancing is prioritized to ensure the node removal completes as quickly as possible.
- Optimizes the source node selection strategy during data migration and recovery to improve migration and recovery speed for sparse volumes.
- When unmounting a physical disk, data is preferentially read from other healthy disks containing data replicas to accelerate data migration, thereby reducing operation time.
- The number of parallel recovery or migration tasks on a node can now be modified via configuration files.
- Refines the data tiering strategy: when the performance tier is under high load, relatively cold data is sunk earlier to ensure frequently accessed data is more likely to remain in the performance tier.
Optimizes the throttling mechanism for snapshot and clone operations on virtual volumes with large amounts of data in the performance tier, allowing throttling to ramp up smoothly and recover gradually as performance-tier space grows rapidly, thereby avoiding sudden I/O drops.
Raises the upper limit of the total cache partition capacity for a single node. The upper limit of the total cache partition capacity for a single physical disk pool is 51 TiB.
Increases the cache ratio upper limit that a single node can reserve for pinned volumes to 75%.
Allows pinned volumes to use thin provisioning; a thin-provisioned pinned volume does not occupy capacity-tier space.
Volume pinning cannot be disabled for a volume when the write cache capacity is high.
The minimum configuration for an active-active cluster is adjusted to at least 2 nodes in each availability zone.
The timeout for the storage maintenance mode is extended to 7 days.
Allows manually relocating specified data blocks via the command line, provided cluster space allocation rules are met.
By default, the I/O cache module, which is no longer needed in the new storage architecture, is disabled.
When the cluster cache load is high, continuous large blocks of data are more reliably written directly to the capacity tier.

Networking

Supports migrating the default VM network to another virtual distributed switch via the network-tool.
Adds a pre-alert for high usage of the host conntrack table and a real-time alert for a full conntrack table, notifying users to handle situations with an excessive number of connections to avoid impacts on network communication within the host.

Operations and management

Supports automatically isolating high-latency SSDs when conditions are secure.
Modifies UDEV rules to ensure the uniqueness of physical disk ID links.
Adjusts timemachine-related alert rules by removing rules that do not apply to snapshot plans and modifying the descriptions of some rules.
Adds an upgrade pre-check item to prevent clusters with expired licenses from being upgraded.
Optimizes the process for enabling the SR-IOV feature on Mellanox NICs, reducing the number of times the host needs to restart.
Optimizes the fluent-bit service status check logic to prevent cluster upgrade failures caused by failed checks.

Resolved issues

Virtualization

During virtual machine hot migration, if the source host restarted the libvirtd service due to maintenance or a failure, it might have caused the hot migration and subsequent virtual machine operations to fail. The issue has been resolved in this release.
During virtual machine hot migration, continuously imposing high-level limits on vCPU performance might have caused virtual machine service abnormalities. The issue has been resolved in this release.
During cross-cluster hot migration, files on the destination virtual machine could be mistakenly deleted by a scheduled cleanup task, causing the migration to fail. The issue has been fixed in this release.
After performing cross-cluster virtual machine migration, the virtual machines might have failed to be migrated back to the source cluster due to the scheduled resource deletion tasks with long cycles. The issue has been resolved in this release.
In rare cases, a virtual machine task that was triggered but not fully initialized before a node failure might have caused the virtual machine HA to fail. The issue has been fixed in this release.
In AMD x86_64 clusters, the CPU compatibility configuration list might have unexpectedly included the Dhyana option. The issue has been resolved in this release.
After creating a virtual machine from a VM template imported with an OVF file, hot migrating that virtual machine across clusters might have failed due to missing fields. The issue has been resolved in this release.
After failing to access the ZBS service with disk lock, libvirt might have unexpectedly restarted due to the release of uninitialized memory. The issue has been resolved in this release.
After cold migrating a CPU-exclusive virtual machine to a heterogeneous node, moving it to the recycle bin might have failed. The issue has been resolved in this release.
Editing the NIC IP of Ubuntu or Debian virtual machines via Arcfra VMTools might have altered the virtual machine's existing DNS configuration. The issue has been resolved in this release.
Virtual machine cross-cluster hot migration in Boost mode might have caused disk I/O limits to become ineffective. The issue has been resolved in this release.
After virtual machine startup, adding a new NIC might have failed if an existing DHCP NIC did not connect automatically. The issue has been resolved in this release.
After configuring DNS servers for a virtual machine via Arcfra VMTools, a virtual machine reboot might have caused the gateway configuration not to take effect. The issue has been resolved in this release.
When Boost mode was not enabled, cross-cluster hot migration might have reported inaccurate progress. The issue has been resolved in this release.
After changing the VM network assigned to the virtual machine NIC to another VM network associated with the same VDS, virtual machine hot migration might have caused ANS features applied to the virtual machine (such as speed limiting and traffic mirroring) to become ineffective. The issue has been resolved in this release.
Configuring a static IP for a virtual machine via Arcfra VMTools might have failed if the operating system's nmcli tool version was outdated. The issue has been resolved in this release.
Cross-cluster hot migration might have failed when a virtual machine had mounted volumes with both volume pinning and data-at-rest encryption enabled. The issue has been resolved in this release.
In tiered storage mode, the logical disk space displayed in virtual machine monitoring was incorrect. The issue has been resolved in this release.
After performing a one-click replica factor increase on a cluster, creating virtual machines via fast copy from templates cloned from the virtual machine within that cluster might have failed. The issue has been resolved in this release.
After configuring a static IP address for a virtual machine with multiple NICs via Arcfra VMTools, the IP address might not match the expected one after a virtual machine reboot. The issue has been resolved in this release.
After editing the DNS server address using Arcfra VMTools, configuring the short domain name might have failed because certain configuration items were mistakenly cleared. The issue has been resolved in this release.
After configuring DNS on a Linux virtual machine via Arcfra VMTools, the NetworkManager service might have started unexpectedly. The issue has been resolved in this release.
Editing the NIC IP address might have failed if the HWADDR in the VM network configuration file did not match the NIC MAC address. The issue has been resolved in this release.
Modifying certain cluster configurations during virtual machine cross-cluster hot migration might have caused the virtual machines to enter an unexpected state. The issue has been resolved in this release.
For virtual machines with no VNC connections, the QEMU USB 1.0 emulation optimization might have caused degradation to their CD-ROM read performance. The issue has been resolved in this release.
Under certain network failure conditions, virtual machine cross-cluster hot migration might have hung for an extended period and could have been canceled. The issue has been resolved in this release.
After enabling Windows optimization for a Windows virtual machine, the virtual machine might have hung during restart due to an incorrect TSC reset. The issue has been resolved in this release.
In clusters with insufficient resources, after a quick recovery of a faulty host, virtual machines on that host might not have been automatically restored via HA. The issue has been resolved in this release.
Migrating a virtual machine while the elf-vm-monitor service on the host was not running could cause an unexpected virtual machine HA rebuild. The issue has been resolved in this release.
Arcfra VMTools might not function properly after encountering a command execution exception. The issue has been resolved in this release.
When a virtual machine in a cluster failed to meet placement group policies and triggered an alert, migrating that virtual machine to meet the requirements might not have automatically resolved the alert, or might have triggered repeated false alerts. The issue has been resolved in this release.

Block storage

When the storage network had RDMA enabled, storage services might have failed to start if a NIC became abnormal. The issue has been resolved in this release.
When a cluster used ZooKeeper version v3.9.9-27 or earlier, the zookeeper service might have become abnormal, or the master node might have failed to connect to the Meta Leader. The issue has been resolved in this release.
The iSCSI link might have repeatedly disconnected and reconnected. The issue has been resolved in this release.
The data_channel_bench.py script might have intermittently produced no results for some test cases. The issue has been resolved in this release.
In clusters with all-flash configuration, the occasional slowness on physical disks might have impacted service I/O. The issue has been mitigated by dynamically adjusting the I/O timeout for physical disks.
When executing the XCOPY command across iSCSI targets, a connection interruption might have caused the Chunk service to crash. The issue has been resolved in this release.
When ESXi formatted an iSCSI LUN as VMFS, virtual machines running on it might have encountered errors when using the fstrim or blkdiscard command to reclaim space. The issue has been resolved in this release.
Throttles external client I/O to prevent Chunk service failures.
Intermittent high I/O latency occurred when the cluster had no business I/O load. The issue has been resolved in this release.
The number of automatically created snapshot groups retained by the snapshot plan exceeded the configured maximum retention limit. The issue has been resolved in this release.
An unauthorized interface vulnerability was detected during vulnerability scanning. The issue has been resolved in this release.
The zookeeper service might have failed to exit when it encountered rare exceptions. The issue has been resolved in this release.

Networking

The network port isolation status might not have matched expectations. The issue has been resolved in this release.
When port bonding mode was set to balance-tcp, the port connectivity test might have reported abnormal results. The issue has been resolved in this release.
Frequent network port abnormal alerts occurred due to instability in the health check service. The issue has been resolved in this release.
The system reported incorrect packet loss rate in monitoring charts. The issue has been resolved by adjusting the packet loss rate calculation method.

Operations and management

The ntpd service might not have started due to time rollback after the systemd configuration file was generated or the host was rebooted. The issue has been resolved in this release.
After the cluster was expanded, subsequent upgrades failed if the upgrade tool version on the new nodes differed from that on the existing nodes. The issue has been resolved in this release.
During port scanning, a surge in ntpm HTTP metrics might have caused monitoring and alert service failures. The issue has been resolved in this release by disabling the collection of the relevant metrics by default.
Cached data in the browser might have caused unexpected behavior during cluster deployment. The issue has been resolved in this release.
Some fans on the node had a significantly different speed compared to others. The issue has been resolved in this release.
In UEFI boot mode, newly added kernel parameters might have failed to take effect during cluster upgrade due to inaccurate boot disk rebuild directories. The issue has been resolved in this release.
Removable physical disks could not be selected when installing an ISO on a host. The issue has been resolved in this release.
The threshold for the NTP server's maximum round-trip delay was set too low, causing false alerts indicating that the host could not synchronize time with the NTP server. The issue has been resolved in this release.
The log rotation configuration was written to an incorrect path, causing the logs to fail to rotate. The issue has been resolved in this release.
Cluster upgrade might have failed after rebuilding the witness node in an active-active cluster. The issue has been resolved in this release.
Historical monitoring chart data could not be displayed after a cluster upgrade. The issue has been resolved in this release.
Scheduled tasks failed due to root account expiration, leading to log rotation failures and disk full issues. The issue has been resolved in this release.
Scaling up a cluster failed due to cgroup check failure. The issue has been resolved in this release.
If a USB device was mounted to a virtual machine that had been migrated off a host entering maintenance mode, the virtual machine failed to migrate back automatically to the source host when the host exited maintenance mode. The issue has been resolved in this release.
After forcibly shutting down a node in an ACOS (AVE) cluster, the alert for node network unreachability was triggered with excessive delay. The issue has been resolved in this release.
When collected logs on a node exceeded 3 GB, a surge of log cleanup tasks was triggered within a short period of time, consuming excessive CPU and memory resources in system services and resulting in false alerts. The issue has been resolved in this release.
When the cluster log file size exceeded 7 GB, log files downloaded from some nodes might have failed to be decompressed. The issue has been resolved in this release.
When a cluster NTP server used an unresolvable domain name, an exception alert might not have been triggered correctly. The issue has been resolved in this release.
MongoDB log files might have consumed abnormal disk space due to system time jumps on a node. The issue has been resolved in this release.
Stopping or restarting the network service under unexpected network port configurations might have cleared the DNS settings. The issue has been resolved in this release.

Kernel

LLDP functionality on Intel X710 NIC was not working. The issue has been resolved in this release.
When the system root file system (rootfs) backend storage was a single block device, the SCSI disk offline service might have erroneously triggered device offline events. The issue has been resolved in this release.
