ABDR 2.1.0

Basic concepts

This chapter explains the technical terms involved in the replication service. If you encounter any unfamiliar concepts while reading, you can look up their definitions and detailed explanations here.

Replication service

The replication service is responsible for managing replication plans and performing jobs such as replication, failover, failback, and permanent failover for virtual machines.

Depending on the associated cluster, the replication service is divided into the target replication service and the source replication service.

Target replication service

The replication service associated with the target cluster. It executes replication plans and failover, manages replication resources, and, during failback, reads data from the target cluster and transmits it to the source replication service. It also keeps replicated data synchronized so that replica virtual machines remain consistent with their original virtual machines.

Source replication service

The replication service associated with the source cluster, which is responsible for reading virtual machine data from the source cluster and transmitting it to the target replication service. During failback, it receives virtual machine data from the target replication service and writes it to the source cluster.

Site

A site refers to a failure domain managed by a single AOC, which can include one or more clusters and data centers. The resources of each site are managed by its dedicated AOC and are completely isolated in terms of geographical location, hardware devices, and software services, without sharing any resources with other sites.

Sites can be divided into the following two types:

- Source site: The source site is usually the site where the business is actually running. The replication service replicates virtual machines from the source site to replica virtual machines at the target site to achieve data synchronization.
- Target site: The target site is typically used for disaster recovery, storing replica virtual machines replicated from the source site.

Replication plan types

Intra-site replication

The source cluster and target cluster are located in the same site. When the original virtual machine is unavailable, services can fail over to the replica virtual machine in another cluster within the site, achieving cluster-level disaster recovery. If the AOC managing the site fails, failover is not possible.

Cross-site replication

The source cluster and target cluster are located in different sites. When the original virtual machine in the source site is unavailable, services can fail over to the replica virtual machine at the target site regardless of whether the AOC managing the source site has failed, achieving site-level disaster recovery.
Cross-site replication includes the following two types:
- Incoming replication: The current site is the target site, and the source site is another site.
- Outgoing replication: The current site is the source site, and the target site is another site.
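
The distinction between intra-site, incoming, and outgoing replication can be sketched in a few lines of Python. This is only an illustration of the definitions above: the `Cluster` class, the `classify_replication` function, and the site names are hypothetical and are not part of any AOC API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    site: str  # hypothetical identifier of the site that manages this cluster


def classify_replication(source: Cluster, target: Cluster, current_site: str) -> str:
    """Classify a replication plan as seen from current_site."""
    if source.site == target.site:
        return "intra-site"
    if current_site == target.site:
        return "cross-site (incoming)"
    if current_site == source.site:
        return "cross-site (outgoing)"
    raise ValueError("current site is neither the source nor the target site")


prod = Cluster("prod", "site-a")
dr = Cluster("dr", "site-b")
print(classify_replication(prod, dr, "site-b"))  # cross-site (incoming)
```

The same plan is reported as incoming at the target site and as outgoing at the source site, which is why the function takes the current site as an argument.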

Source cluster

Typically a production cluster. It is the cluster to which the original virtual machine belongs, hosting actual applications and data, and located at the source site.

Target cluster

Typically a disaster recovery cluster. It is a cluster that stores replica virtual machines, located at the target site.

Virtual machine

A virtual machine is a computer system implemented through virtualization technology. It runs on a physical server's virtualization platform and behaves like a physical computer. All virtual machines on a physical server share the underlying physical hardware resources, but each virtual machine has its own independent operating system, applications, and resource configuration. Virtual machines are isolated from each other, run without affecting one another, and can be used for testing, development, and production environments.

When replicating virtual machines, two virtual machine roles are involved:

Original virtual machine

A virtual machine that requires disaster recovery protection and hosts actual applications and data.

Replica virtual machine

A complete copy of the original virtual machine created through the replication plan, with the same data and configuration as the original virtual machine. Replica virtual machines are typically used for disaster recovery. In the event of a failure of the original virtual machine, the replica virtual machine can be swiftly started and take over the responsibilities of the original one.

Replication resources

Replication plans, replication objects, and restore points.

Replication object

An original virtual machine and its replica virtual machine are called a set of replication objects.

Replication plan

Replication plans are used to plan and schedule data replication operations. Through a replication plan, users can define settings such as the replication objects, replication type, replication site, target cluster, replication service, replication schedule, replication window, retention policy, and replica objects.

Replication schedule

The frequency and time at which the replication plan automatically executes, which determines the interval between data replications. A shorter interval allows for a smaller RPO (Recovery Point Objective). For example, if the replication schedule is set to every 15 minutes, the maximum potential data loss is limited to 15 minutes in case of a disaster.

Restore point

A restore point in the replication and recovery function is a specific point in time from which data can be restored. It is a snapshot created during replication, failover (in immediate restore point generation mode), or failback and saved on the replica virtual machine.

Restore points allow replica virtual machines to quickly roll back to a specific data state when needed, for example, restoring a virtual machine to a previous point in time in a disaster recovery scenario to resume business operations. Generating restore points regularly keeps the data of the replica virtual machine up to date.

Replication window

The time range during which the replication plan is allowed to execute automatically.
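
As a rough illustration, checking whether a scheduled run falls inside a replication window might look like the sketch below. The `in_window` function is hypothetical, and the handling of windows that span midnight is an assumption made for the example, not documented behavior of the replication service.

```python
from datetime import time


def in_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` lies in the half-open window [start, end).

    Assumption: a window whose end is earlier than its start wraps past
    midnight (for example 22:00 to 06:00). Equal start and end is treated
    as an empty window.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end


print(in_window(time(23, 30), time(22, 0), time(6, 0)))  # True
print(in_window(time(12, 0), time(22, 0), time(6, 0)))   # False
```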

Asynchronous replication

The replication method used by the replication service to replicate data from the source to the destination. In asynchronous replication, data is not transmitted from the source to the destination in real time; instead, it is transmitted periodically.

Full replication

Replicates all data of the virtual machine, including the operating system, applications, files, and configurations, to the replica virtual machine, ensuring that the replica virtual machine is identical to the original virtual machine. Full replication typically occurs during the initial replication to establish the initial copy of the virtual machine and provide a baseline for subsequent incremental replications.

Incremental replication

Replicates data that has changed since the last replication, continuously tracking and synchronizing changes from the data source to ensure data accuracy and consistency.
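
The difference between full and incremental replication can be illustrated with a toy model in which a virtual disk is a dictionary of numbered blocks. This is a simplified sketch only: the function names are hypothetical, and the set of changed block indexes is assumed to be known in advance, since this document does not describe how the replication service tracks changes.

```python
def full_replication(source_blocks: dict) -> dict:
    """Copy every block to create the initial replica (the baseline)."""
    return dict(source_blocks)


def incremental_replication(source_blocks: dict, replica_blocks: dict, changed: set) -> int:
    """Copy only the blocks listed in `changed`; return how many were sent."""
    for index in changed:
        replica_blocks[index] = source_blocks[index]
    return len(changed)


source = {0: b"boot", 1: b"app", 2: b"data"}
replica = full_replication(source)        # initial replication: 3 blocks copied

source[2] = b"data-v2"                    # one block changes after the baseline
sent = incremental_replication(source, replica, {2})
print(sent, replica == source)            # 1 True
```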

Disaster recovery

Recovery operations that can be performed after a failure occurs, including failover, failback, and permanent failover.

Failover

An operation performed when the original virtual machine becomes unavailable. The replica virtual machine is rolled back to a state consistent with the specified restore point and continues providing services. Failover allows services to be quickly restored to the state before the failure, ensuring application continuity and stability.

Failback

Failback refers to transferring the data that changed during the failover period back to the original virtual machine when it returns to normal, and then switching back to using the original virtual machine. Failback aims to synchronize data between the original virtual machine and the replica virtual machine, ensuring that the original virtual machine can continue to provide services after the failure is resolved.

Permanent failover

Permanent failover is a measure taken when the original virtual machine cannot recover from the failure. In this case, the replica virtual machine formally takes over the service role of the original virtual machine to ensure continuous operation and availability of the business.

RPO (Recovery Point Objective)

Recovery Point Objective refers to the acceptable time frame for data loss in a disaster recovery scenario. RPO is measured in time and indicates the time span from the most recent backup or restore point to the occurrence of the failure. A smaller RPO value indicates that the system allows less data loss in the event of a failure. A larger RPO value means that the system can tolerate a longer time range of data loss during a failure.
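
To make the definition concrete, the following sketch computes the data loss for a given failure time as the gap between the failure and the most recent restore point before it. The function name and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(restore_points, failure_time):
    """Return the time between the latest usable restore point and the failure.

    Returns None if no restore point exists at or before the failure time.
    """
    earlier = [point for point in restore_points if point <= failure_time]
    if not earlier:
        return None
    return failure_time - max(earlier)


# Restore points generated every 15 minutes: 10:00, 10:15, 10:30, 10:45.
points = [datetime(2024, 1, 1, 10, 0) + timedelta(minutes=15 * i) for i in range(4)]
loss = data_loss_window(points, datetime(2024, 1, 1, 10, 44))
print(loss)  # 0:14:00
```

With a 15-minute interval between restore points, the computed loss for any failure after the first restore point stays below 15 minutes, provided each replication run completes on schedule.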

RTO (Recovery Time Objective)

Recovery Time Objective refers to the length of time that a business system can tolerate between a failure and the recovery of operation. A shorter RTO value indicates that the business system requires rapid recovery in the event of a failure, while a longer RTO value means that the business system can tolerate a longer recovery time.

Network mapping

A configuration rule designed to address the network environment differences between the original virtual machine and the replica virtual machine.

Since the replica virtual machine and the original virtual machine are often in different network environments, completely copying the network configuration of the original virtual machine to the replica virtual machine may result in network connection issues in the target cluster's environment. By configuring network mapping rules, the VM network of the original virtual machine can be mapped to the corresponding VM network in the target cluster, ensuring that the replica virtual machine is normally connected to the network.
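
Conceptually, a set of network mapping rules behaves like a lookup table from VM networks in the source cluster to VM networks in the target cluster. The sketch below shows that idea only. The network names, the `map_networks` function, and the choice to reject unmapped networks are all assumptions for illustration and do not reflect the actual configuration format.

```python
def map_networks(nic_networks, mapping):
    """Translate each NIC's source VM network to its target VM network."""
    missing = [name for name in nic_networks if name not in mapping]
    if missing:
        # Assumption for this sketch: an unmapped network is treated as an error.
        raise KeyError(f"no mapping rule for: {', '.join(missing)}")
    return [mapping[name] for name in nic_networks]


rules = {"prod-vlan-10": "dr-vlan-110", "prod-vlan-20": "dr-vlan-120"}
print(map_networks(["prod-vlan-10", "prod-vlan-20"], rules))
# ['dr-vlan-110', 'dr-vlan-120']
```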

IP mapping

A configuration rule designed to solve the problem of IP address mismatch between the original virtual machine and the replica virtual machine in different network environments.

Since the replica virtual machine and the original virtual machine are often in different network environments, directly copying the IP address of the original virtual machine to the replica virtual machine may result in network connection issues in the target cluster. By configuring IP mapping rules, the replication service can automatically set the correct IP address, subnet mask, and gateway for the replica virtual machine during failover using Arcfra VMTools. This ensures that the replica virtual machine can provide services with the expected network configuration during failover without manually updating the IP address of the replica virtual machine. During failback, the replication service modifies the IP configuration of the original virtual machine to the IP configuration corresponding to the restore point selected at the time of failover.
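
One plausible shape for an IP mapping rule is a subnet-to-subnet translation that keeps the host part of the address. The sketch below uses Python's standard `ipaddress` module to show the arithmetic. The rule format, the `map_ip` function, and the addresses are hypothetical; the actual rule syntax is defined in the replication plan configuration and may differ.

```python
import ipaddress


def map_ip(original_ip, source_subnet, target_subnet, target_gateway):
    """Map an address from source_subnet into target_subnet, keeping the host offset."""
    src = ipaddress.ip_network(source_subnet)
    dst = ipaddress.ip_network(target_subnet)
    addr = ipaddress.ip_address(original_ip)
    if addr not in src:
        raise ValueError(f"{original_ip} is not in {source_subnet}")
    offset = int(addr) - int(src.network_address)
    if offset >= dst.num_addresses:
        raise ValueError("target subnet is too small for this host offset")
    return {
        "ip": str(dst.network_address + offset),
        "netmask": str(dst.netmask),
        "gateway": target_gateway,
    }


print(map_ip("192.168.10.25", "192.168.10.0/24", "10.20.30.0/24", "10.20.30.1"))
# {'ip': '10.20.30.25', 'netmask': '255.255.255.0', 'gateway': '10.20.30.1'}
```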

ACOS cluster

The ACOS (Arcfra Cloud Operating System) cluster is a logical concept. In the production environment, an ACOS cluster consists of at least three nodes interconnected through a network.

AVE

AVE (Arcfra Virtualization Engine) is the KVM-based hypervisor used in the hyper-converged software when ACOS is deployed directly on servers. Alongside essential functions like virtual machine lifecycle management, power operations, high availability, and cold or hot migration, AVE, when integrated with the storage component ABS (Arcfra Block Storage), offers advanced virtual machine services like sub-second snapshots, templates, and clones.

AVE is designed to be straightforward and user-friendly, allowing all operations within the ACOS hyper-converged cluster to be carried out seamlessly through AOC.

AOC

Arcfra Operation Center is Arcfra's multi-cluster centralized resource management platform, designed to manage multiple Arcfra clusters and system services within and across data centers. The platform also provides complete and standard RESTful APIs and SDKs in multiple languages for managing the infrastructure.