    ABDR 2.1.0

Basic concepts

This chapter explains the technical terms involved in Arcfra Backup & Disaster Recovery. If you encounter any unfamiliar concepts while reading, you can find their definitions and detailed explanations here.

Backup and recovery

Backup service

The backup service is responsible for managing backup plans and performing backup and recovery jobs for virtual machines.

Backup resources

Backup resources include backup plans, the virtual machines added to backup plans, backup files, restore points, and backup repositories.

Backup plan

Backup plans are used to plan and schedule data backup operations in detail. In a backup plan, you define the backup objects, backup service, backup repository, backup schedule, backup window, and retention policy for restore points. This ensures that data is backed up reliably and can be recovered in failure scenarios, which supports service continuity and reliability.

Backup object

Virtual machines to be backed up.

Backup schedule

The frequency and time at which the backup plan is automatically executed. It specifies how often the backup service backs up virtual machines. A shorter backup interval allows for a smaller RPO (Recovery Point Objective). For example, if the backup plan runs every 15 minutes, the maximum potential data loss in a disaster is limited to 15 minutes.
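The relationship between the schedule interval and the potential data loss can be sketched as follows. This is a minimal illustration, not product code: the function name `data_loss_at` is hypothetical, and it assumes that backups run at a fixed interval and that each backup completes instantly at its scheduled time.

```python
from datetime import datetime, timedelta

def data_loss_at(disaster: datetime, first_backup: datetime, interval: timedelta) -> timedelta:
    """Return how much recent data would be lost if a disaster struck at `disaster`.

    Assumption (hypothetical): backups run at first_backup, first_backup + interval, ...
    and each one completes instantly.
    """
    if disaster < first_backup:
        raise ValueError("no backup exists before the disaster")
    elapsed = disaster - first_backup
    # Time elapsed since the most recent scheduled backup.
    return elapsed % interval

start = datetime(2024, 1, 1, 0, 0)
loss = data_loss_at(datetime(2024, 1, 1, 0, 44), start, timedelta(minutes=15))
print(loss)  # time since the 00:30 backup
```

Under these assumptions the loss is always smaller than the interval, which is why a shorter interval gives a smaller RPO.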

Backup window

The time range during which the backup plan is allowed to execute automatically.
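As an illustration only, a window check could look like the sketch below. The function `in_window` and the representation of a window as a start and end time of day are assumptions made for this example. In particular, treating a window whose end is earlier than its start as crossing midnight is a design choice of the sketch, not documented product behavior.

```python
from datetime import time

def in_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside the window [start, end).

    A window whose end is earlier than its start is treated as crossing midnight.
    """
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, for example 22:00 to 06:00.
    return now >= start or now < end

# A nightly window from 22:00 to 06:00.
print(in_window(time(23, 30), time(22, 0), time(6, 0)))
print(in_window(time(12, 0), time(22, 0), time(6, 0)))
```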

Restore point

A restore point in the backup and recovery function is the virtual machine data state generated by each backup operation and stored as files in the backup repository. It fully records the virtual machine data and configuration, reflecting the point-in-time state when the virtual machine was backed up.

Restore points allow virtual machines to quickly recover to a specific data state when needed. For example, in a disaster recovery scenario, you can restore a virtual machine to a previous point in time so that it resumes normal operation. Generating restore points regularly keeps the backed-up data of the virtual machine current, which supports service continuity and availability. You can also use the backup files corresponding to a restore point to rebuild a virtual machine in the state of that point in time for development, testing, and other purposes.

ACOS cluster

The ACOS (Arcfra Cloud Operating System) cluster is a logical concept. In a production environment, an ACOS cluster consists of at least three nodes interconnected through a network.

AVE

AVE (Arcfra Virtualization Engine) is the ACOS native virtualization platform. It refers to the KVM-based hypervisor used in the hyper-converged software when ACOS is deployed directly on servers. In addition to essential functions such as virtual machine lifecycle management, power operations, high availability, and cold or hot migration, AVE offers advanced virtual machine services such as sub-second snapshots, templates, and clones when integrated with the storage component ABS (Arcfra Block Storage).

AVE is designed to be straightforward and user-friendly, allowing all operations within the ACOS hyper-converged cluster to be carried out seamlessly through AOC.

AOC

AOC (Arcfra Operation Center) is Arcfra's multi-cluster centralized resource management platform, designed to manage multiple Arcfra clusters and system services within and across data centers. The platform also provides complete, standard RESTful APIs and SDKs in multiple languages for managing the infrastructure.

Replication and recovery

Replication service

The replication service is responsible for managing replication plans and performing jobs such as replication, failover, failback, and permanent failover for virtual machines.

Depending on the associated cluster, the replication service is divided into target replication service and source replication service.

  • Target replication service

    The replication service associated with the target cluster. It is responsible for executing replication plans, performing failover, and managing replication resources. During failback, it reads data from the target cluster and transmits it to the source replication service. The target replication service also keeps replicated data synchronized, guaranteeing data consistency between replica virtual machines and original virtual machines.

  • Source replication service

    The replication service associated with the source cluster, which is responsible for reading data from the source cluster and transmitting it to the target replication service. During failback, it receives data from the target replication service and writes it to the source cluster.

Site

A site refers to a failure domain managed by a single AOC, which can include one or more clusters and data centers. The resources of each site are managed by its dedicated AOC and are completely isolated in terms of geographical location, hardware devices, and software services, without sharing any resources with other sites.

Sites can be divided into the following two types:

  • Source site: The source site is usually the site where the business is actually running. The replication service replicates virtual machines from the source site to replica objects at the target site to achieve data synchronization.
  • Target site: The target site is typically used for disaster recovery, storing replica objects replicated from the source site.

Replication plan types

  • Intra-site replication

    The source cluster and target cluster are located in the same site. When the original virtual machine is unavailable, you can fail over to the replica object in another cluster within the site to achieve cluster-level disaster recovery. If the AOC managing the site fails, failover is not possible.

  • Cross-site replication

    The source cluster and target cluster are located in different sites. When the original virtual machine in the source site is unavailable, you can fail over to replica objects at the target site regardless of whether the AOC managing the source site has failed, achieving site-level disaster recovery.

    Cross-site replication includes the following two types:

    • Incoming replication: The current site is the target site, and the source site is another site.
    • Outgoing replication: The current site is the source site, and the target site is another site.
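The distinctions above can be expressed as a small classification sketch. The function `classify_replication` and the use of plain strings as site identifiers are hypothetical and serve only to illustrate the definitions.

```python
def classify_replication(source_site: str, target_site: str, current_site: str) -> str:
    """Classify a replication plan as seen from `current_site`."""
    if source_site == target_site:
        # Both clusters belong to the same site.
        return "intra-site"
    if current_site == target_site:
        # Data arrives at the current site from another site.
        return "cross-site (incoming)"
    if current_site == source_site:
        # Data leaves the current site for another site.
        return "cross-site (outgoing)"
    raise ValueError("current site is neither the source nor the target site")

print(classify_replication("site-a", "site-b", current_site="site-b"))
```

The same cross-site plan is therefore outgoing when viewed from its source site and incoming when viewed from its target site.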

Source cluster

Typically a production cluster. It is the cluster to which the original virtual machine belongs, hosting actual applications and data, and located at the source site.

Target cluster

Typically a disaster recovery cluster. It is a cluster that stores replica objects, located at the target site.

Virtual machine

A virtual machine is a computer system implemented through virtualization technology. It runs on the virtualization platform of a physical server and behaves like a physical computer. All virtual machines on a physical server share the underlying physical hardware resources, but each virtual machine has its own independent operating system, applications, and resource configuration. Virtual machines are isolated from each other, run without affecting one another, and can be used for testing, development, and production environments.

When replicating virtual machines, two virtual machine roles are involved:

  • Original virtual machine

    Virtual machines that need disaster recovery protection, responsible for hosting actual applications and data.

  • Replica virtual machine

    A complete copy of the original virtual machine created through a replication plan. Replica virtual machines are typically used for disaster recovery. In the event of a failure of the original virtual machine, the replica virtual machine can be swiftly started and take over the responsibilities of the original one.

Replication resources

Replication resources include replication plans, replication objects, and restore points.

Replication object

An original virtual machine and its replica virtual machine are called a set of replication objects.

Replica object

The replica virtual machine.

Replication plan

Replication plans are used to plan and schedule data replication operations. Through replication plans, users can define information such as replication objects, replication type, replication site, target cluster, replication service, replication schedule, replication window, retention policy, replica objects, and more.

Replication schedule

The frequency and time at which the replication plan is automatically executed. It specifies the time interval between data replications. A shorter replication interval allows for a smaller RPO (Recovery Point Objective). For example, if the replication plan runs every 15 minutes, the maximum potential data loss in a disaster is limited to 15 minutes.

Replication window

The time range during which the replication plan is allowed to execute automatically.

Asynchronous replication

The replication method used by the replication service to replicate data from the source to the destination. In asynchronous replication, data from the source is not transmitted to the destination in real time. Instead, it is transmitted periodically.
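The toy model below illustrates the idea of periodic, non-real-time transfer. It is a conceptual sketch only: the class `AsyncReplicaPair` and its methods are hypothetical, and it copies the whole source state on each cycle, which says nothing about how the replication service actually moves data.

```python
class AsyncReplicaPair:
    """Toy model: a source store and a replica that is only updated periodically."""

    def __init__(self) -> None:
        self.source: dict[str, str] = {}
        self.replica: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        # Writes land on the source only; nothing is sent in real time.
        self.source[key] = value

    def replicate(self) -> None:
        # One scheduled replication cycle: copy the current source state.
        self.replica = dict(self.source)

    def unreplicated_keys(self) -> set[str]:
        # Keys whose latest value has not reached the replica yet.
        return {k for k, v in self.source.items() if self.replica.get(k) != v}

pair = AsyncReplicaPair()
pair.write("a", "1")
pair.replicate()       # "a" reaches the replica
pair.write("b", "2")   # written after the last cycle
print(pair.unreplicated_keys())
```

Anything written after the most recent cycle exists only on the source, which is the data at risk if the source fails before the next cycle.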

Restore point

A restore point in the replication and recovery function is a specific point in time from which data can be restored. It is a snapshot created during replication, failover (in immediate restore point generation mode), or failback and saved on the replica object.

Restore points allow replica virtual machines to quickly roll back to a specific data state when needed. For example, in a disaster recovery scenario, you can restore a virtual machine to a previous point in time to resume business operations. Generating restore points regularly keeps the data in the replica object up to date. The snapshots corresponding to restore points can also be used to rebuild a replica object at a specific point in time, enabling development and testing without affecting production workloads.

Disaster recovery

Recovery operations that can be performed after a failure occurs, including failover, failback, and permanent failover.

  • Failover

    Failover is the operation performed when the original virtual machine becomes unavailable. The replica object is rolled back to a state consistent with the specified restore point and continues providing services. Failover allows services to be quickly restored to a state from before the failure, ensuring application continuity and stability.

  • Failback

    Failback refers to the process of transferring the data that changed during the failover period back to the original virtual machine when it returns to normal use, and then switching back to using the original virtual machine. Failback aims to synchronize data between replication objects, ensuring that the original virtual machine can continue to provide services after the failure is resolved.

  • Permanent failover

    Permanent failover is a measure taken when the original virtual machine cannot recover from the failure. In this case, the replica object formally takes over the service role of the original virtual machine to ensure continuous operation and availability of the business.
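Failover rolls the replica object back to a specified restore point. One way to pick that point, sketched here under assumptions, is to take the newest restore point created at or before a chosen moment, such as the last time the data was known to be good. The helper `latest_restore_point` and the representation of restore points as a sorted list of timestamps are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional, Sequence

def latest_restore_point(points: Sequence[datetime], not_after: datetime) -> Optional[datetime]:
    """Return the newest restore point taken at or before `not_after`, or None.

    `points` must be sorted in ascending order.
    """
    idx = bisect_right(points, not_after)
    return points[idx - 1] if idx > 0 else None

points = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 15),
    datetime(2024, 1, 1, 10, 30),
]
# The failure was noticed at 10:20, so fail over to the 10:15 restore point.
print(latest_restore_point(points, datetime(2024, 1, 1, 10, 20)))
```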

ACOS cluster

The ACOS (Arcfra Cloud Operating System) cluster is a logical concept. In a production environment, an ACOS cluster consists of at least three nodes interconnected through a network.

AVE

AVE (Arcfra Virtualization Engine) refers to the KVM-based hypervisor used in the hyper-converged software when ACOS is deployed directly on servers. In addition to essential functions such as virtual machine lifecycle management, power operations, high availability, and cold or hot migration, AVE offers advanced virtual machine services such as sub-second snapshots, templates, and clones when integrated with the storage component ABS (Arcfra Block Storage).

AVE is designed to be straightforward and user-friendly, allowing all operations within the ACOS hyper-converged cluster to be carried out seamlessly through AOC.

AOC

AOC (Arcfra Operation Center) is Arcfra's multi-cluster centralized resource management platform, designed to manage multiple Arcfra clusters and system services within and across data centers. The platform also provides complete, standard RESTful APIs and SDKs in multiple languages for managing the infrastructure.