[0063] Embodiment 1
[0064] Referring to Figure 1, which shows a schematic structural diagram of the disaster recovery system of this embodiment, the disaster recovery system includes the following parts:
[0065] A. The centralized control management unit S100 uses cluster management to manage the local clusters and remote clusters centrally. It monitors local backup, remote replication, and post-disaster recovery as a whole, and promptly analyzes and processes the operating information reported by each management unit. It also performs the initial configuration and condition setting of each management unit when the disaster recovery device is first installed.
[0066] The unit can switch a failed site over to an available site, either automatically or manually. It can also issue comparison and upload commands: it compares the data at each site with the data in the data storage management unit and, if they are inconsistent, the inconsistent data can be uploaded to the local site (the comparison and upload commands can be set to run manually or automatically).
[0067] Once the production site fails, the centralized control management unit S100 comprehensively evaluates recovery time and efficiency, issues the most cost-effective recovery plan, and quickly switches services between the local cluster and the remote cluster to ensure business continuity. In addition, so that data can be synchronized automatically to a site once its interruption has been resolved, the centralized control management unit S100 decides, based on the interruption time and the business volume added in the meantime, whether to upload data to the restored site incrementally or to re-upload the latest data in full (automatic synchronization is the default: synchronization starts automatically when a site is found to have re-established normal communication).
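The choice between incremental and full re-synchronization described above can be illustrated with a minimal Python sketch. The source names the two inputs (interruption time and added business volume) but gives no decision rule or values, so the thresholds, the `Outage` type, and the function name `choose_resync_mode` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the source states the inputs but not the limits.
MAX_INCREMENTAL_OUTAGE_S = 3600
MAX_INCREMENTAL_CHANGED_BLOCKS = 10_000


@dataclass
class Outage:
    duration_s: float    # how long the site was interrupted
    changed_blocks: int  # business data added or changed in the meantime


def choose_resync_mode(outage: Outage) -> str:
    """Return "incremental" for a short, light outage and "full" otherwise."""
    if (outage.duration_s <= MAX_INCREMENTAL_OUTAGE_S
            and outage.changed_blocks <= MAX_INCREMENTAL_CHANGED_BLOCKS):
        return "incremental"
    return "full"
```

Under these assumed limits, a one-minute interruption with a handful of changed blocks is resynchronized incrementally, while a long interruption or a large backlog triggers a full re-upload.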
[0068] B. The data backup management unit S101 at the production site mainly completes the local backup of data and provides the data baseline for remote mirroring. The unit includes four functional modules: a backup management module, a reliability verification module, a monitoring and analysis module, and an exception handling module.
[0069] b1) Backup management module: data is backed up with the volume as the basic unit, using a first-in, first-out queue. If an interruption occurs, a marker is inserted at the breakpoint, that is, a label is placed locally to identify the position of the changed data blocks in each data volume.
[0070] b2) Reliability verification module: verifies the correctness of the backup data, that is, its consistency with the current local data. It also tests the recoverability of the data to ensure data reliability.
[0071] b3) Monitoring and analysis module: analyzes errors in the backup process, displays the cause of each error and the related solutions, hands errors that can be processed automatically to the exception handling module, and records the operation in the operation log. This module also monitors the running status of the related processes during backup to check whether the backup is running normally; if a process dies or exits, the exception handling module is invoked to start the process again.
[0072] b4) Exception handling module: handles exceptions during the backup process. When an exception occurs, the backup can be rolled back to the last state without an exception and resumed (only the data whose backup had not completed is backed up again); other abnormal situations are handled according to the solutions provided by the error analysis of the monitoring and analysis module. If the device cannot handle an exception automatically, the user is notified to handle it manually.
[0073] The above four functional modules cooperate to ensure that data backup completes and that the reliability of the data is guaranteed.
[0074] C. The remote storage and forwarding management unit S102 mainly handles the unified distribution of commands and data. First, it is responsible for remote replication of the database data, ensuring data reliability during remote replication and sending the data concurrently to each standby site; second, it forwards commands between modules. This unit includes three parts: a data resource pool, a remote reliability verification module, and a command processing module.
[0075] c1) Data resource pool: stores synchronously mirrored data on multiple virtual modules. No two virtual modules occupy the same disk partition, a shared virtual module is reserved, and the data is mapped to the data storage management unit for storage. The data resource pool receives mapped data from the production center, and the remote data resource pool can be shared by multiple disaster recovery centers. The data resource pool saves only the latest complete backup, to facilitate rapid recovery in the event of a disaster; all historical data is stored in the data storage management unit.
[0076] When data is backed up, the first backup is a full backup and subsequent backups are incremental, that is, only the changed data is backed up; the incremental backups are consolidated and written to the corresponding storage device (disk data is backed up to tape). Near-line copy technology is also used during backup, that is, data of different value is stored on storage media of different performance (price), which also makes it easier to protect important data first in an emergency. In addition, data is stored on the virtual modules as business data and non-business data, and business data is further classified into important business data and light business data, so that the main business can be restored first in the event of a disaster.
[0077] c2) Remote reliability verification module: verifies the correctness of the backup data, that is, the consistency between the local backup data and the data in the production center. It also tests the recoverability of the data to ensure data reliability. Once the data passes reliability verification, it is sent concurrently to each standby site to facilitate rapid local recovery.
[0078] c3) Command processing module: handles request responses and command forwarding between communication entities (the communication entities are the management units, the production site, and the standby sites). This module is responsible for scheduling commands and records them in a linked list. The commands include commands issued by the centralized control management unit, commands reported by the data storage management unit, and commands exchanged between sites. A command that expects a response is registered in the linked list when it is issued, so that the corresponding scheduled command can be found when the response is fed back.
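The linked-list bookkeeping of the command processing module might look like the following Python sketch. It only illustrates the idea of registering an issued command and finding it again when the response arrives; the field names, the `PendingCommands` class, and the use of a numeric `command_id` to match responses are assumptions, not details given in the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandNode:
    command_id: int  # hypothetical key used to match a response to its command
    source: str      # issuing entity, e.g. "S100"
    target: str      # receiving entity, e.g. "S104"
    payload: str
    next: Optional["CommandNode"] = None


class PendingCommands:
    """Singly linked list of commands that are still waiting for a response."""

    def __init__(self) -> None:
        self.head: Optional[CommandNode] = None

    def register(self, node: CommandNode) -> None:
        # Insert at the head: O(1) registration when a command is issued.
        node.next = self.head
        self.head = node

    def resolve(self, command_id: int) -> Optional[CommandNode]:
        # Find the command a response belongs to and unlink it from the list.
        prev, cur = None, self.head
        while cur is not None:
            if cur.command_id == command_id:
                if prev is None:
                    self.head = cur.next
                else:
                    prev.next = cur.next
                cur.next = None
                return cur
            prev, cur = cur, cur.next
        return None
```

When a response comes back, `resolve` returns the original command so that the response can be routed to its `source`; an unknown or already answered `command_id` yields `None`.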
[0079] D. The disaster recovery management unit S103 is mainly used, when the production site (also referred to as the active site) fails, to automatically enable the standby site with the highest priority as the active site and to quickly recover important data.
[0080] Furthermore, the virtual module restores a single physical disaster target, or automatically saves data in the event of a power failure (it switches automatically to the backup system and can automatically move the work of the virtual module to another physical server). Once the disaster recovery mechanism is activated, the corresponding virtual machine disaster recovery site takes over the work of the failed site's equipment, and its workflow can be customized.
[0081] E. The data storage management unit S104 is mainly used for data storage and dumping. This unit includes two parts:
[0082] e1) Control module: monitors and uniformly manages data storage and verifies the reliability of the data; it also responds to the data comparison and upload commands issued to it.
[0083] e2) Data storage module: stores business data and non-business data, and classifies business data into important business data and light business data so that the main business can be restored first in the event of a disaster. When the amount of stored data becomes too large, snapshot technology is used to back the data up to a tape library or optical disc library.
[0084] In the above system, each management unit must feed back a notification of its execution status, forwarded through the remote storage and forwarding management unit, to ensure that data is transmitted correctly from one link to the next.
[0085] Any site may be promoted to the primary site. Therefore, both the primary site and the standby sites should have the functions of a data backup management unit and a disaster recovery management unit. In the figure, S101' and S103' are the units that are started when, after the failed production site has been restored, the data of the site currently serving as the production site is synchronized to the local site. The remote storage and forwarding process works on the same principle but is not marked separately, that is, S102 and S102' are the same.
[0086] In the event of a disaster, when the standby sites detect that the production site is operating abnormally or has lost communication, the standby site with the highest priority automatically takes over the work of the production site. When the production site returns to normal working status or its communication returns to normal, the data must be synchronized to the production site and master control returned to it.
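The takeover and hand-back rule can be sketched in Python as follows. This is a simplified illustration, not the patented mechanism: the `Site` type, the convention that a lower `priority` number means higher priority, and the function name `elect_active` are assumptions, and the data synchronization that must precede the hand-back is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    name: str
    priority: int         # assumption: a lower number means a higher priority
    healthy: bool = True


def elect_active(production: Site, standbys: List[Site]) -> Site:
    """Pick the site that should serve traffic right now."""
    if production.healthy:
        # Master control stays with, or returns to, the production site.
        return production
    candidates = [s for s in standbys if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy standby site available")
    return min(candidates, key=lambda s: s.priority)
```

Calling `elect_active` again after the production site has recovered returns the production site, which corresponds to returning master control once synchronization is complete.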
[0087] Referring to Figure 2, which shows the disaster recovery method of the above disaster recovery system, the method proceeds as follows:
[0088] Step S200: The data backup management unit backs up the local data.
[0089] During backup, the data of the production site can be written online into a virtual module of the local storage device, and the local primary and secondary sites can access the storage device at the same time, that is, share the same data pool. Dual backup storage devices can also be provided so that a backup storage device can be activated when one storage device fails. No two of the above virtual modules occupy the same disk partition, and shared virtual modules are reserved to improve protection. The virtual module restores a single physical disaster target, or automatically saves data in the event of a power failure (it switches automatically to the backup system and can automatically move the work of the virtual module to another physical service module).
[0090] When data is backed up, the first backup is a full backup and subsequent backups are incremental, that is, only the changed data is backed up; the incremental backups are consolidated and written to the corresponding storage device (disk data is backed up to tape).
[0091] The return arrow in the figure means that when the local backup fails, the system re-executes that backup.
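The full-then-incremental scheme of step S200 can be illustrated with a short Python sketch. The source does not say how changed data is detected; comparing a SHA-256 digest per block against an index kept from the previous run is an assumption chosen for illustration, as are the names `backup` and `block_digest`. Deleted blocks are not handled.

```python
import hashlib
from typing import Dict


def block_digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def backup(volume: Dict[int, bytes], index: Dict[int, str]) -> Dict[int, bytes]:
    """Return the blocks that must be written and update the index in place.

    An empty index means there is no earlier backup, so every block is
    returned (full backup). Later calls return only changed blocks.
    """
    changed: Dict[int, bytes] = {}
    for block_no, data in volume.items():
        digest = block_digest(data)
        if index.get(block_no) != digest:
            changed[block_no] = data
            index[block_no] = digest
    return changed
```

The first call with an empty `index` copies the whole volume; after one block is modified, the next call returns only that block.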
[0092] Step S201: The operation steps are recorded in sequence. A first-in, first-out queue is used to fix the order of the backup data. If an interruption occurs, an identifier is inserted at the breakpoint, that is, a label is placed locally to identify the position of the changed data blocks in each data volume.
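A minimal Python sketch of the first-in, first-out queue with a breakpoint marker is given here for illustration. The representation of a queue entry as a (volume, block number) pair, the `Breakpoint` type, and the use of `OSError` to stand for an interruption are assumptions; the source only states that order is preserved and that a label records where the interruption occurred.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional, Tuple

Block = Tuple[str, int]  # (volume name, block number)


@dataclass
class Breakpoint:
    volume: str
    block_no: int


def drain(queue: Deque[Block],
          copy_block: Callable[[str, int], None]) -> Optional[Breakpoint]:
    """Copy queued blocks in FIFO order; return a marker if interrupted."""
    while queue:
        volume, block_no = queue[0]  # peek at the oldest entry
        try:
            copy_block(volume, block_no)
        except OSError:
            # The entry stays at the front, so a later call resumes here.
            return Breakpoint(volume, block_no)
        queue.popleft()
    return None
```

Because the interrupted entry is left at the head of the queue, calling `drain` again continues from the breakpoint without repeating blocks that were already copied.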
[0093] Step S202: Perform reliability verification on the backup data.
[0094] Reliability verification means verifying the correctness of the backup data, that is, the consistency between the backup data and the current local data. It also includes testing the recoverability of the data to ensure data reliability. After reliability verification, the result must be returned. If verification fails, the failure is reported back to step S200 and the backup operation is performed again.
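The verify-and-retry loop between steps S200 and S202 can be sketched in Python as follows. The source does not specify how consistency or recoverability is checked; the SHA-256 comparison, the caller-supplied `restore` function, and the retry limit `max_attempts` are assumptions made for illustration.

```python
import hashlib
from typing import Callable


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_backup(source: bytes, backup: bytes,
                  restore: Callable[[bytes], bytes]) -> bool:
    """Check consistency with the source, then check recoverability."""
    if digest(source) != digest(backup):
        return False
    return restore(backup) == source


def backup_with_verification(source: bytes,
                             do_backup: Callable[[bytes], bytes],
                             restore: Callable[[bytes], bytes],
                             max_attempts: int = 3) -> int:
    """Repeat the backup until it verifies; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        copy = do_backup(source)
        if verify_backup(source, copy, restore):
            return attempt
    raise RuntimeError("backup failed verification on every attempt")
```

A failed verification sends control back to the backup step, which mirrors the return path described above; the retry limit is only there so the sketch cannot loop forever.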
[0095] Step S203: Taking the data backed up by the data backup management unit as a reference, the remote storage and forwarding management unit remotely mirrors the backup data.
[0096] To ensure the reliability of data transmission during remote replication, a first-in, first-out queue is used to preserve the replication order, which is the same as the backup order. If an interruption occurs, an identifier is inserted at the breakpoint, that is, a label is placed locally to identify the position of the changed data blocks in each data volume.
[0097] In addition, remote replication is volume-based and uses an IP-based interconnection protocol to replicate the information of the production site to the remote cluster over the existing TCP/IP network.
[0098] The production site and the remote storage and forwarding management unit use a dedicated network to transmit backup data in a preset manner, and use heartbeats to monitor the status of each site. Once an abnormality occurs, the disaster recovery mechanism is activated automatically. Redundant channels are also provided in the connections between equipment in the local and remote systems, so that they can take over promptly when a working channel fails.
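Heartbeat monitoring of this kind is commonly implemented by recording the time of the last heartbeat from each site and flagging any site whose heartbeat is older than a timeout. The Python sketch below illustrates that idea only; the class name `HeartbeatMonitor`, the timeout value, and passing the clock in as a parameter are assumptions, since the source gives no heartbeat interval or protocol.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Track the last heartbeat per site and report sites that went silent."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def beat(self, site: str, now: float) -> None:
        # Called whenever a heartbeat from the site arrives.
        self.last_seen[site] = now

    def failed_sites(self, now: float) -> List[str]:
        # A site is considered abnormal once its last heartbeat is too old.
        return sorted(site for site, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)
```

A non-empty result from `failed_sites` would be the trigger for activating the disaster recovery mechanism; the time is passed in explicitly so the logic can be exercised without waiting.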
[0099] Step S204: The remote storage and forwarding management unit performs reliability verification on the backup data.
[0100] The reliability verification here means checking the correctness of the data mirrored to the remote cluster, that is, the consistency between the data at the production site and the data backed up by the cluster. The recoverability of the data must also be tested to ensure data reliability.
[0101] After reliability verification, the result must be returned. If verification fails, the failure is reported back to step S203, the operation of mirroring the locally stored data to the remote cluster is performed again, and a mark is made in the record.
[0102] Step S205: The remote storage and forwarding management unit maps the data to the data storage management unit for storage, and simultaneously copies the data to multiple virtual modules.
[0103] In this step, the remote storage and forwarding management unit maps the backup data to the data storage management unit for storage and stores the backup data on multiple virtual modules. No two virtual modules occupy the same disk partition, and a shared virtual module is reserved. After the data is verified as correct, it is sent concurrently to each standby site.
[0104] If the amount of data stored in the data storage management unit becomes too large, it is backed up to a tape library or an optical disc library using snapshot technology.
[0105] Step S206: The remote storage and forwarding management unit verifies the reliability of the copied data on each virtual module.
[0106] Step S207: The remote storage and forwarding management unit forwards the data to each standby site so that the standby sites are updated synchronously.
[0107] Step S208: Each standby site performs reliability verification, and the data storage management unit performs reliability verification on the data backed up locally.
[0108] The sites are connected by a "handshake line", which can monitor the heartbeat and also verify the consistency of the data.
[0109] Step S209: The backup data is quickly restored when a catastrophic failure occurs.
[0110] In this step, the centralized control management unit comprehensively evaluates recovery time and efficiency, issues the most economical and efficient recovery plan, and quickly switches services between the local cluster and the remote cluster to ensure business continuity. For example: if only the data fails, the backup data is restored; if there is a local physical failure, the backup device can be enabled; if a virtual module has a problem, another virtual module takes over its work while the backup virtual module is enabled; if the data center is damaged by a natural disaster, a remote disaster recovery cluster takes over the work of the data center.
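The four example cases above amount to a mapping from failure type to recovery action, which the following Python sketch makes explicit. It covers only the examples listed in this paragraph and does not model the evaluation of recovery time and cost; the enum `Failure`, the table `RECOVERY_ACTIONS`, and the function `recovery_plan` are names invented for illustration.

```python
from enum import Enum


class Failure(Enum):
    DATA = "data"                      # only the data is damaged
    LOCAL_PHYSICAL = "local_physical"  # a local device has failed
    VIRTUAL_MODULE = "virtual_module"  # one virtual module has a problem
    DATA_CENTER = "data_center"        # the data center is lost in a disaster


# One action per failure type, taken from the examples in the text.
RECOVERY_ACTIONS = {
    Failure.DATA: "restore the backup data",
    Failure.LOCAL_PHYSICAL: "enable the backup device",
    Failure.VIRTUAL_MODULE: "move work to another virtual module and enable the backup virtual module",
    Failure.DATA_CENTER: "switch services to the remote disaster recovery cluster",
}


def recovery_plan(failure: Failure) -> str:
    return RECOVERY_ACTIONS[failure]
```

A fuller implementation would presumably rank several candidate plans by estimated recovery time and cost before choosing one, which is the evaluation the centralized control management unit is described as performing.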
[0111] Centralized control management makes it possible to control the flow of data and the working status of the related links more comprehensively, to evaluate them as a whole and give a cost-effective recovery strategy, and to minimize the economic loss of the enterprise in the event of a disaster.