Disaster tolerance system and disaster tolerance method thereof

A disaster recovery technology built on cooperating management units, applied in the field of network communication. It addresses problems such as slow data recovery, difficulty keeping business systems running continuously, and poor real-time performance, and aims to ensure service continuity.

Active Publication Date: 2010-01-27
ZTE CORP
0 Cites 74 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0005] 2. Disasters caused by man-made causes;
[0020] This method mainly uses backup software to realize backup and tape management. Its disadvantages are obvious. Because tapes are used to stor...

Method used

The disaster recovery system proposed by the present invention is a network disaster recovery system with a distributed structure. It combines network service system disaster recovery (that is, disaster recovery of the store-and-forward management center) with network storage system disaster recovery (that is, disaster recovery of the data storage management center), and integrates the network security mechanism with the network disaster recovery mechanism. The whole system has strong system protection and disaster tolerance capabilities, and also improves the system's service quality. A dedicated storage network synchronously mirrors key data to the data resource pool, so that the data is protected locally and is also verified and backed up at a remote location.
[0073] The above four functional modules cooperate with each other to ensure that the data backup can be completed and the reliability of the data can be guaranteed.
[0084] In the above-mentioned system, each management unit needs to perform feedback and notification on its execution, and forward it through the remote store-and-forward management unit to ensure the correctness of data transmission from the previous link to the next link.
[0110] In this step, the centralized control and management unit comprehensively evaluates the recovery time and efficiency, issues the most economical and efficient recovery plan, and simultaneously quickly switches services between the local cluster and the remote cluster to ensure service continuity. For example: if only the data fails, just restore the backup data; if it is a local physical failure, you can activate the backup device; if there is a problem with a certain virtual module, let other virtual modules take over its work and activate the backup virtual module. If the data ...

Abstract

The invention discloses a disaster tolerance system and a disaster tolerance method thereof. The method comprises the following steps: (1) a data backup management unit carries out local backup on data of a current production site, records the sequence of the backup data and tests the reliability thereof; (2) a remote store-and-forward management unit carries out remote replication and storage on the backup data; (3) the remote store-and-forward management unit tests the reliability of the backup data, maps the backup data into a data storage management unit to be stored after success, and simultaneously transmits the backup data into each standby production site; and (4) when the production site suffers catastrophic failures, the disaster tolerance management unit starts the standby production site, and carries out the fast recovery of data. The disaster tolerance system of the invention adopts a distributed disaster tolerance system structure: network service system disaster tolerance and network storage system disaster tolerance, and integrates a network security mechanism with a network disaster tolerance mechanism; and the entire system has strong system protection and disaster tolerance capability, and improves the service quality thereof.


Examples


Example Embodiment

[0063] Example one
[0064] See Figure 1, which shows a schematic structural diagram of the disaster tolerance system of this embodiment. As shown in the figure, the disaster recovery system includes the following parts:
[0065] A. The centralized control management unit S100 uses cluster management and manages the local and remote clusters centrally. It performs overall monitoring of local backup, remote replication, and recovery in the event of a disaster, and analyzes and processes the operation information reported by each management unit in a timely manner. It also performs the initial configuration and condition setting of each management unit when the system disaster recovery device is first installed.
[0066] The unit can switch a failed site to an available site automatically or manually. It can also issue upload comparison and upload commands: it compares whether the data in each site is consistent with the data in the data storage management unit and, if not, the inconsistent data can be uploaded to the local site. The upload comparison and data upload commands can be set to run manually or automatically.
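The upload comparison in this paragraph can be pictured as a digest comparison between a site and the data storage management unit. The following is a minimal sketch, assuming each side can be viewed as a mapping from volume name to content. The function names `digest`, `find_inconsistent_volumes`, and `upload_inconsistent` are hypothetical and do not come from the patent.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a SHA-256 hex digest used as the comparison key."""
    return hashlib.sha256(content).hexdigest()


def find_inconsistent_volumes(site: dict[str, bytes], store: dict[str, bytes]) -> list[str]:
    """List volumes whose content at the site differs from the stored copy.

    A volume that is missing at the site also counts as inconsistent.
    """
    stale = []
    for name, stored in store.items():
        local = site.get(name)
        if local is None or digest(local) != digest(stored):
            stale.append(name)
    return sorted(stale)


def upload_inconsistent(site: dict[str, bytes], store: dict[str, bytes]) -> list[str]:
    """Copy every inconsistent volume from the store to the site (hypothetical helper)."""
    stale = find_inconsistent_volumes(site, store)
    for name in stale:
        site[name] = store[name]
    return stale
```

Running the comparison first and the upload second mirrors the two separate commands described in the text, so an operator could inspect the list before allowing the upload.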
[0067] Once the production site fails, the centralized control management unit S100 comprehensively evaluates the recovery time and efficiency, issues the most cost-effective recovery plan, and quickly switches services between the local cluster and the remote cluster to ensure business continuity. Moreover, so that data can be synchronized automatically to a site once its interruption is resolved, the centralized control management unit S100 decides, according to the interruption time and the increase in business volume, whether to upload data to the restored site incrementally or to re-upload the latest data in full. The default is automatic synchronization, which is performed as soon as a site is found to have re-established normal communication.
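The choice between incremental and full re-upload could be sketched as a simple threshold rule. The patent names only the two inputs (interruption time and increase in business volume), so the thresholds, the `ResyncPolicy` class, and the function `choose_resync_mode` below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResyncPolicy:
    # Both thresholds are illustrative assumptions, not values from the patent.
    max_outage_seconds: float = 3600.0
    max_changed_fraction: float = 0.3


def choose_resync_mode(outage_seconds: float, changed_bytes: int, total_bytes: int,
                       policy: ResyncPolicy = ResyncPolicy()) -> str:
    """Return "incremental" or "full" for a site that has just reconnected."""
    if total_bytes <= 0:
        return "full"
    changed_fraction = changed_bytes / total_bytes
    if outage_seconds > policy.max_outage_seconds:
        return "full"          # long outage: re-upload the latest data in full
    if changed_fraction > policy.max_changed_fraction:
        return "full"          # too much new business data to patch in
    return "incremental"
```

A short outage with little new data keeps the cheaper incremental path, while either a long outage or a large change set falls back to a full upload.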
[0068] B. The data backup management unit S101 at the production site mainly completes the local backup of data and provides the data baseline for remote mirroring. The unit includes four functional modules: a backup management module, a reliability verification module, a monitoring and analysis module, and an exception handling module.
[0069] b1) Backup management module: Data is backed up with the volume as the basic unit, using a first-in, first-out queue. If an interruption occurs, an identifier is inserted at the breakpoint, that is, a label is placed locally to identify the position of the changed data block in each data volume.
[0070] b2) Reliability verification module: Verifies the correctness of the backup data, that is, its consistency with the current local data. The recoverability of the data is also tested to ensure the reliability of the data.
[0071] b3) Monitoring and analysis module: Mainly analyzes errors in the backup process, displays the cause of each error and related solutions, passes errors that can be handled automatically to the exception handling module, and records the operation in the operation log. This module also monitors the running status of related processes during the backup to check whether the backup is running normally. If a process dies or exits, the exception handling module is asked to start the process again.
[0072] b4) Exception handling module: Handles exceptions during the backup process. When an exception occurs, the backup can be rolled back to the last state without an exception and repeated (only the backup data that was not completed is backed up again). Other abnormal situations are handled according to the solutions provided by the monitoring and analysis module. If an exception cannot be handled automatically by the device, the user is notified to handle it manually.
[0073] The above four functional modules cooperate with each other to ensure that data backup can be completed and the reliability of data can be guaranteed.
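How the four modules might cooperate can be sketched in a single loop: a FIFO of volumes (backup management), a checksum comparison (reliability verification), an operation log (monitoring and analysis), and a bounded retry of only the failed volume (exception handling). This is a minimal sketch under those assumptions. The function `run_backup`, the `copy_volume` callback, and the retry limit are hypothetical.

```python
import hashlib
from collections import deque
from typing import Callable


def run_backup(volumes: dict[str, bytes],
               copy_volume: Callable[[str, bytes], bytes],
               max_retries: int = 2) -> tuple[dict[str, bytes], list[str]]:
    """Back up volumes in FIFO order, verifying each copy and retrying failures.

    Returns the verified backups and an operation log.
    """
    queue = deque(volumes)            # backup management: FIFO over volume names
    attempts = {name: 0 for name in volumes}
    backups: dict[str, bytes] = {}
    log: list[str] = []
    while queue:
        name = queue.popleft()
        attempts[name] += 1
        try:
            copy = copy_volume(name, volumes[name])
            # Reliability verification: the copy must match the live data.
            if hashlib.sha256(copy).digest() != hashlib.sha256(volumes[name]).digest():
                raise ValueError("checksum mismatch")
            backups[name] = copy
            log.append(f"ok {name}")
        except Exception as exc:      # monitoring and analysis: record the cause
            log.append(f"error {name}: {exc}")
            if attempts[name] <= max_retries:
                queue.append(name)    # exception handling: redo only this volume
            else:
                log.append(f"manual {name}")  # hand over to the operator
    return backups, log
```

Volumes that already passed verification are never copied again, which matches the statement that only the unfinished backup data is redone.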
[0074] C. The remote storage and forwarding management unit S102 mainly dispatches commands and data in a unified way. First, it is responsible for remote replication of the database data, ensuring the reliability of the data during remote replication and sending the data to each standby site at the same time. Second, it forwards commands between modules. This unit includes three parts: a data resource pool, a remote reliability verification module, and a command processing module.
[0075] c1) Data resource pool: Mainly stores synchronously mirrored data on multiple virtual modules. No two virtual modules occupy the same disk partition, and shared virtual modules are reserved. The data is mapped to the data storage management unit for storage. The data resource pool receives mapped data from the production center, and the remote data resource pool can be shared by multiple disaster recovery centers. The data resource pool keeps only the latest complete backup to facilitate rapid recovery in the event of a disaster. All historical data is stored in the data storage management unit.
[0076] When backing up data, all data is backed up the first time, and subsequent backups are incremental, that is, only the changed data is backed up, and the incremental backup is integrated and written to the corresponding storage device (for example, backing up disk data to tape). Moreover, near-line copy technology is used during backup, that is, data of different value is stored on storage media of different performance (and price), which also makes it easier to protect important data first in an emergency. In addition, data is stored on the virtual modules separately as business data and non-business data, and business data is further classified into important business data and light business data, so that the main business can be restored first in the event of a disaster.
[0077] c2) Remote reliability verification module: Mainly verifies the correctness of the backup data, that is, the consistency between the locally backed-up data and the data in the production center. The recoverability of the data is also tested to ensure the reliability of the data. Once the data passes reliability verification, it is sent concurrently to each standby site to facilitate rapid local recovery.
[0078] c3) Command processing module: Mainly used for request response and command forwarding between communication entities (the management units, the production sites, and the standby sites). This module is responsible for scheduling commands and records them in a linked list. The commands include those issued by the centralized control management unit, those reported by the data storage management unit, and the interactive commands between the sites. If a command requires a response, it is registered in the linked list so that the corresponding scheduling command can be found when the response is fed back.
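The command bookkeeping in c3 can be illustrated with a small registry of commands that still await a response. This is a sketch only: an insertion-ordered mapping stands in for the linked list named in the text, and the `Command` and `CommandRegistry` names, the numeric command IDs, and the payload strings are assumptions.

```python
from collections import OrderedDict
from dataclasses import dataclass
from itertools import count


@dataclass
class Command:
    command_id: int
    source: str
    target: str
    payload: str


class CommandRegistry:
    """Tracks forwarded commands that still await a response."""

    def __init__(self) -> None:
        self._ids = count(1)
        self._pending: "OrderedDict[int, Command]" = OrderedDict()

    def forward(self, source: str, target: str, payload: str,
                expects_response: bool = True) -> Command:
        cmd = Command(next(self._ids), source, target, payload)
        if expects_response:
            self._pending[cmd.command_id] = cmd   # register until answered
        return cmd

    def on_response(self, command_id: int) -> Command:
        """Match a response to its original command and drop the entry."""
        return self._pending.pop(command_id)

    def pending(self) -> list[int]:
        return list(self._pending)
```

Commands that need no reply are forwarded without being registered, so the pending list contains only entries that a later response can resolve.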
[0079] D. The disaster recovery management unit S103 is mainly used to automatically enable the standby site with the highest priority as the active site when the production site (also referred to as the active site) fails, and realize rapid recovery of important data.
[0080] Furthermore, the virtual module can restore a single physical disaster target or automatically save data in the event of a power failure (it switches automatically to the backup system and can move the work of the virtual module to another physical server). Once the disaster recovery mechanism is activated, a related virtual machine disaster recovery site takes over from the failed site equipment, and its workflow can be customized.
[0081] E. The data storage management unit S104 is mainly used for data storage and dumping. This unit mainly includes two parts:
[0082] e1) Control module: Monitor and uniformly manage the data storage situation, verify the reliability of the data; at the same time, respond to the issued data upload comparison and upload operation commands.
[0083] e2) Data storage module: Mainly stores business data and non-business data, and classifies business data into important business data and light business data, so that the main business can be restored first in the event of a disaster. When the amount of stored data becomes too large, snapshot technology is used to back up the data to a tape library or optical disc library.
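The classification in e2 implies a restore order in which important business data comes first. A minimal sketch of that ordering follows. The three class names come from the text, while the numeric ranks, the `StoredItem` type, and the function `restore_order` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class DataClass(IntEnum):
    # Lower value restores earlier; the numeric ranks are an assumption.
    IMPORTANT_BUSINESS = 0
    LIGHT_BUSINESS = 1
    NON_BUSINESS = 2


@dataclass(frozen=True)
class StoredItem:
    name: str
    data_class: DataClass


def restore_order(items: list[StoredItem]) -> list[str]:
    """Order items so that important business data is restored first."""
    ranked = sorted(items, key=lambda item: item.data_class)  # stable sort
    return [item.name for item in ranked]
```

Because the sort is stable, items of the same class keep their stored order, which avoids reshuffling data that has equal priority.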
[0084] In the above-mentioned system, each management unit needs to give feedback notifications of its execution status, and forward it through the remote storage and forwarding management unit to ensure the correctness of data transmission from the previous link to the next link.
[0085] Any site may be promoted to the primary site. Therefore, both the primary site and the standby sites should have the functions of a data backup management unit and a disaster recovery management unit. In the figure, S101' and S103' are the units that are started when the data of the current production site is synchronized to the local site after the failed production site has been restored. The remote storage and forwarding process works in the same way but is not marked separately, that is, S102 and S102' are the same.
[0086] In the event of a disaster, when the standby sites detect that the production site is operating abnormally or has lost communication, the standby site with the highest priority automatically takes over the work of the production site. When the production site returns to normal working status, or its communication returns to normal, the data must be synchronized to the production site and master control returned to it.
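The takeover and hand-back rule in this paragraph can be sketched as a small election function. It is an illustration under stated assumptions: the `Site` type, the convention that a smaller number means higher priority, and the function `elect_active` are hypothetical, and the data synchronization that must precede hand-back is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    name: str
    priority: int          # assumption: smaller number means higher priority
    healthy: bool = True


def elect_active(production: Site, standbys: list[Site]) -> Optional[Site]:
    """Pick the site that should hold master control right now."""
    if production.healthy:
        return production              # hand-back: production regains control
    candidates = [s for s in standbys if s.healthy]
    if not candidates:
        return None                    # no site is able to take over
    return min(candidates, key=lambda s: s.priority)
```

In a fuller model, the caller would synchronize data to the recovered production site before acting on a result that returns control to it.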
[0087] See Figure 2, which shows the disaster recovery method of the above disaster recovery system. Specifically:
[0088] Step S200, the data backup management unit backs up the local data.
[0089] During the backup process, the data of the production site can be written online into the virtual modules of the local storage device, and the local primary and secondary sites can access the storage device at the same time, that is, share the same data pool. Dual backup storage devices can also be provided, so that the backup storage device can be activated when a storage device fails. No two of the above virtual modules occupy the same disk partition, and shared virtual modules are reserved to improve protection capabilities. The virtual module can restore a single physical disaster target or automatically save data in the event of a power failure (it switches automatically to the backup system and can move the work of the virtual module to another physical service module).
[0090] When backing up data, all data is backed up the first time, and subsequent backups are incremental, that is, only the changed data is backed up, and the incremental backups are integrated and written to the corresponding storage device (for example, backing up disk data to tape).
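The full-then-incremental scheme can be sketched with per-block digests: the first run copies everything, and later runs copy only blocks whose digest changed. This is a minimal sketch. The block-indexed dictionaries and the names `block_digests`, `plan_backup`, and `merge_increment` are assumptions, and deleted blocks are not handled.

```python
import hashlib
from typing import Optional


def block_digests(blocks: dict[int, bytes]) -> dict[int, str]:
    """Digest every block so that the next run can detect changes."""
    return {idx: hashlib.sha256(data).hexdigest() for idx, data in blocks.items()}


def plan_backup(blocks: dict[int, bytes],
                previous: Optional[dict[int, str]]) -> dict[int, bytes]:
    """Return the blocks to write: everything on the first run, then only changes."""
    if previous is None:
        return dict(blocks)                        # first backup is a full backup
    current = block_digests(blocks)
    return {idx: blocks[idx] for idx in blocks if previous.get(idx) != current[idx]}


def merge_increment(base: dict[int, bytes], increment: dict[int, bytes]) -> dict[int, bytes]:
    """Integrate an incremental backup into the stored copy."""
    merged = dict(base)
    merged.update(increment)
    return merged
```

Merging the increment into the stored copy corresponds to the "integrated and written" step, so the storage device always holds a complete, current image.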
[0091] There is a return arrow in the figure, which means that when the local backup fails, the system will re-execute this backup.
[0092] In step S201, the operation steps are recorded in sequence. Here, a first-in, first-out queue is used to confirm the order of the backup data. If an interruption occurs, an identifier is inserted at the breakpoint, that is, a label is placed locally to identify the change position of the data block in each data volume.
[0093] Step S202: Perform reliability verification on the backup data.
[0094] Reliability verification refers to verifying the correctness of the backup data, that is, the consistency between the backup data and the current local data. It also includes testing the recoverability of the data to ensure its reliability. After the reliability verification, the verification result is returned. If the verification fails, the flow returns to step S200 and the backup operation is performed again.
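The two checks named here, consistency and recoverability, can be shown in one small function. This is a sketch under assumptions: zlib compression stands in for whatever backup image format a real system would use, and `make_backup` and `verify_backup` are hypothetical names.

```python
import hashlib
import zlib


def make_backup(data: bytes) -> bytes:
    """Produce a stored backup image; compression stands in for a real format."""
    return zlib.compress(data)


def verify_backup(source: bytes, image: bytes) -> bool:
    """Check recoverability (the image restores) and consistency (it matches source)."""
    try:
        restored = zlib.decompress(image)        # recoverability test
    except zlib.error:
        return False
    return hashlib.sha256(restored).digest() == hashlib.sha256(source).digest()
```

A `False` result corresponds to the failure branch in the text, where the flow goes back to step S200 and the backup is repeated.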
[0095] Step S203: Taking the data backed up by the data backup management unit as a reference, the remote storage and forwarding management unit remotely mirrors the backup data.
[0096] To ensure the reliability of data transmission during remote replication, a first-in, first-out queue is used to preserve the replication order, which is the same as the order used during backup. If an interruption occurs, an identifier is inserted at the breakpoint, that is, a label is placed locally to identify the position of the changed data block in each data volume.
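Ordered replication with a breakpoint marker might look like the sketch below. The queue holds (volume, block index) pairs in backup order. On an interruption the failed block is recorded as the breakpoint and stays at the head of the queue, so a later run resumes from it. The `ReplicationQueue` class, the `send` callback, and the use of `ConnectionError` to signal an interruption are assumptions.

```python
from collections import deque
from typing import Callable, Optional


class ReplicationQueue:
    """FIFO of (volume, block index) pairs with a breakpoint marker."""

    def __init__(self, blocks: list[tuple[str, int]]) -> None:
        self._queue = deque(blocks)
        self.breakpoint: Optional[tuple[str, int]] = None

    def run(self, send: Callable[[str, int], None]) -> list[tuple[str, int]]:
        """Send blocks in order; on interruption, mark the failed block and stop."""
        sent = []
        while self._queue:
            volume, index = self._queue[0]
            try:
                send(volume, index)
            except ConnectionError:
                self.breakpoint = (volume, index)   # label where to resume
                break
            self._queue.popleft()
            self.breakpoint = None
            sent.append((volume, index))
        return sent
```

Because a block is removed from the queue only after it was sent successfully, calling `run` again after the link recovers continues in the original order without skipping or repeating blocks.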
[0097] At the same time, in the process of data remote replication, the volume-based replication method is used, and the IP-based interconnection protocol is used to remotely replicate the information of the production site to the remote cluster through the existing TCP/IP network.
[0098] Among them, the production site and the remote storage and forwarding management unit use a dedicated network to transmit backup data in a pre-set manner, and at the same time, use heartbeat to monitor the status between each site. Once an abnormality occurs, the disaster recovery mechanism will be automatically activated. Moreover, redundant channels are also provided in the connection of the equipment in the local and remote systems to prepare for the timely replacement of work when the working channel fails.
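Heartbeat monitoring of this kind is commonly implemented as a timeout on the last heartbeat received. The sketch below follows that common pattern and is not taken from the patent. The `HeartbeatMonitor` class, the timeout value, and the explicit `now` timestamps (passed in to keep the example deterministic) are assumptions.

```python
class HeartbeatMonitor:
    """Flags a site as abnormal when no heartbeat arrives within the timeout."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout            # seconds; the value is an assumption
        self._last_seen: dict[str, float] = {}

    def beat(self, site: str, now: float) -> None:
        """Record a heartbeat received from a site at time `now`."""
        self._last_seen[site] = now

    def abnormal_sites(self, now: float) -> list[str]:
        """Return the sites whose last heartbeat is older than the timeout."""
        return sorted(site for site, seen in self._last_seen.items()
                      if now - seen > self.timeout)
```

A non-empty result from `abnormal_sites` would be the trigger for the disaster recovery mechanism described in the text. In a live system the timestamps would typically come from a monotonic clock.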
[0099] Step S204, the remote storage and forwarding management unit performs reliability verification on the backup data.
[0100] The reliability check here refers to checking the correctness of the data mirrored to the remote cluster, that is, the consistency between the data at the production site and the data backed up by the cluster. The recoverability of the data is also tested to ensure the reliability of the data.
[0101] After the reliability verification, the verification result is returned. If the verification fails, the flow returns to step S203, the locally stored data is mirrored to the remote cluster again, and the failure is marked in the record.
[0102] Step S205: The remote storage and forwarding management unit maps the data to the data storage management unit for storage, and simultaneously copies the data to multiple virtual modules.
[0103] In this step, the remote storage and forwarding management unit maps the backup data to the data storage management unit for storage, and stores the backup data on multiple virtual modules. No two virtual modules occupy the same disk partition, and shared virtual modules are reserved. After the data is verified as correct, it is sent concurrently to each standby site.
[0104] Among them, if the amount of data stored in the data storage management unit is too large, it will be backed up to a tape library or an optical disk library using snapshot technology.
[0105] Step S206: The remote storage and forwarding management unit verifies the reliability of the copied data on each virtual module.
[0106] Step S207: The remote storage and forwarding management unit forwards the data to each standby site, so that the standby sites can be updated synchronously.
[0107] Step S208: Each standby site performs reliability verification, and the data storage management unit performs reliability verification on the data backed up locally.
[0108] Among them, each site is connected by a "handshake line", which can monitor the heartbeat and also verify the consistency of the data.
[0109] In step S209, the backup data is quickly restored when a catastrophic failure occurs.
[0110] In this step, the centralized control and management unit comprehensively evaluates the recovery time and efficiency, issues the most economical and efficient recovery plan, and quickly switches services between the local cluster and the remote cluster to ensure business continuity. For example: if only the data fails, restore the backup data; if it is a local physical failure, you can enable the backup device; if a virtual module has a problem, let the other virtual module take over its work while enabling the backup virtual module. If the data center is damaged by a natural disaster, a remote disaster recovery cluster will take over the work of the data center.
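The examples in this paragraph amount to a mapping from failure type to recovery action, which can be written down directly. The four cases come from the text, while the `Failure` enum, the action strings, and the function `recovery_plan` are illustrative. The cost and time evaluation that the centralized control management unit performs is not modeled.

```python
from enum import Enum


class Failure(Enum):
    DATA = "data"
    LOCAL_HARDWARE = "local_hardware"
    VIRTUAL_MODULE = "virtual_module"
    DATA_CENTER = "data_center"


# Mapping taken from the examples in the text; the action strings are illustrative.
RECOVERY_ACTIONS = {
    Failure.DATA: ["restore backup data"],
    Failure.LOCAL_HARDWARE: ["enable backup device"],
    Failure.VIRTUAL_MODULE: ["hand work to another virtual module",
                             "enable backup virtual module"],
    Failure.DATA_CENTER: ["switch to remote disaster recovery cluster"],
}


def recovery_plan(failure: Failure) -> list[str]:
    """Return the ordered recovery actions for a given failure type."""
    return list(RECOVERY_ACTIONS[failure])
```

Keeping the mapping as data rather than branching logic makes it easy to extend with further failure types or to attach cost estimates later.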
[0111] The use of centralized control management can more comprehensively control the flow of data and the working status of various related links, comprehensively evaluate and give a cost-effective recovery strategy, and minimize the economic loss of the enterprise in the event of a disaster.

Example Embodiment

[0112] Example two
[0113] See Figure 3, which shows a schematic diagram of the implementation scheme of the disaster tolerance system of this embodiment. Figure 3 is the special case of Figure 4 in which there is only one standby site. The workflow of the disaster recovery system is explained below with reference to Figure 3 (each management center in this figure refers to the server where the corresponding management unit is located). The system includes: centralized control management center S300; production site S301; standby site S302; remote storage and forwarding management center S303; data storage management center (main) S304; data storage management center (standby) S305.
[0114] The centralized control management center S300 mainly performs overall monitoring of the entire disaster recovery system, issues related configuration operations, and receives the information reported by each management unit forwarded by the remote storage and forwarding management center S303.
[0115] When a disaster occurs, the centralized control management center S300 comprehensively evaluates the working status of each related link and provides a cost-effective recovery strategy so that the business can be restored quickly. Each site can issue upload comparison and data upload operations through the centralized control management center S300 (manual or automatic upload can be set), which makes it possible to synchronize data manually for a site that has recovered from an abnormality. By default, data is synchronized automatically once when the site re-establishes communication. Moreover, the centralized control management center S300 decides whether to upload data incrementally or in full according to how long the site was interrupted.
[0116] The production site S301 backs up the data locally, and maps the backup data to the remote storage and forwarding management center S303.
[0117] The standby site S302, when a disaster occurs, replaces the production site S301 to continue working.
[0118] A handshake line between the production site S301 and the standby site S302 is used to monitor the working condition of the production site S301, and both the production site S301 and the standby site S302 can share the backup data with the remote storage and forwarding management center S303. The communication status of the sites, as well as the backup and restoration status, is forwarded to the centralized control management center S300 through the remote storage and forwarding management center S303.
[0119] The remote storage and forwarding management center S303 maps the data backed up locally by the production site S301 to the data storage management centers S304 and S305 for storage, and copies the backup data to multiple virtual modules in the remote storage and forwarding management center S303. Backup data that passes verification is forwarded to the standby site S302 for synchronization.
[0120] The data storage management center (main) S304 is mainly used to store the data of the production site S301. When the amount of data stored in the data storage management center S304 becomes too large, snapshot technology is used to back it up to a tape library or optical disc library. The data storage management center S304 can monitor the backup process through the data storage module, report expired storage media in the tape library or optical disc library to the user, and inform the user that the data on those media has expired and the media can be reused.
[0121] The data storage management center (standby) S305 has basically the same function as the data storage management center (main) S304 and is mainly used for data storage. The data in the data storage management center (standby) S305 is kept consistent with the data in the data storage management center (main) S304. When the data storage management center (main) S304 fails, the backup data can be obtained from the data storage management center (standby) S305.
[0122] As a supplementary note, for ease of explanation, Figure 3 shows only one remote storage and forwarding management center S303; the backup remote storage and forwarding management center is not drawn.

Example Embodiment

[0123] Example three
[0124] See Figure 4, which shows a schematic diagram of the implementation scheme of the disaster recovery system of this embodiment. It includes: centralized control management center S400, production site S401, backup site 1 S402, backup site N S403, remote storage and forwarding management center (main) S404, remote storage and forwarding management center (standby) S405, data storage management center (main) S406, and data storage management center (standby) S407.
[0125] This figure is a schematic diagram of another implementation scheme of the disaster recovery system of the present invention. Its working principle is basically the same as that of the disaster recovery system with only one backup site shown in Figure 3 and is not repeated here.
[0126] As a supplementary note, when there are multiple backup sites, their priorities can be set in the centralized control management center S400 (generated automatically by default, and modifiable manually).
[0127] The backup remote storage and forwarding management center S405 has the same function as the active remote storage and forwarding management center S404. While the active center S404 is working, the backup center S405 remains in a monitoring state. When the active center S404 stops operating normally, the backup center S405 automatically takes over its work.