A method for backing up electronic files of geological data

By constructing an intelligent, layered backup architecture and adopting dynamic backup strategies and multi-layered security mechanisms, the problems of low storage efficiency and insufficient security in the backup of electronic files of geological data have been solved, and efficient and secure data management and recovery capabilities have been achieved.

CN122309248A · Pending · Publication Date: 2026-06-30 · DEV RES CENT OF CHINA GEOLOGICAL SURVEY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DEV RES CENT OF CHINA GEOLOGICAL SURVEY
Filing Date
2026-04-03
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies for backing up electronic files of geological data suffer from low storage efficiency, insufficient security, and complex management, making it difficult to meet the professional and security protection needs of massive, multi-source, heterogeneous data.

Method used

By employing dynamic backup strategies, hash indexes, automatic partitioning strategies, encrypted storage, and multi-layered security mechanisms, combined with role-based access control and operation audit logs, an intelligent, layered backup architecture is constructed to achieve data block-level redundancy elimination and differentiated permission management.

Benefits of technology

It significantly improves the storage efficiency and data security of the backup system, ensures the reliability and management standardization of geological data, and solves the problems of large storage space occupation, low backup efficiency and insufficient security in traditional backup methods.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to a method for backing up electronic geological data files, belonging to the field of data backup. The method includes: a dynamic backup strategy determination step, which classifies and grades data according to its confidentiality, size, and intended backup medium, matching differentiated strategies and executing backups using a pipelined scheduling approach; a content-based redundancy deduplication step, which uses MD5+SM3+CRC32 calculations to build a global index to eliminate redundant data; a file segmentation and reassembly step, which automatically segments large files exceeding 50GB into 10GB sub-files and records the segmentation metadata; a file storage and security step, which randomly selects 2-3 data blocks based on the size of the electronic geological data file for encrypted replacement, encrypts metadata using the SM4 algorithm, and ensures data security through role-based access control, integrity verification, and audit logs; and a backup check and recovery step, which performs automatic verification, annual sampling checks, and heterogeneous recovery. The method improves storage efficiency through intelligent deduplication and tiered strategies, and enhances data security through encryption and integrity verification.

Description

Technical Field

[0001] This invention relates to the field of data backup technology, specifically a method for backing up electronic files of geological data.

Background Technology

[0002] Electronic geological data files are the core output of geological work and belong to the nation's fundamental data resources. They are characterized by massive data volume, diverse formats, and long storage periods, and they often involve sensitive information. With the rapid development of geoscientific applications such as remote sensing technology, 3D modeling, and artificial intelligence, the granularity of geological data acquisition is becoming increasingly fine and the data volume is growing exponentially; a single project can generate hundreds of terabytes of geological data. To ensure the security of these high-value digital assets, traditional backup methods typically employ simple file copying or rely on general-purpose commercial backup software. However, these general solutions are not optimized for the characteristics of geological data and are insufficient to meet the professional and security protection needs of massive, multi-source, and heterogeneous data environments.

[0003] Specifically, existing technologies suffer from the following significant drawbacks: First, they are inefficient in storage. General full backup strategies generate massive redundant copies when dealing with large amounts of data, consuming a significant amount of storage space. They also lack effective mechanisms for deleting redundant data specific to geological data content, leading to high storage costs. Second, they lack security. Geological data often involves national and commercial secrets. General backup methods typically do not employ encryption measures during storage, making them vulnerable to targeted data breaches and lacking fine-grained access control. Third, their strategies are rigid and complex to manage. Existing backup strategies are out of touch with the actual business scenarios of geological data classification and hierarchy, failing to differentiate scheduling based on data importance and access frequency, resulting in excessively long backup windows and insufficient protection for critical data. Furthermore, the lack of effective organization of backup files and metadata management often leads to difficulties in locating data and complex recovery processes when verifying data integrity or restoring specific historical versions. Therefore, there is an urgent need for a specialized backup method specifically designed for the characteristics of electronic geological data files, capable of balancing storage efficiency, data security, and intelligent strategies.

Summary of the Invention

[0004] In order to solve the problems of the prior art, the present invention provides a method for backing up electronic files of geological data.

[0005] To solve the above-mentioned technical problems, the present invention adopts the following technical solution: a method for backing up electronic files of geological data, comprising the following steps:

S1: Dynamic backup strategy determination step: Electronic geological data files are classified and graded according to their confidentiality, data size, and intended backup media, and differentiated backup strategies are set for different levels of electronic geological data files and different types of backup media. According to the backup strategy, hard disk backups are performed first, and tape and optical disc backups are synchronized according to the work task schedule. Based on a pre-calculation of the execution time of three types of tasks (read, write, and backup), a pipelined structure is used to automatically schedule backup tasks. For files exceeding 50GB, an automatic splitting strategy is used; for other files and the split data, a hash mode is used to remove redundancy and duplicates.

S2: Content-based redundancy deduplication step: The electronic geological data files are divided into blocks according to their size. Geological data files exceeding 50GB are automatically split into multiple sub-files of 10GB each. A global hash index is established by calculating a strong hash value for each data block using the MD5+SM3 algorithm and supplementing it with a CRC32 value. The hash value of each new data block is compared with the index. If the same hash value exists, the block is determined to be a duplicate block, and only a reference pointer is recorded in the backup metadata. If it does not exist, the data block is stored and the hash index is updated.

S3: File splitting and reassembly step: For unsplit electronic geological data files, basic metadata is generated and stored, recording the original file name, total size, and file hash value. For electronic geological data files that have been split, segmentation metadata is generated and stored, recording the original file name, total size, number of segmented blocks, and the hash value and order of each block.

S4: File storage and security step: Based on the size of the electronic geological data file, 2-3 data blocks are randomly selected for encryption and replacement, and the relevant encryption block information is recorded; based on the SM4 algorithm, the catalog information, segmentation information, hash index information, and encrypted data block information of the electronic geological data files are encrypted and stored; access permissions for backup data are managed based on a role- and user-based access control model; hash values are calculated and stored for each backed-up data block and metadata record, and integrity and originality scans are performed periodically; all backup, restore, deletion, configuration change, and other operations are recorded, and audit logs are generated and stored in encrypted form.

S5: Backup check and recovery step: After each backup task is completed, backup verification is automatically performed, including hash verification and random sampling recovery verification. Each year, 10% of all backup data is sampled and checked: optical discs are checked by readback, magnetic tapes are checked by recovery read, and hard drives are checked by SMART detection plus hash verification. For problems found during the inspection, data recovery is performed using a backup and recovery method based on different media. A backup inspection report is generated based on the inspection findings, recording the inspection time, media number, inspection status, and inspector information.

[0006] In one specific implementation of the first aspect, the file optimization strategy in the dynamic backup strategy determination step further includes: using an automatic splitting strategy for files exceeding 50GB, and using a hash mode for redundancy removal for other files and the split data.

[0007] In one specific implementation of the first aspect, hash calculation and indexing further includes: calculating hash values using the MD5+SM3 algorithm and establishing a global hash index library after supplementing with CRC32 values.

[0008] In one specific implementation of the first aspect, large file splitting further includes: automatically splitting geological data files exceeding 50GB into multiple sub-files of 10GB each, and automatically padding the last sub-file when it is smaller than 10GB.

[0009] In one specific implementation of the first aspect, metadata protection further includes: randomly selecting 2-3 data blocks based on the size of the electronic file of geological data for encrypted replacement, and encrypting and storing directory information, segmentation information, hash index information and encrypted data block information based on the SM4 algorithm.

[0010] In one specific implementation of the first aspect, the backup check further includes: performing a 10% sampling check on all backup data annually, and employing differentiated check methods for different media.

[0011] The beneficial effects of this invention are as follows: 1. By constructing an intelligent, layered backup architecture tailored to the characteristics of electronic geological data files, the storage efficiency, data security, and policy execution flexibility of the backup system are significantly improved. At the storage optimization level, this invention abandons the traditional full-copy mode and adopts an MD5+SM3+CRC32 hash index mechanism to achieve global redundancy elimination at the data block level. Ultra-large files exceeding 50GB are automatically split into 10GB sub-files, and a final sub-file smaller than 10GB is padded. Combined with a pipelined scheduling structure that pre-calculates read, compute, and write task times, this balances system I/O load and computing resource utilization and avoids the prolonged backup window caused by a single task occupying resources for a long time. It thereby addresses the large storage space consumption and low backup efficiency encountered when backing up massive amounts of geological data.

[0012] 2. This invention constructs a defense-in-depth system through multiple technical means. By randomly selecting data blocks for encryption in electronic geological data files, and using the SM4 algorithm to encrypt and store core metadata such as directory information, segmentation mapping, hash index library, and data encryption index, the possibility of restoring the file structure through media content without a key is eliminated, significantly enhancing data security. Combined with a role-based access control model and an encrypted anti-tampering mechanism for operation audit logs, end-to-end traceability and non-repudiation are achieved from access permissions to operational behavior. Furthermore, this invention establishes a routine data integrity verification mechanism. After each backup, hash verification and random sampling recovery verification are automatically performed, and all backup data are subjected to differentiated sampling checks based on media type annually (optical disc readback, tape recovery reading, hard disk SMART detection combined with hash verification). Problems discovered are repaired using a non-same-media recovery strategy, ensuring that backup data is not only complete during static storage but also reliable and usable during dynamic recovery, thereby comprehensively improving the reliability and management standardization of long-term preservation of electronic geological data files.

Attached Figure Description

[0013] Figure 1 is a schematic diagram of the overall process of the method of the present invention.

[0014] Figure 2 is a detailed flowchart of the dynamic backup strategy determination step of the present invention.

[0015] Figure 3 is a schematic diagram of the strategy execution scheduling pipeline structure of the present invention.

[0016] Figure 4 is a detailed flowchart of the content-based redundancy deduplication step of the present invention.

[0017] Figure 5 is a schematic diagram of the module structure of the file storage and security step of the present invention.

[0018] Figure 6 is a detailed flowchart of the backup check and recovery step of the present invention.

Detailed Implementation

[0019] The technical solutions of the present invention will be clearly and completely described below with reference to the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.

[0020] Figures 1 to 6 illustrate a method for backing up electronic files of geological data.

[0021] To make the objectives, technical solutions, and advantages of this invention clearer, a method for backing up electronic geological data proposed by this invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. Those skilled in the art will understand that the embodiments described herein are merely illustrative of the invention and are not intended to limit the scope of protection of this invention.

[0022] It should be noted that in the description of this invention, the terms "first," "second," etc., are used only for distinguishing descriptions and should not be construed as indicating or implying relative importance. The term "comprising" and any variations thereof are intended to cover non-exclusive inclusion.

[0023] This invention provides a method for backing up electronic geological data files. Its core lies in achieving reliable protection of massive, multi-format, and high-value electronic geological data files through intelligent strategy management, efficient data deduplication and segmentation, and multi-layered security mechanisms. The specific embodiments of the present invention are described in detail below with reference to Figures 1 to 6.

[0024] Example: This embodiment uses data generated from a geological survey project as an example to illustrate in detail the complete process of backing up data using the method of this invention.

[0025] As shown in Figure 1, the method of the present invention mainly includes the following steps: S1 dynamic backup strategy determination, S2 content-based redundancy deduplication, S3 file segmentation and reassembly, S4 file storage and security, and S5 backup check and recovery.

[0026] S1: Dynamic backup strategy determination step. As shown in Figure 2, this step aims to develop and implement differentiated backup strategies based on the characteristics of geological data and actual operational needs.

[0027] First, the data is classified and graded. The system receives the electronic geological data files to be backed up and automatically identifies and classifies them based on their metadata attributes, such as confidentiality level, data size, and intended backup medium. For example, it assigns a protection class according to the confidentiality level of each file, orders the backup by data size (smaller files are backed up first), and groups the backup by intended medium, with hard disk backup taking priority and tape and optical disc backups synchronized afterwards.

[0028] Secondly, dynamic strategy matching is performed. The system has a preset strategy library to match corresponding backup strategies for different categories and levels of data.
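The classification and strategy matching described above can be sketched roughly as follows. This is an illustrative sketch only: the confidentiality labels, the contents of the strategy library, and the names ArchiveFile, STRATEGY_LIBRARY, match_strategy, backup_order, and media_order are assumptions, since the patent does not specify them. Only the "smaller files first" and "hard disk first" orderings come from the text.

```python
from dataclasses import dataclass

# Hypothetical preset strategy library keyed by confidentiality level.
# The labels and values are illustrative; the patent does not list them.
STRATEGY_LIBRARY = {
    "public":   {"encrypt_blocks": False, "copies": 1},
    "internal": {"encrypt_blocks": True,  "copies": 2},
    "secret":   {"encrypt_blocks": True,  "copies": 3},
}

# Hard disk is backed up first; tape and optical disc follow on schedule.
MEDIA_PRIORITY = {"disk": 0, "tape": 1, "optical": 2}


@dataclass(frozen=True)
class ArchiveFile:
    name: str
    confidentiality: str   # a key of STRATEGY_LIBRARY
    size_bytes: int


def match_strategy(f: ArchiveFile) -> dict:
    """Look up the preset strategy for a file's confidentiality level."""
    return STRATEGY_LIBRARY[f.confidentiality]


def backup_order(files):
    """Order files so that smaller files are backed up first."""
    return sorted(files, key=lambda f: f.size_bytes)


def media_order(media):
    """Order the intended media so that hard disk comes first."""
    return sorted(media, key=MEDIA_PRIORITY.__getitem__)
```

A real strategy library would likely key on more than one attribute (for example level plus medium); a single key keeps the sketch short.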

[0029] Next, the backup media strategy is executed. The scheduler prioritizes hard disk backup tasks according to the pre-defined strategy. Simultaneously, the system, based on preset task schedules (e.g., weekly or monthly), reminds users to start the tape drive and optical disc burner, rewriting the data backed up to the hard disk onto the tape and optical disc, thus achieving heterogeneous backup.

[0030] Then, policy execution scheduling is performed. To improve backup efficiency, the scheduler employs a pipelined architecture, as shown in Figure 3. When a backup task starts, the system first analyzes the backup content and pre-calculates the estimated execution time for three types of sub-tasks: "read" (reading data from the source), "calculate" (calculating hash values and dividing data into blocks), and "write" (writing to the backup medium). Based on the pre-calculation results, the scheduler automatically allocates system resources to different data streams, enabling "read," "calculate," and "write" operations to be processed in parallel, avoiding resource idleness and maximizing the utilization of I/O and computing power. For example, while the system is reading file A, it can simultaneously calculate the hash value of file B, which has already been read into memory, and write the calculated data block C to the disk.
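A minimal sketch of the three-stage pipeline, assuming one thread per stage connected by bounded queues. The function name run_pipeline, the in-memory `sources` and `sink` dictionaries, and the use of SHA-256 as the hash are stand-ins chosen for illustration; the pre-calculation of task times and the resource allocation described above are not modelled.

```python
import hashlib
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(sources, sink):
    """Run read -> calculate -> write as three concurrent stages.

    sources: dict of file name -> bytes (stands in for reads from the source).
    sink: dict receiving name -> (digest, data) (stands in for the medium).
    """
    read_q = queue.Queue(maxsize=4)   # bounded queues apply back-pressure
    write_q = queue.Queue(maxsize=4)

    def reader():
        for name, data in sources.items():
            read_q.put((name, data))
        read_q.put(_DONE)

    def calculator():
        while (item := read_q.get()) is not _DONE:
            name, data = item
            write_q.put((name, hashlib.sha256(data).hexdigest(), data))
        write_q.put(_DONE)

    def writer():
        while (item := write_q.get()) is not _DONE:
            name, digest, data = item
            sink[name] = (digest, data)

    threads = [threading.Thread(target=t) for t in (reader, calculator, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

With this layout the reader can already be fetching file A while the calculator hashes file B and the writer stores block C, which is the overlap the paragraph describes.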

[0031] Finally, the system determines the file optimization strategy. Files exceeding 50GB are marked and sent to step S3 for file splitting. Files of 50GB or less, and the sub-files produced by splitting, go directly to step S2, where hash-based redundancy removal is applied.

[0032] S2: Content-based redundancy deduplication step. As shown in Figure 4, this step aims to identify and eliminate duplicate data, significantly saving storage space.

[0033] First, hash calculation and indexing are performed. For each electronic geological data file or each segmented data block, the system uses both MD5 and SM3 hash algorithms to calculate its hash value, generating a composite hash value. Simultaneously, the system also calculates a CRC32 cyclic redundancy check value for each electronic geological data file or data block. This composite hash value, along with the CRC32 value, is stored as a unique fingerprint of the data block in the global hash index library. The combination of MD5 and SM3 effectively avoids the collision risk of a single hash algorithm, greatly enhancing the uniqueness and collision resistance of the fingerprint; CRC32 is used for subsequent rapid data integrity verification.
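The fingerprint described above might be computed as in the sketch below, which concatenates the MD5 and SM3 digests and pairs the result with a CRC32 value. The function names are hypothetical, and how the patent actually combines the two digests is not stated; concatenation is an assumption. SM3 is only available through Python's hashlib when the underlying OpenSSL build provides it, so the sketch falls back to SHA-256 purely to stay runnable.

```python
import hashlib
import zlib


def strong_hash(data: bytes) -> str:
    """Concatenate the MD5 and SM3 hex digests of a block."""
    md5 = hashlib.md5(data).hexdigest()
    try:
        second = hashlib.new("sm3", data).hexdigest()
    except ValueError:
        # Stand-in only: SHA-256 is NOT SM3. It is used here so the sketch
        # still runs where OpenSSL was built without SM3 support.
        second = hashlib.sha256(data).hexdigest()
    return md5 + second


def fingerprint(data: bytes):
    """Return (composite hash, CRC32), the key stored in the global index."""
    return strong_hash(data), zlib.crc32(data)
```

Because a collision in the concatenated value requires a simultaneous collision in both digests, the composite is at least as collision resistant as the stronger of the two algorithms.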

[0034] Finally, deduplication is performed. For newly generated data blocks, the system searches and compares their composite hash values against the global hash index.

[0035] Duplicate Blocks: If the same compound hash value already exists in the index, the data block is determined to be a duplicate block, and the identical content is confirmed through binary comparison. For identical content, the system no longer stores the actual content of the data block, but only records a reference pointer to the already stored data block in the metadata of this backup.

[0036] Unique Block: If the hash value does not exist in the index, it is considered a unique block. The system compresses the data block and stores it in the backup storage pool, then adds its hash value and CRC32 value to the index, completing the data entry.
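The duplicate and unique block handling above can be sketched with an in-memory index. DedupStore and its methods are hypothetical names, SHA-256 stands in for the MD5+SM3 composite hash, and zlib stands in for whatever compression the system uses; a real index and storage pool would be persistent.

```python
import hashlib
import zlib


class DedupStore:
    """In-memory stand-in for the global hash index and backup storage pool."""

    def __init__(self):
        self.index = {}   # fingerprint -> block id
        self.pool = {}    # block id -> compressed block

    @staticmethod
    def _fingerprint(data: bytes):
        # SHA-256 stands in for the MD5+SM3 composite hash of the text.
        return hashlib.sha256(data).hexdigest(), zlib.crc32(data)

    def put(self, data: bytes):
        """Store a block; return (block id, True if it was a duplicate)."""
        key = self._fingerprint(data)
        block_id = self.index.get(key)
        if block_id is not None:
            # Index hit: confirm with a byte-for-byte comparison before reuse.
            if zlib.decompress(self.pool[block_id]) == data:
                return block_id, True
            raise ValueError("fingerprint collision with different content")
        block_id = len(self.pool)
        self.pool[block_id] = zlib.compress(data)
        self.index[key] = block_id
        return block_id, False

    def backup(self, blocks):
        """Return this backup's metadata: an ordered list of block references."""
        return [self.put(b)[0] for b in blocks]
```

Backing up three blocks where the first and third are identical stores two blocks and records the references `[0, 1, 0]`.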

[0037] S3: File splitting and reassembly step. This step is specifically designed for handling very large files to optimize transfer and storage efficiency.

[0038] First, large files are split. When geological data files exceeding 50GB are identified in step S1, the system automatically splits them into multiple 10GB sub-files. This size selection balances transfer efficiency, storage media compatibility, and flexibility for subsequent recovery.

[0039] Secondly, metadata is recorded. During the splitting process, the system generates a separate split metadata file. This file details the original file's name, total size, modification time, total number of split blocks (N blocks), the name (or ID) of each sub-file, the logical order of each sub-file within the original file (1..N), and the hash value of each sub-file. This metadata file is crucial for accurately and completely reconstructing the original file later.
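A rough sketch of splitting with metadata and of the matching reassembly follows. The threshold and part size are parameters so that the example can be exercised with small values in place of 50GB and 10GB. Zero-byte padding of the last part, hashing the padded part, the part naming scheme, and the use of SHA-256 are assumptions; the modification time field is omitted for brevity.

```python
import hashlib
import io


def split_stream(stream, name, total_size, threshold, part_size):
    """Split a stream larger than `threshold` into `part_size` parts.

    Returns (parts, metadata). The final part is zero-padded to `part_size`;
    `total_size` in the metadata lets reassembly trim the padding again.
    """
    if total_size <= threshold:
        data = stream.read()
        meta = {"name": name, "total_size": total_size,
                "file_hash": hashlib.sha256(data).hexdigest()}
        return [data], meta
    parts, entries = [], []
    while chunk := stream.read(part_size):
        chunk = chunk.ljust(part_size, b"\0")   # pad a short final part
        parts.append(chunk)
        entries.append({"order": len(parts),
                        "id": f"{name}.part{len(parts):04d}",
                        "hash": hashlib.sha256(chunk).hexdigest()})
    meta = {"name": name, "total_size": total_size,
            "part_count": len(parts), "parts": entries}
    return parts, meta


def reassemble(parts, meta):
    """Verify each part against its recorded hash, then join and trim."""
    for chunk, entry in zip(parts, meta.get("parts", [])):
        if hashlib.sha256(chunk).hexdigest() != entry["hash"]:
            raise ValueError(f"part {entry['order']} is corrupted")
    return b"".join(parts)[: meta["total_size"]]
```

For instance, a 25,600-byte stream with a 5,000-byte threshold and 10,000-byte parts yields three equal-sized parts, the last padded, and reassembles to the original bytes.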

[0040] S4: File storage and security step. As shown in Figure 5, this step provides comprehensive security for the backup data.

[0041] First, protect the electronic files or data blocks of geological data. Randomly select 2-3 data blocks based on the size of the electronic files of geological data for encryption and replacement, and record the relevant encryption block information, including but not limited to the encryption file identifier, total size, hash before encryption, hash after encryption, encryption position offset, and encryption data size.
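The block selection and the encryption record could look roughly like the sketch below. The patent says the number of blocks (2 or 3) depends on file size but gives no threshold, so `large_file_bytes` is a hypothetical parameter, as are the function and field names. The cipher itself is left out: the record function simply receives the block before and after encryption, and SHA-256 is used for the recorded hashes as an assumption.

```python
import hashlib
import random


def pick_blocks_to_encrypt(file_size, block_size, large_file_bytes, rng=None):
    """Randomly choose 2 block indices for smaller files, 3 for larger ones."""
    rng = rng or random.Random()
    block_count = (file_size + block_size - 1) // block_size   # ceiling
    wanted = 3 if file_size >= large_file_bytes else 2
    return sorted(rng.sample(range(block_count), min(wanted, block_count)))


def encryption_record(file_id, file_size, block_size, index, before, after):
    """Build the record stored for one encrypted-and-replaced block."""
    return {
        "file_id": file_id,
        "total_size": file_size,
        "hash_before": hashlib.sha256(before).hexdigest(),
        "hash_after": hashlib.sha256(after).hexdigest(),
        "offset": index * block_size,
        "encrypted_size": len(after),
    }
```

Without these records, which are themselves stored encrypted, the replaced blocks cannot be located and reversed, so the file on the medium stays incomplete to anyone lacking the key.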

[0042] Secondly, metadata protection is implemented. All critical metadata, including file directory information, file attributes, segmented metadata generated in step S3, and the global hash index in step S2, is encrypted and stored using the SM4 block cipher algorithm approved by the State Cryptography Administration. Even if the storage medium is stolen or lost, attackers without the key cannot decipher the file's directory structure and the relationships between data blocks, and therefore cannot recover the original file.

[0043] Thirdly, data access control is implemented. The system adopts a role-based access control model. For example, roles such as "Backup Administrator," "Recovery Operator," "Auditor," and "Regular User" are defined. The Backup Administrator can configure policies and manage storage pools; the Recovery Operator only has the permission to perform data recovery; the Auditor can view audit logs; and regular users can only query data within their authorized scope. Staff in different business processes can only access the specific data required for their work and perform specific operations, effectively preventing unauthorized data access.
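The four example roles can be expressed as a simple permission table, as in this sketch. The Action names, the ROLE_PERMISSIONS table, and the is_allowed function are illustrative assumptions; only the role-to-capability mapping follows the paragraph above.

```python
from enum import Enum, auto


class Action(Enum):
    CONFIGURE_POLICY = auto()
    MANAGE_STORAGE_POOL = auto()
    RESTORE = auto()
    VIEW_AUDIT_LOG = auto()
    QUERY = auto()


# Role -> permitted actions, following the four example roles in the text.
ROLE_PERMISSIONS = {
    "backup_administrator": {Action.CONFIGURE_POLICY, Action.MANAGE_STORAGE_POOL},
    "recovery_operator": {Action.RESTORE},
    "auditor": {Action.VIEW_AUDIT_LOG},
    "regular_user": {Action.QUERY},
}


def is_allowed(role, action, dataset=None, authorized_datasets=()):
    """Deny by default; regular users are also limited to authorized datasets."""
    if action not in ROLE_PERMISSIONS.get(role, set()):
        return False
    if role == "regular_user":
        return dataset in authorized_datasets
    return True
```

Unknown roles and unlisted actions are rejected, so a new role gains nothing until it is added to the table explicitly.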

[0044] Next, integrity verification and tamper protection are performed. The system calculates and stores an additional HMAC (hash-based message authentication code) value for each stored data block and each metadata record. The system also periodically starts background scanning tasks that recalculate the HMAC values of existing data blocks and metadata and compare them with the previously stored values. Any change to the data, whether intentional tampering or unintentional damage (e.g., bad sectors on the media), results in a mismatch, which immediately triggers an alarm and identifies the affected file or data block.
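A minimal sketch of the background scan using Python's standard hmac module. The patent does not name the hash underlying the HMAC, so HMAC-SHA256 is an assumption here, and make_tag and scan are hypothetical names.

```python
import hashlib
import hmac


def make_tag(key: bytes, data: bytes) -> str:
    """Compute the HMAC stored alongside a data block or metadata record."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def scan(key, blocks, stored_tags):
    """Return the ids of blocks whose recomputed HMAC differs from the stored one."""
    return [block_id for block_id, data in blocks.items()
            if not hmac.compare_digest(make_tag(key, data), stored_tags[block_id])]
```

Because the tag depends on a secret key, an attacker who alters a block cannot recompute a matching tag, which a plain hash stored next to the data would not prevent.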

[0045] Finally, audit logs are generated. The system automatically and comprehensively records all critical operations, including but not limited to: who performed the operation, when, from which IP address, what operation was performed (e.g., backup, restore, deletion, configuration modification), what the operation acted on, and the result. These logs are also encrypted and protected against tampering, forming a complete and non-repudiable audit chain to meet the needs of data compliance and post-event traceability.
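The patent does not say how the audit log is protected against tampering. One common option, shown here purely as an assumption, is to chain the entries: each entry carries a keyed MAC that also covers the MAC of the previous entry, so that editing or removing an earlier entry invalidates the chain. Encryption of the log is not shown, and the field and function names are hypothetical.

```python
import hashlib
import hmac
import json


def _mac(key, body):
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def append_entry(log, key, user, time, ip, operation, target, result):
    """Append an entry whose MAC covers its fields and the previous MAC."""
    body = {"user": user, "time": time, "ip": ip, "operation": operation,
            "target": target, "result": result,
            "prev": log[-1]["mac"] if log else ""}
    log.append({**body, "mac": _mac(key, body)})


def verify_log(log, key):
    """Return True only if every entry is intact and correctly chained."""
    prev = ""
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "mac"}
        if body["prev"] != prev:
            return False
        if not hmac.compare_digest(_mac(key, body), entry["mac"]):
            return False
        prev = entry["mac"]
    return True
```

Changing the result field of an early entry after the fact makes `verify_log` return False, because the stored MAC no longer matches the recomputed one.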

[0046] S5: Backup check and recovery step. As shown in Figure 6, this step is the last line of defense to ensure the long-term availability of backup data.

[0047] First, a backup verification process is performed. After each backup task is completed, the system does not immediately terminate; instead, it automatically triggers the verification process.

[0048] Hash verification: The system randomly selects a certain proportion of data blocks from the backup storage, recalculates their hash values, and compares them with the values recorded in the metadata to ensure that the data is not corrupted during the writing process.

[0049] Random sampling recovery: The system randomly selects several files or data blocks and attempts a complete recovery exercise, restoring the data to a temporary area and verifying whether the recovered files can be opened and read normally. This ensures the logical recoverability of the data.

[0050] Secondly, backup checks are performed. In accordance with regulations such as the "Electronic Document Archiving and Management Measures," the system conducts a 10% sampling check of all backup data annually. Specific check methods are used for different media. Optical disc backup: The data on the optical disc is read using a sampling readback method and compared with the original data to check for data errors caused by optical disc aging or scratches.

[0051] Tape backup: Using a sampling recovery reading method, the data in the tape is restored to a temporary area on the disk and its integrity is verified. It is also checked whether the data is unreadable due to moisture, demagnetization, or other reasons.

[0052] Hard drive backup: SMART (Self-Monitoring, Analysis and Reporting Technology) is used to assess the health of the hard drive, and hash check is used to check data integrity.
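The annual 10% sampling check can be sketched as below. The media-specific steps (physical readback, tape restore, SMART queries) depend on hardware and tools outside this sketch, so they are represented only by a caller-supplied `read_block` function and a label per medium. The function names and the use of SHA-256 are assumptions.

```python
import hashlib
import random

# Check method per medium, as listed in the text.
CHECK_METHOD = {
    "optical": "sampled readback",
    "tape": "sampled recovery read",
    "disk": "SMART detection plus hash verification",
}


def annual_sample(block_ids, rng=None, percent=10):
    """Pick `percent` of the block ids (rounded up) without replacement."""
    rng = rng or random.Random()
    ids = sorted(block_ids)
    if not ids:
        return []
    return rng.sample(ids, (len(ids) * percent + 99) // 100)


def check_blocks(medium, read_block, recorded_hashes, rng=None):
    """Re-read the sampled blocks and report those whose hash no longer matches."""
    sampled = annual_sample(recorded_hashes, rng)
    failed = [b for b in sampled
              if hashlib.sha256(read_block(b)).hexdigest() != recorded_hashes[b]]
    return {"medium": medium, "method": CHECK_METHOD[medium],
            "sampled": len(sampled), "failed": failed}
```

The returned dictionary holds the kind of information that would feed the inspection report: the medium, the check method, how many blocks were sampled, and which ones failed.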

[0053] Then, backup recovery is performed. If problems are detected during routine verification or annual checks, or if data is actually lost, the system performs data recovery. During recovery, the system follows a core principle: do not recover from the same type of medium. For example, if a data block on a hard drive is found to be corrupted, the system will not read it from a copy on another hard drive, but will instead prioritize reading the data block from tape or optical disc, which reduces the risk of relying on media from the same batch and improves the success rate of data recovery.
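The "different medium" rule can be sketched as a small selection function. The preference order among the remaining media and the decision to return None when no copy exists on another medium are assumptions; the text only states that a copy on a different medium is preferred. The function name is hypothetical.

```python
def choose_recovery_source(failed_medium, available_copies,
                           preference=("tape", "optical", "disk")):
    """Pick a copy on a medium other than the one that failed.

    available_copies: dict of medium -> list of copy ids.
    Returns (medium, copy id), or None if only same-medium copies exist.
    """
    for medium in preference:
        if medium != failed_medium and available_copies.get(medium):
            return medium, available_copies[medium][0]
    return None
```

For a corrupted hard disk block with copies on another disk and on tape, the function returns the tape copy and never the second disk.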

[0054] Finally, an inspection report is generated. After each inspection, the system automatically generates a detailed backup inspection report, which includes: inspection time, inspector, the inspected backup media number, the items inspected (hash check / SMART), inspection results (normal / abnormal), and a detailed description of any issues found and recommended solutions. This report provides a basis for ensuring the compliance and traceability of data management.

[0055] In summary, this invention, through the coordinated operation of the five steps described above, constructs a closed-loop backup management system for electronic geological data files. It not only solves the problems of low storage efficiency and slow recovery in traditional backup methods, but also ensures the long-term security, reliability, and availability of geological data—a fundamental national data resource—through refined strategies, intelligent deduplication and segmentation, and multi-layered security protection.

[0056] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A method of backing up electronic files of geological data, characterized in that it includes the following steps:

S1: Dynamic backup strategy determination step: Electronic geological data files are classified and graded according to their confidentiality, data size, and intended backup media, and differentiated backup strategies are set for different levels of electronic geological data files and different types of backup media. According to the backup strategy, hard disk backups are performed first, and tape and optical disc backups are synchronized according to the work task schedule. Based on a pre-calculation of the execution time of three types of tasks (read, write, and backup), a pipelined structure is used to automatically schedule backup tasks. For files exceeding 50GB, an automatic splitting strategy is used; for other files and the split data, a hash mode is used to remove redundancy and duplicates.

S2: Content-based redundancy deduplication step: The electronic geological data files are divided into blocks according to their size. Geological data files exceeding 50GB are automatically split into multiple sub-files of 10GB each. A global hash index is established by calculating a strong hash value for each data block using the MD5+SM3 algorithm and supplementing it with a CRC32 value. The hash value of each new data block is compared with the index. If the same hash value exists, the block is determined to be a duplicate block, and only a reference pointer is recorded in the backup metadata. If it does not exist, the data block is stored and the hash index is updated.

S3: File splitting and reassembly step: For unsplit electronic geological data files, basic metadata is generated and stored, recording the original file name, total size, and file hash value. For electronic geological data files that have been split, segmentation metadata is generated and stored, recording the original file name, total size, number of segmented blocks, and the hash value and order of each block.

S4: File storage and security step: Based on the size of the electronic geological data file, 2-3 data blocks are randomly selected for encryption and replacement, and the relevant encryption block information is recorded; based on the SM4 algorithm, the catalog information, segmentation information, hash index information, and encrypted data block information of the electronic geological data files are encrypted and stored; access permissions for backup data are managed based on a role- and user-based access control model; hash values are calculated and stored for each backed-up data block and metadata record, and integrity and originality scans are performed periodically; all backup, restore, deletion, configuration change, and other operations are recorded, and audit logs are generated and stored in encrypted form.

S5: Backup check and recovery step: After each backup task is completed, backup verification is automatically performed, including hash verification and random sampling recovery verification. Each year, 10% of all backup data is sampled and checked: optical discs are checked by readback, magnetic tapes are checked by recovery read, and hard drives are checked by SMART detection plus hash verification. For problems found during the inspection, data recovery is performed using a backup and recovery method based on different media. A backup inspection report is generated based on the inspection findings, recording the inspection time, media number, inspection status, and inspector information.

2. The method of claim 1, wherein, in the dynamic backup strategy determination step, the file optimization strategy further includes: using an automatic splitting strategy for files exceeding 50GB, and using a hash mode to remove redundancy for other files and the split data.

3. The method of claim 1, wherein, in the content-based redundancy deduplication step, the hash calculation and indexing further includes: calculating the hash value using the MD5+SM3 algorithm, and establishing a global hash index library after supplementing it with the CRC32 value.

4. The method of claim 1, wherein, in the file splitting and reassembly step, the large file splitting further includes: automatically splitting geological data files exceeding 50GB into multiple sub-files of 10GB each, and supplementing any part of the final sub-file that is less than 10GB.

5. The method of claim 1, wherein, in the file storage and security step, the metadata protection further includes: randomly selecting 2-3 data blocks based on the size of the electronic geological data file and encrypting and replacing them; and encrypting and storing directory information, segmentation information, hash index information, and encrypted data block information based on the SM4 algorithm.

6. The method of claim 1, wherein the backup check and recovery step further includes: conducting a 10% sampling check on all backup data annually, and adopting differentiated check methods for different media.