Data deduplication method, device, equipment, storage medium and program

By using a hierarchical fingerprint processing mechanism to filter and verify data blocks, the problem of high computational complexity or high false positive rate of hash algorithms is solved, and an efficient and reliable data deduplication process is achieved.

CN122308728A · Pending · Publication Date: 2026-06-30 · BEIJING KECHENG TECH DEV CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING KECHENG TECH DEV CO LTD
Filing Date
2026-03-16
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, hash algorithms have high computational complexity or a high probability of misjudgment during data deduplication, resulting in low efficiency of online deduplication.

Method used

A hierarchical fingerprint processing mechanism is adopted. The first-level fingerprint index table is used to filter non-repeating data blocks, the anchor index table is used for local feature verification, and the second-level fingerprint index table is used for final confirmation, thereby reducing high-overhead computation.

Benefits of technology

It improves the efficiency and reliability of online deduplication, reduces computational overhead and the false positive rate, and allows the balance between performance and security to be adapted to different business needs.

✦ Generated by Eureka AI based on patent content.

Abstract

This application provides a data deduplication method, apparatus, device, storage medium, and program. The method includes: segmenting a target data stream into multiple data blocks to be stored; determining candidate data blocks from the multiple data blocks based on a first-level fingerprint index table; determining a set of anchor point feature values corresponding to the candidate data blocks, and performing duplicate verification on the candidate data blocks using the anchor point index table according to the set of anchor point feature values, obtaining a first verification result; if the first verification result indicates that the candidate data block is non-duplicate data, then performing duplicate verification on the candidate data block using a second-level fingerprint index table, obtaining a second verification result; and marking candidate data blocks where the first verification result indicates duplicate data or the second verification result indicates duplicate data for deduplication. The method of this application can improve the efficiency of online deduplication.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a data deduplication method, apparatus, device, storage medium, and program.

Background Technology

[0002] In today's era of explosive data growth, the process of synchronizing massive amounts of data across data centers, sharing storage among multiple tenants, and archiving real-time data often results in the duplicate storage of identical or highly similar data blocks. Online deduplication is used to remove duplicates before data is written, thus avoiding wasted storage resources.

[0003] In existing technologies, fingerprint calculations can be performed on data blocks using hash algorithms (such as cryptographic hashing or non-cryptographic hashing) to obtain the fingerprint value of each data block, and deduplication can be achieved by comparing the fingerprint values. Cryptographic hashing has the advantage of strong collision resistance, but its high computational complexity leads to high CPU resource consumption. Non-cryptographic hashing is fast but has a high probability of false positives, requiring additional byte-level verification or secondary hash calculations to reduce the risk of false positives; this introduces additional performance overhead and makes online deduplication inefficient.

Summary of the Invention

[0004] This application provides a data deduplication method, apparatus, device, storage medium, and program to solve the technical problem of low efficiency in online deduplication.

[0005] In a first aspect, this application provides a data deduplication method, including:

[0006] Segmenting the target data stream into multiple data blocks to be stored;

[0007] Determining candidate data blocks from the multiple data blocks to be stored based on the first-level fingerprint index table;

[0008] Determining the set of anchor feature values corresponding to the candidate data block, and performing duplicate verification on the candidate data block through the anchor index table according to that set, to obtain the first verification result;

[0009] If the first verification result indicates that the candidate data block is non-duplicate data, performing duplicate verification on the candidate data block through the second-level fingerprint index table to obtain the second verification result;

[0010] Marking for deduplication the candidate data blocks for which the first verification result or the second verification result indicates duplicate data.

[0011] In this embodiment, the screening stage based on the first-level fingerprint index table intercepts most non-duplicate data blocks, reducing the amount of candidate data for subsequent high-overhead processing; the anchor index table performs duplicate verification on the candidate data blocks, and the strong fingerprint verification stage based on the second-level fingerprint index table performs cryptographic hash calculation on only a small number of candidate data blocks, which can improve the efficiency of online deduplication.

[0012] Optionally, determining the set of anchor feature values corresponding to the candidate data block includes:

[0013] Based on the local entropy value, the proportion of consecutive repeating segments, and the semantic feature type of each sub-region data corresponding to the candidate data block, the anchor point adaptation value corresponding to each sub-region data is determined. The anchor point adaptation value is used to indicate the degree of adaptation between the sub-region data and the anchor point of the candidate data block.

[0014] Based on the anchor adaptation values corresponding to the data in each sub-region, multiple target anchor data corresponding to the candidate data block are determined from the multiple sub-region data, and the multiple target anchor data are hashed to obtain the anchor feature value set.

[0015] In this embodiment, high-discrimination regions in candidate data blocks can be accurately identified, effectively improving the uniqueness and discriminability of the anchor point feature value set, thereby reducing the misjudgment rate in the anchor point verification stage and improving the reliability of online deduplication.

[0016] Optionally, for any sub-region data, determining the anchor point adaptation value corresponding to the sub-region data based on its local entropy value, proportion of consecutive repeating segments, and semantic feature type includes:

[0017] An information entropy calculation is performed on the sub-region data to obtain the local entropy value corresponding to the sub-region data.

[0018] The ratio of the total length of consecutive identical bytes in the sub-region data to the total length of the sub-region is determined as the proportion of consecutive repeating segments in the sub-region data.

[0019] By pre-setting semantic feature rules, the semantic feature type corresponding to the sub-region data is determined, and the semantic feature type includes semantic feature regions and non-semantic feature regions;

[0020] Based on the anchor point adaptation rules, the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of the sub-region data are processed to obtain the anchor point adaptation value.

[0021] In this embodiment, by extracting local entropy values, the proportion of continuous repeating segments, and semantic feature type features from candidate data blocks by region, and combining them with anchor point adaptation rules based on weighted scoring, the anchor point adaptation value is quantitatively calculated, so that the selection of anchor point extraction regions has a clear quantitative basis, thereby improving the reliability of online deduplication.
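The scoring described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the weights, the normalization of entropy to the range 0 to 1, and the magic-number rule used as a "semantic feature rule" are all assumptions introduced here, as are the function names.

```python
import math
from collections import Counter


def local_entropy(region: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not region:
        return 0.0
    total = len(region)
    counts = Counter(region)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def repeat_proportion(region: bytes) -> float:
    """Share of bytes that belong to a run of two or more identical bytes."""
    if not region:
        return 0.0
    repeated = 0
    run = 1
    for prev, cur in zip(region, region[1:]):
        if cur == prev:
            run += 1
        else:
            if run > 1:
                repeated += run
            run = 1
    if run > 1:
        repeated += run
    return repeated / len(region)


def is_semantic(region: bytes) -> bool:
    """Hypothetical semantic rule: the region starts with a known file signature."""
    return region.startswith((b"\x89PNG", b"%PDF", b"PK\x03\x04"))


def anchor_adaptation_value(region: bytes,
                            w_entropy: float = 0.5,
                            w_repeat: float = 0.3,
                            w_semantic: float = 0.2) -> float:
    """Weighted score: high entropy, few repeated runs, and semantic regions score higher."""
    return (w_entropy * local_entropy(region) / 8.0
            + w_repeat * (1.0 - repeat_proportion(region))
            + w_semantic * (1.0 if is_semantic(region) else 0.0))
```

Under these assumed weights, a region of all-zero bytes scores 0.0 and a region containing every byte value once scores 0.8, so the second would be preferred as an anchor extraction region.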

[0022] Optionally, the anchor point adaptation rule includes multiple adaptation sub-rules; based on the anchor point adaptation rule, data processing is performed on the local entropy value, the proportion of consecutive repeating segments, and the semantic feature type of the sub-region data to obtain the anchor point adaptation value, including:

[0023] Determine the service type and current load status corresponding to the target data stream;

[0024] Based on the business type and the current load status, determine the target adaptation sub-rule from the plurality of adaptation sub-rules;

[0025] Based on the target adaptation sub-rules, the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of the sub-region data are processed to obtain the anchor point adaptation value.

[0026] In this embodiment, by pre-setting multiple sets of differentiated anchor point adaptation sub-rules, the calculation of anchor point adaptation values is no longer limited to fixed rules. This can reduce computational overhead in business scenarios with high load and low latency requirements, ensure the processing performance and throughput of the entire data deduplication process, and improve the accuracy of anchor point extraction.
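One way to picture the sub-rule selection is a lookup keyed by business type and load status. The sketch below is hypothetical throughout: the business type names, the CPU-utilization threshold used to derive the load status, and the weight values are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubRule:
    w_entropy: float
    w_repeat: float
    w_semantic: float  # 0.0 means the semantic check is skipped to save CPU


# Hypothetical rule table: (business type, load status) -> adaptation sub-rule.
SUB_RULES = {
    ("backup", "low"): SubRule(0.4, 0.3, 0.3),
    ("backup", "high"): SubRule(0.6, 0.4, 0.0),
    ("realtime", "low"): SubRule(0.5, 0.3, 0.2),
    ("realtime", "high"): SubRule(0.7, 0.3, 0.0),
}
DEFAULT_RULE = SubRule(0.5, 0.3, 0.2)


def load_status(cpu_utilization: float, high_mark: float = 0.75) -> str:
    """Classify the current load from a utilization figure between 0 and 1."""
    return "high" if cpu_utilization >= high_mark else "low"


def select_sub_rule(business_type: str, cpu_utilization: float) -> SubRule:
    """Pick the target sub-rule, falling back to a default for unknown types."""
    key = (business_type, load_status(cpu_utilization))
    return SUB_RULES.get(key, DEFAULT_RULE)


def score(rule: SubRule, entropy_norm: float, repeat_prop: float, semantic: bool) -> float:
    """Apply the selected sub-rule to the three sub-region features."""
    return (rule.w_entropy * entropy_norm
            + rule.w_repeat * (1.0 - repeat_prop)
            + rule.w_semantic * (1.0 if semantic else 0.0))
```

In this sketch the high-load rules set the semantic weight to zero, which models dropping the most expensive feature when latency matters most.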

[0027] Optionally, determining multiple target anchor point data corresponding to the candidate data block from the multiple sub-region data, based on the anchor point adaptation values corresponding to each sub-region data, includes:

[0028] Based on the anchor point adaptation value corresponding to each sub-region data, multiple candidate anchor point data corresponding to the candidate data block are determined. The multiple candidate anchor point data are in different regions of the candidate data block, and the anchor point adaptation value of the candidate anchor point data is greater than a preset adaptation value.

[0029] The number of anchor points for the multiple target anchor point data is determined based on the total length of the candidate data blocks.

[0030] Based on the number of anchor points, the target anchor point data is obtained from the candidate anchor point data.

[0031] In this embodiment, the anchor point extraction region can be made more accurate and adaptable by quantitative filtering and dynamic quantity adjustment based on the anchor point adaptation value, thereby improving the distinguishability and accuracy of the anchor point verification stage and reducing the system computational overhead.
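The filtering and dynamic count adjustment can be sketched as below. The threshold of 0.6, the rule of one anchor per 2048 bytes, the cap of 8 anchors, and the use of fixed-size sub-regions are assumptions made for illustration; CRC32 is used for the hashing step because the description names it as an example of a lightweight hash.

```python
import zlib


def anchor_count(block_length: int, bytes_per_anchor: int = 2048, max_anchors: int = 8) -> int:
    """Number of anchors grows with block length, bounded to the range 1..max_anchors."""
    return max(1, min(max_anchors, block_length // bytes_per_anchor))


def select_target_anchors(region_scores, block_length: int, min_score: float = 0.6):
    """Return indices of the best-scoring sub-regions above the preset threshold."""
    candidates = [(i, s) for i, s in enumerate(region_scores) if s > min_score]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    chosen = candidates[:anchor_count(block_length)]
    # Each index is a distinct sub-region, so the anchors never overlap.
    return sorted(i for i, _ in chosen)


def anchor_feature_values(block: bytes, region_size: int, region_scores, min_score: float = 0.6):
    """Hash each selected sub-region to build the anchor feature value set."""
    indices = select_target_anchors(region_scores, len(block), min_score)
    return frozenset(
        zlib.crc32(block[i * region_size:(i + 1) * region_size]) for i in indices
    )
```

For a 4096-byte block with scores `[0.9, 0.2, 0.7, 0.65, 0.8]`, two anchors are allowed and sub-regions 0 and 4 are selected.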

[0032] Optionally, segmenting the target data stream to obtain multiple data blocks to be stored includes:

[0033] Based on a preset sliding window, determine the local section repetition rate and hash value distribution dispersion of the target data stream;

[0034] Based on the local section repetition rate and hash value distribution dispersion, the preset sliding window and preset segmentation boundary threshold are adjusted respectively to obtain the adjusted sliding window and adjusted segmentation boundary threshold.

[0035] Based on the adjusted sliding window and the adjusted segmentation boundary threshold, the target data stream is segmented to obtain the multiple data blocks to be stored.

[0036] In this embodiment, a dynamic parameter adaptation mechanism based on local features of the data stream is used to achieve accurate matching between block granularity and data content features, thereby improving the duplicate data identification rate and overall deduplication efficiency, while reducing the risk of deduplication failure caused by local changes in the data.

[0037] Optionally, the first-layer fingerprint index table includes first fingerprint values of multiple stored data blocks; determining candidate data blocks among the multiple data blocks to be stored based on the first-layer fingerprint index table includes:

[0038] Determine the first fingerprint value of each of the plurality of data blocks to be stored;

[0039] For any data block to be stored, the first fingerprint value of the data block to be stored is matched with the first fingerprint values of the plurality of stored data blocks to obtain the matching result corresponding to the data block to be stored.

[0040] The data blocks to be stored that are successfully matched are determined as the candidate data blocks.

[0041] In this embodiment, by quickly matching and filtering the first-layer fingerprint index table after segmentation, most non-duplicate data blocks can be quickly intercepted with extremely low computational and time overhead. Only the successfully matched suspected duplicate data blocks are identified as candidate data blocks, reducing the amount of candidate data for anchor verification and strong fingerprint verification, and avoiding meaningless high-overhead computation that occupies system resources.

[0042] In a second aspect, embodiments of this application provide a data deduplication device, including a block processing module, a first determining module, a second determining module, a first verification module, a second verification module, and a deduplication marking module:

[0043] The block processing module is used to perform block processing on the target data stream to obtain multiple data blocks to be stored.

[0044] The first determining module is used to determine candidate data blocks among the plurality of data blocks to be stored based on the first-layer fingerprint index table;

[0045] The second determining module is used to determine the set of anchor point feature values corresponding to the candidate data block;

[0046] The first verification module is used to perform repeatability verification on the candidate data block according to the anchor point feature value set corresponding to the candidate data block and through the anchor point index table to obtain a first verification result;

[0047] The second verification module is used to, if the first verification result indicates that the candidate data block is non-repeating data, perform repeatability verification on the candidate data block through the second-layer fingerprint index table to obtain a second verification result;

[0048] The deduplication marking module is used to mark for deduplication the candidate data blocks whose first verification result or second verification result indicates duplicate data.

[0049] In a third aspect, embodiments of this application provide an electronic device, including: a processor, and a memory communicatively connected to the processor;

[0050] The memory stores computer-executable instructions;

[0051] The processor executes the computer-executable instructions stored in the memory to implement the method described in any implementation of the first aspect.

[0052] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the method described in the first aspect.

[0053] In a fifth aspect, this application provides a computer program product, including a computer program that, when executed by a processor, implements the method described in any implementation of the first aspect.

[0054] The data deduplication method, apparatus, device, storage medium, and program provided in this application can segment a target data stream to obtain multiple data blocks to be stored; based on a first-layer fingerprint index table, candidate data blocks are determined from the multiple data blocks to be stored; the set of anchor point feature values corresponding to the candidate data blocks is determined, and based on that set, the candidate data blocks are verified for duplication through the anchor point index table to obtain a first verification result; if the first verification result indicates that the candidate data block is non-duplicate data, then the candidate data block is verified for duplication through a second-layer fingerprint index table to obtain a second verification result; and candidate data blocks whose first verification result or second verification result indicates duplicate data are marked for deduplication. The first-layer fingerprint index table screening stage intercepts most non-duplicate data blocks, reducing the amount of candidate data for subsequent high-overhead processing; the anchor point index table verifies the candidate data blocks for duplication; and the strong fingerprint verification stage based on the second-layer fingerprint index table performs cryptographic hash calculations on only a small number of candidate data blocks, which can improve the efficiency of online deduplication.

Attached Figure Description

[0055] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0056] Figure 1 is a schematic diagram of an application scenario provided in an embodiment of this application;

[0057] Figure 2 is a flowchart of a data deduplication method provided in an embodiment of this application;

[0058] Figure 3 is a schematic diagram of data changes provided in an embodiment of this application;

[0059] Figure 4 is a flowchart of another data deduplication method provided in an embodiment of this application;

[0060] Figure 5 is a deployment diagram of a data deduplication architecture provided in an embodiment of this application;

[0061] Figure 6 is a schematic diagram of the architecture of a data deduplication method provided in an embodiment of this application;

[0062] Figure 7 is a schematic diagram of the structure of a data deduplication device provided in an embodiment of this application;

[0063] Figure 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

[0064] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.

Detailed Implementation

[0065] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0066] Figure 1 is a schematic diagram of an application scenario provided in an embodiment of this application. Referring to Figure 1, this application is applicable to scenarios such as enterprise-level backup storage, cloud computing, real-time data archiving, and distributed storage systems, and the scenario includes user equipment 101, server 102, and storage device 103. After receiving the target data stream from user equipment 101, server 102 determines multiple data blocks to be stored corresponding to the target data stream. Based on the first-level fingerprint index table (i.e., the L1 weak fingerprint index table), it can determine candidate data blocks among the multiple data blocks to be stored. The candidate data blocks are suspected duplicate data blocks.

[0067] Server 102 can determine the set of anchor feature values corresponding to the candidate data blocks and, according to that set, perform duplicate verification on the candidate data blocks through the anchor index table (i.e., the L2 anchor index table) to obtain a first verification result. If the first verification result indicates that a candidate data block is non-duplicate data, duplicate verification is performed on that candidate data block through the second-level fingerprint index table (i.e., the L3 strong fingerprint index table) to obtain a second verification result. Candidate data blocks whose first verification result or second verification result indicates duplicate data are marked for deduplication, and the other data blocks to be stored are stored in storage device 103. Server 102 can be an application server, an edge node, etc.

[0068] In related technologies, fingerprint calculations can be performed on data blocks using hash algorithms (such as cryptographic hashing or non-cryptographic hashing) to obtain the fingerprint value of each data block, and deduplication can be achieved by comparing the fingerprint values. Cryptographic hashing has the advantage of strong collision resistance, but its high computational complexity leads to high CPU resource consumption. Non-cryptographic hashing is fast but has a high probability of false positives, requiring additional byte-level verification or secondary hash calculations to reduce the risk of false positives; this introduces additional performance overhead and makes online deduplication inefficient.

[0069] The data deduplication method provided in this application intercepts most non-duplicate data blocks in the first-level fingerprint index table screening stage, reducing the amount of candidate data for subsequent high-overhead processing; it performs duplicate verification on candidate data blocks through the anchor index table, and performs cryptographic hash calculation on only a small number of candidate data blocks in the second-level fingerprint index table strong fingerprint verification stage, which can improve the efficiency of online deduplication.

[0070] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.

[0071] Figure 2 is a flowchart of a data deduplication method provided in an embodiment of this application. Referring to Figure 2, the method may include:

[0072] S201. The target data stream is segmented to obtain multiple data blocks to be stored.

[0073] The execution entity of this application embodiment can be a server or a data deduplication device installed on the server. The data deduplication device can be implemented by software or by a combination of software and hardware.

[0074] The target data stream can refer to the raw data stream, which is the raw binary data stream that has not been processed, compressed, encrypted, or formatted. It is the initial input object of the data deduplication process.

[0075] Examples include user-uploaded complete file streams, real-time database backup streams, virtual machine disk image streams, raw data packet streams transmitted over the network, and raw audio and video acquisition data streams.

[0076] Multiple data blocks to be stored are unequal-length data blocks dynamically determined by the Content-Defined Chunking (CDC) algorithm.

[0077] Content-Defined Chunking (CDC) is a dynamic chunking algorithm based on data content characteristics. It can traverse the data stream through a sliding window, calculate the hash value of the data within the window, and determine the current position as the chunking boundary when the hash value meets the preset chunking boundary determination rules (such as the last N bits of the hash value being 0), thereby generating data blocks of unequal length.
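A minimal sketch of content-defined chunking in this style follows. The polynomial rolling hash, the 16-byte window, the 6-bit boundary mask, and the minimum and maximum chunk sizes are assumptions chosen for illustration; production systems typically use Rabin fingerprints or a Gear hash with much larger sizes.

```python
def cdc_chunks(data: bytes, window: int = 16, mask_bits: int = 6,
               min_size: int = 32, max_size: int = 1024):
    """Split data at positions where the low mask_bits bits of a rolling hash are zero."""
    base = 257
    mod = (1 << 31) - 1
    mask = (1 << mask_bits) - 1
    top = pow(base, window - 1, mod)  # weight of the byte that leaves the window

    chunks = []
    start = 0  # index of the first byte of the current chunk
    h = 0      # rolling hash of the last `window` bytes of the current chunk
    for i, byte in enumerate(data):
        if i - start >= window:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        size = i - start + 1
        at_boundary = size >= min_size and (h & mask) == 0
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than min_size
    return chunks
```

Because boundaries depend on the bytes inside the window rather than on absolute offsets, an insertion early in the stream tends to shift only nearby boundaries, which is the property that makes unequal-length chunks useful for deduplication.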

[0078] S202. Based on the first-level fingerprint index table, determine candidate data blocks among multiple data blocks to be stored.

[0079] The first-level fingerprint index table is a high-speed hash index structure in memory (L1 fingerprint index table), which is used to store lightweight fingerprint information of stored data blocks to achieve high-speed query and comparison.

[0080] The first-level fingerprint index table may include the first fingerprint values of multiple stored data blocks, as well as the mapping relationship between the first fingerprint values and the corresponding data block metadata (such as storage address and reference count).

[0081] The first fingerprint value refers to the weak fingerprint, which is a short-length hash value generated by a non-cryptographic hash algorithm with low computational overhead and high throughput. It is used for quick preliminary screening of data blocks to be stored.

[0082] The first fingerprint value is obtained by performing a lightweight, non-cryptographic hash calculation on each data block. Specifically, a high-speed non-cryptographic hash function, such as xxHash or MurmurHash, is used to traverse all bytes of the data block and calculate a fixed-length hash value as the first fingerprint value.

[0083] Candidate data blocks are those data blocks to be stored whose first fingerprint value matches (hits) an existing first fingerprint value in the first-level fingerprint index table.

[0084] A candidate data block is treated as a suspected duplicate data block and enters the subsequent anchor feature verification and strong fingerprint verification stages. If the first fingerprint value of a data block to be stored is not matched in the first-level fingerprint index table, the data block is determined to be a non-duplicate data block and directly enters the storage writing process.
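The L1 screening step can be sketched with an in-memory dictionary. Note the substitution: `zlib.crc32` from the Python standard library stands in for xxHash or MurmurHash, which are not in the standard library. The class and method names are hypothetical.

```python
import zlib


class WeakFingerprintIndex:
    """In-memory L1 index: first fingerprint value -> data block metadata."""

    def __init__(self):
        self._table = {}

    @staticmethod
    def fingerprint(block: bytes) -> int:
        # Stand-in for a fast non-cryptographic hash such as xxHash or MurmurHash.
        return zlib.crc32(block)

    def add(self, block: bytes, address: int) -> None:
        self._table[self.fingerprint(block)] = {"address": address, "ref_count": 1}

    def screen(self, blocks):
        """Split blocks into suspected duplicates (hits) and new blocks (misses)."""
        candidates, new_blocks = [], []
        for block in blocks:
            if self.fingerprint(block) in self._table:
                candidates.append(block)
            else:
                new_blocks.append(block)
        return candidates, new_blocks
```

Only the `candidates` list proceeds to anchor verification; `new_blocks` can be written to storage without any further hashing.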

[0085] S203. Determine the set of anchor feature values corresponding to the candidate data block, and perform duplicate verification on the candidate data block through the anchor index table based on that set to obtain the first verification result.

[0086] Anchor point features are highly discriminative local features extracted from data blocks. They can be generated by extracting fixed-length byte sequences from high-quality feature regions (high-entropy regions, semantic feature regions, etc.) of candidate data blocks and hashing them with a lightweight hash algorithm (such as CRC32), so that the duplication of data blocks can be verified at low cost and with high precision.

[0087] Examples of anchor point features include Content-Defined Chunking (CDC) boundary points, specific byte samples, key field fragments of structured data, and feature segments of unstructured data (such as image header information).

[0088] Anchor index tables are high-speed index structures used to store the mapping relationship between anchor feature values of stored data blocks and corresponding data block metadata (such as storage address and global features). They are usually deployed in memory and support millisecond-level fast query and comparison.

[0089] The first verification result is either that the candidate data block is non-duplicate data or that the candidate data block is duplicate data.

[0090] If the set of anchor feature values corresponding to a candidate data block matches an entry in the anchor index table, the first verification result is that the candidate data block is duplicate data. Otherwise, the first verification result is that the candidate data block is non-duplicate data.
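The anchor verification step can be sketched as a lookup keyed by the whole anchor feature value set. As a simplification, the sketch samples anchors at fixed offsets rather than from scored high-entropy or semantic regions; the offsets, the 16-byte anchor length, and the class name are assumptions.

```python
import zlib


def anchor_values(block: bytes, offsets=(0, 64, 128), length: int = 16):
    """CRC32 of a fixed-length byte sequence at each offset that lies inside the block."""
    return frozenset(
        (off, zlib.crc32(block[off:off + length]))
        for off in offsets
        if off < len(block)
    )


class AnchorIndex:
    """In-memory L2 index: anchor feature value set -> data block metadata."""

    def __init__(self):
        self._table = {}

    def add(self, block: bytes, address: int) -> None:
        self._table[anchor_values(block)] = {"address": address}

    def verify(self, block: bytes):
        """Return metadata if the anchor set matches (duplicate), otherwise None."""
        return self._table.get(anchor_values(block))
```

A block that differs from a stored block inside any sampled anchor yields a different set and is reported as non-duplicate, which sends it on to strong fingerprint verification.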

[0091] S204. If the first verification result of the candidate data block is that the candidate data block is non-repeating data, then the candidate data block is verified for repetition through the second-level fingerprint index table to obtain the second verification result.

[0092] If the first verification result indicates that the candidate data block is non-duplicate data, that is, the candidate data block may be non-duplicate data after the second-layer (anchor) screening, duplicate verification is performed again through the second-level fingerprint index table. This improves the reliability of verification and avoids missed or false judgments.

[0093] The second-level fingerprint index table (i.e., the L3 strong fingerprint index table) is a global index structure used to store the mapping relationship between the second fingerprint value and data block metadata (such as storage address and reference count). It supports a hot and cold tiered storage strategy (the strong fingerprints of hot data are stored in memory, and the strong fingerprints of cold data are stored on high-speed disks), balancing query efficiency and storage cost.

[0094] The hot end index can store the strong fingerprints and corresponding metadata of the hot data accessed in the last 30 days in a distributed memory cluster, enabling high-speed query and matching at the millisecond level and meeting the low latency requirements in high-concurrency scenarios.

[0095] The cold-end index holds the strong fingerprints and corresponding metadata of cold data that has not been accessed for more than 30 days. It is persistently stored in a high-speed SSD disk array or object storage. Query performance is optimized by index partitioning and prefix hash routing, which reduces storage costs while keeping the cold data index accessible.

[0096] The dynamic hot and cold index scheduling feature has a built-in heat perception module. By monitoring the query frequency and access time of strong fingerprints in real time, it automatically migrates cold-end indexes with increased popularity to hot-end memory and sinks hot-end indexes with decreased popularity to cold-end disks, realizing dynamic allocation and efficient utilization of index resources, while balancing query efficiency and storage costs.
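The hot and cold scheduling can be sketched with two dictionaries and explicit timestamps. This is a simplification under stated assumptions: the cold tier is an in-memory dictionary rather than SSD or object storage, promotion happens on any access rather than on a measured rise in query frequency, and the class and method names are hypothetical.

```python
DAY = 86400  # seconds


class TieredFingerprintIndex:
    """Strong fingerprint index split into a hot tier and a cold tier."""

    def __init__(self, hot_window_days: int = 30):
        self.hot = {}   # fingerprint -> (metadata, last access time)
        self.cold = {}  # stands in for SSD or object storage
        self.hot_window = hot_window_days * DAY

    def put(self, fingerprint: str, metadata: dict, now: float) -> None:
        self.hot[fingerprint] = (metadata, now)

    def lookup(self, fingerprint: str, now: float):
        """Return metadata or None; a cold hit is promoted back to the hot tier."""
        if fingerprint in self.hot:
            metadata, _ = self.hot[fingerprint]
            self.hot[fingerprint] = (metadata, now)
            return metadata
        if fingerprint in self.cold:
            metadata, _ = self.cold.pop(fingerprint)
            self.hot[fingerprint] = (metadata, now)
            return metadata
        return None

    def demote_stale(self, now: float) -> int:
        """Move entries not accessed within the hot window to the cold tier."""
        stale = [fp for fp, (_, last) in self.hot.items()
                 if now - last > self.hot_window]
        for fp in stale:
            self.cold[fp] = self.hot.pop(fp)
        return len(stale)
```

Passing `now` explicitly keeps the sketch deterministic; a real deployment would read a clock and run `demote_stale` periodically.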

[0097] The second fingerprint value refers to the strong fingerprint, a data block identifier generated using a cryptographic hash algorithm (for example, SHA-256 or SM3) and used for final confirmation of whether the data block is unique.

[0098] Calculate a strong fingerprint (second fingerprint value) for the complete byte content of the candidate data block, and check if the same second fingerprint value exists in the second-level fingerprint index table; if it matches, the second verification result is duplicate data; if it does not match, the second verification result is non-duplicate data.
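The strong fingerprint check can be sketched with SHA-256 from the Python standard library; SM3 is omitted because its availability in `hashlib` depends on the underlying OpenSSL build. The class name and metadata layout are hypothetical.

```python
import hashlib


class StrongFingerprintIndex:
    """L3 index: second fingerprint value (SHA-256 hex digest) -> metadata."""

    def __init__(self):
        self._table = {}

    @staticmethod
    def fingerprint(block: bytes) -> str:
        # Hash the complete byte content of the block.
        return hashlib.sha256(block).hexdigest()

    def add(self, block: bytes, address: int) -> None:
        self._table[self.fingerprint(block)] = {"address": address, "ref_count": 1}

    def verify(self, block: bytes):
        """Return metadata when the strong fingerprint matches (duplicate), else None."""
        return self._table.get(self.fingerprint(block))
```

A non-`None` result corresponds to a second verification result of duplicate data; `None` corresponds to non-duplicate data.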

[0099] S205. Mark for deduplication the candidate data blocks whose first verification result or second verification result indicates duplicate data.

[0100] After verification using the L2 anchor index table, if the first verification result of a candidate data block is duplicate data, the data block can be directly deduplicated without entering the third layer strong fingerprint verification.

[0101] Deduplication marking means that the candidate data block is not physically stored; instead, the reference count of the corresponding already stored data block (which records the number of times the data block has been reused) is updated. When the candidate data block is accessed later, the access is directed to the already stored data block, thus achieving data block reuse and saving storage space.

[0102] If the first verification result is non-duplicate data while the second verification result is duplicate data, the candidate data block is identified as duplicate data and marked for deduplication, and its second fingerprint value and anchor feature value set can be added to the second-level fingerprint index table (the L3 index table) and the anchor index table, respectively, to complete the index data.

[0103] If both the first and second verification results indicate that the candidate data block is non-duplicate, the candidate data block is identified as non-duplicate data and stored in the storage device, and its first fingerprint value, anchor point feature value set, and second fingerprint value are synchronously written into the first-level fingerprint index table, the anchor point index table, and the second-level fingerprint index table (the L3 index table), respectively, to provide index support for subsequent data block verification.
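For illustration only, the outcome handling of S205 can be sketched in Python as follows. The function name mark_candidate, the dictionary-based index, and the "refcount" field are hypothetical; SHA-256 is used because it is one of the strong fingerprint examples named in this embodiment. A real implementation would also maintain the first-level fingerprint index table and the anchor index table, which are omitted here.

```python
import hashlib
from typing import Dict, Optional


def mark_candidate(block: bytes,
                   l2_match: Optional[dict],
                   l3_index: Dict[str, dict]) -> str:
    """Return "dedup" or "store" for a candidate block that already passed L1.

    l2_match is the metadata record of the stored block matched by the anchor
    verification, or None if the first verification result is non-duplicate.
    """
    if l2_match is not None:
        # First verification result is duplicate: no strong fingerprint needed.
        l2_match["refcount"] += 1
        return "dedup"

    # Strong fingerprint over the complete byte content of the block.
    fingerprint = hashlib.sha256(block).hexdigest()
    record = l3_index.get(fingerprint)
    if record is not None:
        # Second verification result is duplicate: reuse the stored block.
        record["refcount"] += 1
        return "dedup"

    # Both results are non-duplicate: store the block and index it.
    l3_index[fingerprint] = {"refcount": 1}
    return "store"
```

In this sketch the strong fingerprint is computed only on the path where the anchor verification did not already confirm a duplicate, which is the cost saving the layered mechanism aims for.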

[0104] The data deduplication method provided in this application uses a layered fingerprint processing mechanism (weak fingerprint at L1, anchor feature at L2, strong fingerprint at L3). The L1 weak fingerprint screening stage intercepts most non-duplicate data blocks through fast hash calculation, reducing the amount of candidate data for subsequent high-overhead processing. The L2 anchor feature verification stage further reduces the false positive rate through local feature value matching and byte-level verification, avoiding unnecessary strong fingerprint calculations. The L3 strong fingerprint verification stage performs cryptographic hash calculations only on a small number of candidate data blocks, ensuring data consistency. In performance-priority scenarios, this significantly improves data throughput; in security-priority scenarios, the final strong fingerprint verification ensures data consistency. The layered mechanism thus adapts to different business needs, achieving a dynamic balance between performance and security and improving the efficiency and accuracy of online deduplication.

[0105] Figure 3 is a schematic diagram illustrating a data change provided in an embodiment of this application. Referring to Figure 3, the data stream layer consists of a complete raw binary data stream (e.g., "aabbbccccccddddxxxxxxxxxxxxxxxxxxxxxxxxxx" in the example), which serves as the initial input for hierarchical processing. It contains repetitive feature regions (e.g., consecutive "a", "b", "c" segments) and highly heterogeneous regions (e.g., consecutive "x" segments). In the L1 layer (weak fingerprint filtering layer), content-defined chunking (CDC) is performed on the raw data stream, generating data blocks to be stored of varying lengths (e.g., variable-length blocks such as "aabbbccccc" in the example). A weak fingerprint is then calculated for each block and compared against the L1 index table to pick out suspected duplicate candidate data blocks. Only these candidate blocks proceed to the subsequent high-overhead verification, while most non-duplicate data blocks are intercepted.

[0106] In layer L2 (anchor verification layer), for candidate data blocks output from layer L1, high-discrimination anchor feature values are dynamically extracted based on features such as local entropy and byte distribution (e.g., extracting local feature segments such as "a", "a", "bbb", and "c" from "aabbbccccc" in the example). Multi-anchor collaborative matching is then performed using the anchor index table to further verify whether the data block is a duplicate and to reduce the number of candidates entering layer L3. In layer L3 (strong fingerprint verification layer), only candidate data blocks that fail layer L2 verification are processed: a strong fingerprint of their complete content is calculated (e.g., calculating strong fingerprints such as "aa" and "bb" for local segments in the example), and the final uniqueness verification is completed using the L3 index table to ensure data consistency.

[0107] Figure 4 is a flowchart illustrating another data deduplication method provided in an embodiment of this application. Referring to Figure 4, the method may include:

[0108] S401. Determine the local byte repetition rate and the hash value distribution dispersion of the target data stream according to the preset sliding window.

[0109] The preset sliding window refers to the fixed-length byte window (e.g., 16B) initially set by the CDC algorithm, which is used to scan the target data stream segment by segment, providing a basic granularity for feature calculation.

[0110] The preset sliding window slides continuously along the byte sequence of the target data stream, with each sliding step being the same as the window length (no overlapping scan), ensuring full coverage of the data stream.

[0111] Local byte repetition rate refers to the ratio of the total length of consecutively repeating byte segments and high-frequency similar bytes within the current preset sliding window coverage area to the total length of the window bytes. It is used to quantify the repetition density of local data regions. The higher the ratio, the stronger the repetition of the content in that region, and vice versa.

[0112] Specifically, all bytes within the current sliding window are traversed, and segments composed of consecutive identical bytes (such as "aaa" and "bbb" in "aaabbb") are identified. The total number of bytes in all consecutive repeating segments is counted (denoted as L_rep), and the local byte repetition rate is calculated as R = L_rep / window length. R ranges from 0 to 1; the closer R is to 1, the more repetitive the content within the window.
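A minimal Python sketch of this calculation is given below for illustration. It assumes that a run of two or more identical consecutive bytes counts as a repeating segment, and it ignores the "high-frequency similar bytes" component of the definition, which the text does not specify. The function name is hypothetical.

```python
def local_repetition_rate(window: bytes) -> float:
    """Ratio of bytes that belong to runs of identical consecutive bytes."""
    if not window:
        return 0.0
    repeated = 0   # total bytes in runs of length >= 2 (assumed definition)
    run = 1        # length of the run that ends at the current byte
    for prev, cur in zip(window, window[1:]):
        if cur == prev:
            run += 1
        else:
            if run >= 2:
                repeated += run
            run = 1
    if run >= 2:   # close the final run
        repeated += run
    return repeated / len(window)
```

For the window "aaabbb" from the example, both runs are counted and the rate is 1; for a window with no adjacent equal bytes the rate is 0.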

[0113] Hash value distribution dispersion refers to the variance of the window hash values over the most recent M (e.g., 10) consecutive sliding windows, and is used to reflect the regularity of the byte distribution.

[0114] Specifically, for the bytes within each preset sliding window, a lightweight hash algorithm (such as xxHash32) is used to calculate the window hash value (denoted as $H_m$). The average of the M most recent window hash values is calculated (denoted as $\bar{H}$), where M is an integer greater than or equal to 1, and the dispersion is calculated using the variance formula: $D = \frac{1}{M}\sum_{m=1}^{M}\left(H_m - \bar{H}\right)^2$, where $H_m$ is the hash value corresponding to the m-th sliding window, m is 1, 2, ..., M, and D takes values in the range... The closer D is to 0, the more regular the byte distribution and the higher the repetition.

[0115] The target data stream is traversed byte by byte using a preset sliding window. The local byte repetition rate is calculated in real time for each window. At the same time, the hash values ​​of M consecutive windows are statistically calculated to obtain the hash value distribution dispersion. The two indicators are associated with the position offset of the data stream and stored.

[0116] S402. Based on the local byte repetition rate and the hash value distribution dispersion, adjust the preset sliding window and the preset segmentation boundary threshold respectively to obtain the adjusted sliding window and the adjusted segmentation boundary threshold.

[0117] Constraining the adjustment range of the sliding window and the segmentation boundary threshold avoids excessive computational cost when the window is too narrow and overly coarse feature capture when the window is too wide. The more stringent the segmentation boundary threshold, the finer the segmentation.

[0118] For example, the window length of the sliding window is adjusted within a bounded range, and the segmentation boundary threshold is adjusted within the range from "the last 4 bits of the hash value are 0" to "the last 7 bits of the hash value are 0".

[0119] The adjustment step size can be ±8B for each sliding window and ±1 bit for each block boundary threshold, allowing for fine-grained adjustment and avoiding sudden parameter changes.

[0120] The rules for adjusting the sliding window are shown in Table 1.

[0121] Table 1

[0122]

[0123] The block boundary threshold refers to the hash condition used in the CDC algorithm to determine the block boundary (e.g., the boundary is determined when the last 5 bits of the window hash value are 0). The adjustment rule is linked to the sliding window. For high repetition scenarios (R≥60% and D≤0.2): relax the threshold (e.g., adjust from the last 5 bits being 0 to the last 4 bits being 0) to reduce the boundary triggering frequency and achieve duplicate block merging. For low repetition scenarios (R<30% and D>0.8): tighten the threshold (e.g., adjust from the last 5 bits being 0 to the last 6 bits being 0) to increase the boundary triggering frequency and refine the block granularity. For medium feature scenarios: keep the preset threshold unchanged (e.g., "the last 5 bits are 0").
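The threshold adjustment rule can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not a definitive implementation: the function and parameter names are hypothetical, the step of 1 bit and the range of 4 to 7 bits are taken from the examples in this embodiment, and R and D are assumed to be already computed and expressed on a 0-1 scale.

```python
def adjust_boundary_bits(r: float, d: float, preset_bits: int = 5,
                         min_bits: int = 4, max_bits: int = 7) -> int:
    """Return the number of trailing zero bits required for a block boundary.

    r: local byte repetition rate (0-1)
    d: hash value distribution dispersion (assumed to be on a 0-1 scale)
    """
    if r >= 0.60 and d <= 0.2:
        bits = preset_bits - 1      # high repetition: "relax" by one bit
    elif r < 0.30 and d > 0.8:
        bits = preset_bits + 1      # low repetition: "tighten" by one bit
    else:
        bits = preset_bits          # medium features: keep the preset
    # Keep the result inside the permitted adjustment range.
    return max(min_bits, min(max_bits, bits))
```

Clamping to the permitted range keeps repeated adjustments from drifting outside the 4-bit to 7-bit window described above.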

[0124] S403. Based on the adjusted sliding window and the adjusted segmentation boundary threshold, segment the target data stream to obtain multiple data blocks to be stored.

[0125] The target data stream can be traversed using the adjusted sliding window. For the bytes in the adjusted sliding window, a lightweight hash algorithm (such as xxHash32) is executed to calculate the hash value of the current window, which is then compared with the adjusted segmentation boundary threshold. If the threshold condition is met (such as the last 4 bits of the hash value being 0), the end position of the current window is determined as a block boundary. The data between the previous block boundary and the current block boundary forms a data block to be stored.

[0126] Each generated data block to be stored must satisfy the constraints of minimum block length ≥ 4KB and maximum block length ≤ 64KB. If a block length falls outside this range because of the threshold adjustment, the boundary threshold is automatically fine-tuned (e.g., the number of decision bits is reduced by 1) and the data block is regenerated.
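The segmentation step can be sketched in Python as shown below, purely as an illustration. Several simplifications are assumptions and not part of the described method: zlib.crc32 stands in for xxHash32 (which is not in the Python standard library), the length limits are scaled down from 4KB/64KB so the example is easy to run, and the length constraints are enforced by skipping early boundaries and forcing a cut at the maximum length rather than by re-tuning the threshold and regenerating the block. The function name cdc_chunks is hypothetical.

```python
import zlib
from typing import List


def cdc_chunks(data: bytes, window: int = 16, zero_bits: int = 4,
               min_len: int = 64, max_len: int = 1024) -> List[bytes]:
    """Split data at window ends whose hash has `zero_bits` trailing zero bits.

    min_len and max_len are assumed to be multiples of `window`.
    """
    mask = (1 << zero_bits) - 1
    chunks: List[bytes] = []
    start = 0
    # Non-overlapping scan: the step equals the window length.
    for end in range(window, len(data) + 1, window):
        length = end - start
        window_hash = zlib.crc32(data[end - window:end])  # stand-in hash
        at_boundary = (window_hash & mask) == 0
        if (at_boundary and length >= min_len) or length >= max_len:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder, may be short
    return chunks
```

Because boundaries depend only on window content, identical content segmented twice yields identical blocks, which is what allows the later fingerprint stages to find duplicates.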

[0127] In this application, a dynamic parameter adaptation mechanism based on local features of the data stream can be used to achieve accurate matching between the block granularity and the data content features, thereby improving the duplicate data identification rate and overall deduplication efficiency, while reducing the risk of deduplication failure caused by local changes in the data.

[0128] S404. Based on the first-level fingerprint index table, determine candidate data blocks among multiple data blocks to be stored.

[0129] Specifically, the first fingerprint value of each of the multiple data blocks to be stored is determined; for any data block to be stored, its first fingerprint value is matched against the first fingerprint values of the multiple stored data blocks to obtain the matching result corresponding to that data block; among the multiple data blocks to be stored, those whose matching result is a successful match are determined as candidate data blocks.

[0130] If a matching first fingerprint value can be found in the high-speed hash index structure of the first-level fingerprint index table, that is, the weak fingerprint of the data block to be stored is exactly the same as the weak fingerprint of a data block already stored in the index table, then it is determined that the data block to be stored has a possibility of being duplicated, and the matching result is successful; if no matching first fingerprint value is found, then the matching is determined to be unsuccessful, the data block to be stored is a non-duplicate data block, and it directly enters the subsequent storage process without participating in the subsequent high-overhead verification.

[0131] For example, suppose there are three data blocks to be stored, namely data block A, data block B, and data block C. The first fingerprint values of the three data blocks are calculated in turn as HashA, HashB, and HashC, and the three fingerprint values are then queried in the first-level fingerprint index table. If HashA matches the first fingerprint value of data block X already stored in the index table, while HashB and HashC do not match, then data block A is a successful match and is identified as a candidate data block, proceeding to the subsequent L2 anchor feature verification stage. Data blocks B and C fail to match and are directly determined to be non-duplicate data blocks that require no further verification; the storage operation is performed directly, and their first fingerprint values are added to the first-level fingerprint index table.
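The filtering in this example can be sketched in Python as follows. This is a sketch under assumptions: zlib.crc32 is used as a stand-in weak fingerprint (this passage does not name the first fingerprint algorithm), a Python set stands in for the high-speed hash index structure, and the function name l1_filter is hypothetical.

```python
import zlib
from typing import Iterable, List, Set, Tuple


def l1_filter(blocks: Iterable[bytes],
              l1_index: Set[int]) -> Tuple[List[bytes], List[bytes]]:
    """Split blocks into (candidates, non_duplicates) using weak fingerprints."""
    candidates: List[bytes] = []
    non_duplicates: List[bytes] = []
    for block in blocks:
        weak = zlib.crc32(block)        # stand-in first fingerprint value
        if weak in l1_index:
            candidates.append(block)    # suspected duplicate, goes on to L2
        else:
            non_duplicates.append(block)
            l1_index.add(weak)          # index the newly stored block
    return candidates, non_duplicates


# Usage mirroring the example: only block A has a matching weak fingerprint.
block_a, block_b, block_c = b"A" * 8, b"B" * 8, b"C" * 8
index = {zlib.crc32(block_a)}
cands, fresh = l1_filter([block_a, block_b, block_c], index)
```

A weak fingerprint match only marks a block as a suspected duplicate, so a candidate still has to pass the L2 or L3 verification before it is marked for deduplication.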

[0132] In this application, by rapidly matching and filtering through a first-level fingerprint index table after data segmentation, most non-duplicate data blocks can be quickly intercepted with extremely low computational and time overhead. Only successfully matched suspected duplicate data blocks are identified as candidate data blocks, reducing the amount of candidate data for L2 anchor verification and L3 strong fingerprint verification, and avoiding meaningless high-overhead computation that consumes system resources. Simultaneously, the fast query capability based on the high-speed, memory-level index table ensures the real-time performance and high throughput of the entire matching and filtering process, effectively improving the online processing efficiency of the entire data deduplication scheme.

[0133] S405. Determine the anchor point adaptation value corresponding to each sub-region data based on the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of each sub-region data corresponding to the candidate data block.

[0134] The anchor point adaptation value indicates how suitable the sub-region data is to serve as an anchor point of the candidate data block.

[0135] Anchor point adaptation values can be quantitative scores (such as 0-10 points). The higher the adaptation value, the stronger the uniqueness and distinguishability of the data in the sub-region, and the more suitable the sub-region is as an anchor point extraction region.

[0136] In some embodiments, information entropy processing can be performed on the sub-region data to obtain the local entropy value corresponding to the sub-region data.

[0137] Specifically, the Shannon information entropy formula can be used to calculate the local entropy value of a single sub-region by applying it to all bytes of data. The formula is:

[0138] $H_{\mathrm{local}} = -\sum_{k=0}^{255} p_k \log_2 p_k$

[0139] Where k is the byte index, used to represent any byte value within the sub-region, ranging from 0 to 255, and $p_k$ is the probability of occurrence of byte value k in the sub-region (terms with $p_k = 0$ contribute 0). The local entropy value ranges from 0 to 8 bits. A higher entropy value indicates stronger randomness and uniqueness of the bytes in the sub-region, while a lower entropy value indicates stronger repetition and regularity. A lightweight traversal method is adopted, which only calculates the probability of bytes within the sub-region, balancing computational efficiency and feature representation accuracy.
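The Shannon entropy of the byte distribution in a sub-region can be computed in a single counting pass, as in the following illustrative Python sketch. The function name local_entropy is hypothetical; byte values that do not occur are simply skipped, which is equivalent to treating their term as 0.

```python
import math
from collections import Counter


def local_entropy(sub_region: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits (0 to 8)."""
    n = len(sub_region)
    if n == 0:
        return 0.0
    counts = Counter(sub_region)  # occurrences of each byte value present
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A sub-region made of one repeated byte has entropy 0, while a sub-region in which all 256 byte values occur equally often reaches the maximum of 8 bits.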

[0140] In some embodiments, the ratio of the total length of consecutive identical bytes in the sub-region data to the total length of the sub-region can be determined as the proportion of consecutive repeating segments in the sub-region data.

[0141] Specifically, the byte sequence of a single sub-region is traversed, all segments of consecutive identical bytes are identified, and their lengths are accumulated as the total length of consecutive repetition. The ratio of this total length to the actual total byte length of the sub-region is calculated and rounded to two decimal places (values from 0 to 1). The lower the ratio, the lower the byte repetition and the higher the content differentiation in the sub-region, and vice versa.

[0142] For example, if the total length of the sub-region is 128 bytes, including 32 bytes of consecutive 0x00 bytes and 16 bytes of consecutive 0xFF bytes, then the total length of consecutive repeating bytes is 48 bytes, and the proportion of consecutive repeating bytes is 48 / 128 = 0.375.

[0143] In some embodiments, the semantic feature type corresponding to the sub-region data can be determined by pre-setting semantic feature rules. The semantic feature type includes semantic feature regions and non-semantic feature regions.

[0144] For structured data (such as financial transaction logs and medical structured reports), the rule is to identify field boundary markers and key field identifiers (such as transaction ID, patient ID, and timestamp prefixes). Sub-regions containing such identifiers are determined as semantic feature regions.

[0145] For unstructured data (such as medical images and text documents), the rule is to identify content feature segments (such as image file headers, document title segments, and byte mutation feature areas), and sub-regions containing such feature segments are determined as semantic feature regions.

[0146] Sub-regions that do not match the above rules are uniformly classified as non-semantic feature regions. Semantic feature regions, because they contain unique core information of the data, have higher value for anchor point extraction.

[0147] Finally, based on the anchor point adaptation rules, the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of the sub-region data are processed to obtain the anchor point adaptation value.

[0148] Anchor point adaptation rules can be weighted scoring rules. For example, a hierarchical weighted calculation can be performed with a maximum score of 10, with local entropy and the proportion of consecutive repeated segments as core weights, and semantic feature type as additional weights.

[0149] Under the anchor point adaptation rule, the local entropy value carries a weight of 40%, with a maximum score of 4 points. Specifically, a local entropy value ≥ 6.4 bits earns 4 points; 4.0 bits < local entropy value < 6.4 bits earns 2 points; and a local entropy value ≤ 4.0 bits earns 0 points.

[0150] Under the anchor point adaptation rule, the proportion of consecutive repeating segments carries a weight of 20%, with a maximum score of 2 points. Specifically, a proportion ≤ 20% earns 2 points; 20% < proportion ≤ 50% earns 1 point; and a proportion > 50% earns 0 points.

[0151] Under the anchor point adaptation rule, the semantic feature type carries a weight of 40%, with a maximum score of 4 points. Specifically, a semantic feature region earns 4 points, and a non-semantic feature region earns 0 points.

[0152] The anchor point adaptation value is the sum of the three scores, giving an integer score from 0 to 10 that directly reflects how suitable the sub-region is as an anchor point extraction region.
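The weighted scoring described in this embodiment can be written out as the following illustrative Python sketch. The thresholds are those stated above; the function and parameter names are hypothetical, and the inputs are assumed to have been computed beforehand for the sub-region.

```python
def anchor_adaptation_value(entropy_bits: float,
                            repeat_ratio: float,
                            is_semantic: bool) -> int:
    """Return the 0-10 anchor point adaptation value of one sub-region."""
    # Local entropy: up to 4 points.
    if entropy_bits >= 6.4:
        entropy_score = 4
    elif entropy_bits > 4.0:
        entropy_score = 2
    else:
        entropy_score = 0

    # Proportion of consecutive repeating segments: up to 2 points.
    if repeat_ratio <= 0.20:
        repeat_score = 2
    elif repeat_ratio <= 0.50:
        repeat_score = 1
    else:
        repeat_score = 0

    # Semantic feature type: 4 points for a semantic feature region.
    semantic_score = 4 if is_semantic else 0

    return entropy_score + repeat_score + semantic_score
```

For instance, a non-semantic sub-region with an entropy of 5.0 bits and a repeat proportion of 0.375 scores 2 + 1 + 0 = 3 points.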

[0153] In this application, local entropy values, the proportion of consecutive repeating segments, and semantic feature types are extracted from candidate data blocks by sub-region. Combined with the weighted-scoring anchor point adaptation rules, the anchor point adaptation values are calculated quantitatively, providing a clear quantitative basis for the selection of anchor extraction regions. Both the feature analysis and the scoring calculation are lightweight operations that introduce no high-overhead algorithms, so the anchor point adaptation values accurately represent the features of the sub-regions while processing efficiency is maintained. High-discrimination regions in candidate data blocks can thus be accurately identified, improving the uniqueness and discriminative power of the anchor feature value set, reducing the false positive rate in the L2 anchor verification stage, and enhancing the reliability of online deduplication.

[0154] Based on the above embodiments, the anchor point adaptation rule can also include multiple adaptation sub-rules. The target adaptation sub-rule can be selected from multiple adaptation sub-rules based on the business type and current load status corresponding to the target data stream, so as to realize the dynamic switching of the anchor point adaptation value calculation rule and make the anchor point extraction strategy accurately adapt to the actual business scenario and system operation status.

[0155] For example, business types can include typical enterprise-level storage scenarios such as medical imaging, financial transactions, log backup, virtual machine image backup, and audio and video data archiving.

[0156] The current load status is used to indicate the operating status of storage devices, such as high-concurrency writes, high CPU usage, low latency requirements, and high data consistency requirements.

[0157] Specifically, determine the business type and current load status corresponding to the target data stream; based on the business type and current load status, determine the target adaptation sub-rule from multiple adaptation sub-rules; based on the target adaptation sub-rule, process the local entropy value, proportion of continuous repeating segments, and semantic feature type of the sub-region data to obtain the anchor adaptation value.

[0158] The metadata of the target data stream (such as data identifier, transmission protocol, and service end identifier) can be parsed by the service identification module of the storage device to determine the service type corresponding to the target data stream. At the same time, indicators such as hardware resources and service requests of the storage device can be collected in real time by the system monitoring module to determine the current load status.

[0159] Based on the pre-configured mapping table between service type, load status and adaptation sub-rules, the target adaptation sub-rule can be matched and determined from multiple preset adaptation sub-rules.

[0160] Among them, multiple adaptation sub-rules are differentiated weighted scoring rules preset based on different optimization objectives. By adjusting the weight allocation of three indicators—local entropy value, proportion of continuous repeated segments, and semantic feature type—and the scoring threshold, the optimization direction can be switched between performance priority, security priority, and balanced adaptation. Moreover, all adaptation sub-rules retain a quantitative scoring standard of 0-10 points for anchor point adaptation values, maintaining the same calculation logic as the basic anchor point adaptation rules.

[0161] Multiple adaptation sub-rules can be performance-priority adaptation sub-rules, security-priority adaptation sub-rules, and balanced adaptation sub-rules.

[0162] For example, in financial transaction log backup scenarios, if a high-concurrency write pressure and high CPU utilization exceeding 80% are detected on the storage device, the system switches to a performance-priority adaptation sub-rule. This appropriately reduces the weight of semantic feature types, relaxes the scoring threshold, reduces the computational overhead of sub-region feature analysis, indirectly reduces the L2 anchor verification strength, and improves data flow processing throughput. If a transmission request for core financial transaction data is detected, and data consistency requirements increase, the system switches to a security-priority adaptation sub-rule. This increases the weight of semantic feature types and local entropy values, tightens the scoring threshold, accurately selects high-discrimination anchor extraction regions, enhances L2 anchor verification strength, and reduces the false positive rate. In daily low-concurrency financial transaction log backup scenarios, a balanced adaptation sub-rule is adopted to balance the accuracy of anchor extraction with processing efficiency.

[0163] In this application, multiple sets of differentiated anchor point adaptation sub-rules are preset, and the rule in use is switched according to the actual business type of the target data stream and the current load status of the storage device. The calculation of anchor point adaptation values is therefore no longer limited to fixed rules, and the anchor point extraction strategy adapts dynamically to the business scenario and system operating status. In business scenarios with high load and low latency requirements, computational overhead is reduced and the processing performance and throughput of the entire data deduplication process are ensured; in scenarios with high data consistency requirements, the accuracy of anchor point extraction is improved, the L2 anchor point verification effect is enhanced, and the false judgment rate is reduced.

[0164] S406. Based on the anchor point adaptation value corresponding to each sub-region data, determine multiple target anchor point data corresponding to candidate data blocks in multiple sub-region data.

[0165] Specifically, based on the anchor point adaptation value corresponding to the data in each sub-region, multiple candidate anchor point data corresponding to the candidate data block are determined; based on the total length of the candidate data block, the number of anchor points for multiple target anchor point data is determined; based on the number of anchor points, multiple target anchor point data are obtained from the multiple candidate anchor point data.

[0166] Prioritize selecting sub-regions (high-discrimination regions) with an adaptation value ≥ 6 points as anchor points. If the number of high-priority sub-regions is less than the preset number threshold, supplement sub-optimal sub-regions with an adaptation value of 4-5 points to determine multiple candidate anchor point data.

[0167] The number of target anchor point data needs to be dynamically adjusted in combination with the total length (L) of the candidate data block and the target adaptation sub-rules, and satisfy the constraint of "at least 2 and at most 12" (to avoid insufficient discrimination due to too few numbers and excessive computational overhead due to too many numbers).

[0168] For example, small data blocks (L ≤ 8KB) use 2-3 target anchor point data (which can be reduced to 2 in the performance-priority mode); medium data blocks (8KB < L ≤ 32KB) use 3-5 target anchor point data (4 by default in the balanced mode); and large data blocks (L > 32KB) use 5-12 target anchor point data (which can be extended to 12 in the security-priority mode).

[0169] Multiple candidate anchor point data are in different regions of the candidate data block, and the anchor adaptation value of the candidate anchor point data is greater than the preset adaptation value.

[0170] The sub-regions corresponding to the target anchor point data are evenly distributed in the candidate data block, and the difference in the starting offsets of any two target sub-regions is ≥ the total length L of the candidate data block / the number of target anchor points, to avoid the anchors being concentrated in a certain local area and resulting in a decrease in overall discrimination.

[0171] Select sub-regions that meet the quantity constraint and distribution constraint from multiple candidate anchor point data, and the data of each selected sub-region is the target anchor point data.
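The quantity and distribution constraints can be combined in a simple greedy selection, sketched below in Python for illustration. The sketch makes several assumptions that the text does not fix: the balanced mode uses the midpoint of each range (which gives the stated default of 4 for medium blocks), sub-regions are tried in descending order of adaptation value so that those scoring 6 or more are preferred and those scoring 4-5 only supplement them, and all function names are hypothetical.

```python
from typing import List, Tuple

KB = 1024


def anchor_count(block_len: int, mode: str = "balanced") -> int:
    """Number of target anchors for a block of block_len bytes."""
    if block_len <= 8 * KB:
        low, high = 2, 3
    elif block_len <= 32 * KB:
        low, high = 3, 5
    else:
        low, high = 5, 12
    if mode == "performance":
        return low
    if mode == "security":
        return high
    return (low + high) // 2   # assumed midpoint for the balanced mode


def select_anchors(sub_regions: List[Tuple[int, int]], block_len: int,
                   mode: str = "balanced") -> List[Tuple[int, int]]:
    """Pick target anchors from (start_offset, adaptation_value) pairs."""
    n = anchor_count(block_len, mode)
    min_gap = block_len / n    # required distance between start offsets
    ranked = sorted(sub_regions, key=lambda r: (-r[1], r[0]))
    chosen: List[Tuple[int, int]] = []
    for start, score in ranked:
        if score < 4:          # below the supplementary 4-5 point band
            break
        if all(abs(start - s) >= min_gap for s, _ in chosen):
            chosen.append((start, score))
        if len(chosen) == n:
            break
    return sorted(chosen)
```

If too few sub-regions satisfy both constraints, this sketch returns fewer than n anchors; how that case is handled is not specified in the text.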

[0172] At the same time, the starting / ending offsets, adaptation value scores, etc. of each target anchor point data can be recorded.

[0173] In this application, based on the quantitative screening and dynamic quantity adjustment of the anchor adaptation value, the accuracy and adaptability of the anchor extraction region can be realized, the discrimination and accuracy in the anchor verification stage can be improved, and the system computational overhead can be reduced at the same time.

[0174] S407. Perform hash processing on multiple target anchor point data to obtain an anchor point feature value set.

[0175] A one-way hash calculation is performed on each target anchor point data using a lightweight hash algorithm with low computational overhead and high collision resistance (such as CRC32 or xxHash64) to generate a fixed-length hash value (such as 32 bits or 64 bits).
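As a brief illustration, the anchor point feature value set can be produced by hashing each target anchor's byte range. The Python sketch below uses zlib.crc32 because CRC32 is one of the algorithms named in the text; the function name and the (start, end) offset representation of an anchor are assumptions.

```python
import zlib
from typing import List, Tuple


def anchor_feature_set(block: bytes,
                       anchors: List[Tuple[int, int]]) -> List[int]:
    """Return one 32-bit CRC32 value per (start, end) anchor range."""
    return [zlib.crc32(block[start:end]) for start, end in anchors]
```

Each feature value depends only on the bytes of its own anchor range, so a change elsewhere in the block leaves the other feature values unchanged.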

[0176] S408. According to the anchor point feature value set corresponding to the candidate data block, perform duplicate verification on the candidate data block through the anchor point index table to obtain the first verification result.

[0177] In some embodiments, the anchor feature value set of the candidate data block can be matched with the anchor feature value set of the stored data block in the anchor index table. If the preset matching threshold is met, it is determined to be duplicate data; if the matching threshold is not met, or the anchor feature value set is not matched in the anchor index table, it is determined to be non-duplicate data.

[0178] For example, the preset matching threshold may be that at least 50% of the anchor points hit and all core high-scoring anchor points hit.

[0179] The core high-scoring anchor points are the anchor feature values corresponding to the target anchor data with an anchor point adaptation value ≥ 8 points in the candidate data blocks. These anchor points are extracted from the high-discrimination core regions of the candidate data blocks and have extremely strong uniqueness and recognizability; they are mandatory verification items for multi-anchor collaborative matching. The anchor index table is a high-speed index structure deployed in memory, which stores the anchor feature value sets of all stored data blocks, as well as the mapping between each set and the corresponding data block metadata (storage address, reference count, core anchor identifier). It supports high-speed batch matching of anchor feature value sets and ensures the real-time performance of the verification process.

[0180] For example, if a candidate data block generates 5 anchor points (including 2 core anchor points), and 3 anchor points are matched (including 2 core anchor points, with an overall match rate of 60%), it meets the threshold of the balanced mode and is judged as duplicate data; if only 1 of the 2 core anchor points is matched, even if the overall anchor point match rate is 80%, it is still judged as non-duplicate data; if the anchor point feature value set does not find a matching record in the anchor point index table, it is directly judged as non-duplicate data.

[0181] The preset matching thresholds under different optimization modes can be dynamically adjusted according to the current adaptation sub-rule.

[0182] For example, the threshold for the performance-first mode can be relaxed to "≥30% anchor point hits and any one core anchor point hits", reducing matching computation overhead and improving throughput; the threshold for the security-first mode can be tightened to "≥80% anchor point hits and the original bytes of the core anchor points are consistent in the second verification", further reducing the false positive rate and ensuring verification accuracy; the balanced mode uses the basic threshold of "≥50% anchor point hits and all core high-scoring anchor points hit", taking into account both verification efficiency and accuracy.
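The mode-dependent matching thresholds can be sketched in Python as follows, as an illustration rather than a definitive implementation. The function name, the representation of an anchor as a (feature_value, is_core) pair, and the core_bytes_equal flag (the caller-supplied result of the byte-level comparison of core anchors used in the security-priority mode) are all assumptions.

```python
from typing import List, Set, Tuple


def l2_is_duplicate(candidate: List[Tuple[int, bool]],
                    stored: Set[int],
                    mode: str = "balanced",
                    core_bytes_equal: bool = False) -> bool:
    """Decide the first verification result for one candidate block.

    candidate: (feature_value, is_core) for each target anchor
    stored:    anchor feature values of one stored block from the anchor index
    """
    if not candidate:
        return False
    hit_rate = sum(fv in stored for fv, _ in candidate) / len(candidate)
    core_hits = [fv in stored for fv, is_core in candidate if is_core]

    if mode == "performance":
        # At least 30% of anchors hit and any one core anchor hits.
        return hit_rate >= 0.30 and any(core_hits)
    if mode == "security":
        # At least 80% of anchors hit and core anchor bytes were confirmed equal.
        return hit_rate >= 0.80 and core_bytes_equal
    # Balanced: at least 50% of anchors hit and all core anchors hit.
    return hit_rate >= 0.50 and all(core_hits)
```

With five anchors of which two are core, a hit on three anchors including both core anchors passes the balanced threshold, whereas a hit on four anchors that misses one core anchor does not.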

[0183] This application replaces the crude approach of single-anchor matching with a multi-anchor collaborative matching strategy based on anchor feature value sets and verification rules for core high-scoring anchors. This improves the accuracy of L2 layer anchor verification and reduces the probability of missed detection of duplicate data and false detection of non-duplicate data. Anchor verification is performed only on a small number of candidate data blocks and involves no high-overhead cryptographic computation, so most duplicate data blocks are intercepted while unnecessary L3 layer strong fingerprint computation is avoided, reducing system CPU and I/O resource consumption. It also improves the online processing efficiency and scenario adaptability of the entire hierarchical data deduplication scheme, achieving a dynamic balance between verification accuracy and processing performance.

[0184] S409. If the first verification result indicates that the candidate data block is non-duplicate data, perform duplicate verification on the candidate data block through the second-level fingerprint index table to obtain the second verification result.

[0185] When the first verification result indicates that a candidate data block is non-duplicate data, the block is a suspected duplicate that matched the first-level fingerprint index table but did not pass the L2 layer multi-anchor collaborative verification. It must still be verified against the strong fingerprints in the highly secure second-level fingerprint index table to complete the final uniqueness confirmation, so that duplicate data is not missed because of omitted anchor point features.

[0186] If a consistent strong fingerprint is found in the second-level fingerprint index table, it indicates that the candidate data block is duplicate data, and the second verification result is duplicate data; if no consistent strong fingerprint is found in the second-level fingerprint index table, it indicates that the candidate data block is non-duplicate data, and the second verification result is non-duplicate data.

[0187] It is worth noting that the strong fingerprint matching logic in this step is consistent with the core idea of precise query comparison in weak fingerprint matching at the L1 layer. However, there are clear differences between the two in terms of fingerprint type, index table characteristics, and application scenarios. Weak fingerprint matching focuses on high-speed filtering to intercept non-duplicate data blocks, while strong fingerprint matching in this step focuses on precise fallback to ultimately confirm the uniqueness of the data.

[0188] The second-level fingerprint index table stores the strong fingerprint values of the stored data blocks, as well as the mapping relationship between the strong fingerprints and the corresponding data block metadata (storage address, reference count, and associated anchor feature set). This not only supports fast querying of strong fingerprints but also provides an index supplement for anchor verification of subsequent similar data blocks.
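As a minimal sketch of such a table, the following maps a strong fingerprint to the metadata fields listed above. The use of SHA-256 as the strong fingerprint, the class and field names, and the choice to increment the reference count on a duplicate hit are all assumptions made for illustration; the application does not name a specific hash or update policy.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class BlockMeta:
    storage_address: int
    reference_count: int = 1
    anchor_features: frozenset = field(default_factory=frozenset)


class StrongFingerprintIndex:
    """Hypothetical in-memory second-level (L3) fingerprint index table."""

    def __init__(self):
        self._table = {}  # strong fingerprint (bytes) -> BlockMeta

    @staticmethod
    def fingerprint(block: bytes) -> bytes:
        # SHA-256 stands in for the unspecified strong fingerprint algorithm.
        return hashlib.sha256(block).digest()

    def verify(self, block: bytes):
        """Return the metadata if the block is a duplicate, otherwise None."""
        meta = self._table.get(self.fingerprint(block))
        if meta is not None:
            meta.reference_count += 1  # one more logical reference to the stored block
        return meta

    def register(self, block: bytes, storage_address: int, anchor_features=()):
        """Record a newly stored unique block together with its anchor feature set."""
        meta = BlockMeta(storage_address, 1, frozenset(anchor_features))
        self._table[self.fingerprint(block)] = meta
        return meta
```

Keeping the anchor feature set next to the strong fingerprint is what would let later anchor verification of similar blocks reuse the same record.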

[0189] S410. Mark for deduplication the candidate data blocks whose first verification result is duplicate data or whose second verification result is duplicate data.

[0190] For the execution process of S410, refer to the execution process of S205; it is not repeated here.
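The control flow of the three layers can be summarized in a short sketch. The three predicates are passed in as callables so that the example stays independent of any particular fingerprint or anchor implementation; `Verdict` and `classify_block` are hypothetical names introduced only for illustration.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    NEW_BLOCK = "new"             # unique block, to be written to storage
    DUPLICATE_L2 = "duplicate_l2" # marked for deduplication by anchor verification
    DUPLICATE_L3 = "duplicate_l3" # marked for deduplication by strong fingerprint


def classify_block(block: bytes,
                   l1_hit: Callable[[bytes], bool],
                   l2_duplicate: Callable[[bytes], bool],
                   l3_duplicate: Callable[[bytes], bool]) -> Verdict:
    """Apply the L1 filter, then L2 anchor verification, then L3 fallback."""
    if not l1_hit(block):
        # Weak fingerprint miss: filtered out as non-duplicate without further work.
        return Verdict.NEW_BLOCK
    if l2_duplicate(block):
        # First verification result is duplicate: L3 is never computed.
        return Verdict.DUPLICATE_L2
    if l3_duplicate(block):
        # Second verification result is duplicate.
        return Verdict.DUPLICATE_L3
    return Verdict.NEW_BLOCK
```

Because the checks short-circuit, the expensive `l3_duplicate` predicate runs only for candidates that hit at L1 and fail at L2, which is the cost saving the hierarchy is meant to provide.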

[0191] The data deduplication method provided in this application dynamically adjusts the sliding window and the block boundary threshold based on the local byte repetition rate and the hash value distribution dispersion. It generates coarse-grained data blocks in regions of high data repetition, reducing fingerprint calculation and verification overhead, and fine-grained data blocks in regions of high data heterogeneity, improving the accuracy of duplicate data identification. It quantifies anchor point adaptation values by combining the local entropy, the proportion of continuous repeating segments, and the semantic feature type of each sub-region, and switches adaptation sub-rules according to the business type and system load status, reducing the false positive rate of anchor point verification and avoiding ineffective, high-overhead strong fingerprint calculations. Block parameters, anchor point rules, and matching thresholds all support dynamic adjustment according to business scenarios. This meets the security requirements of high consistency for financial transaction logs and high accuracy for medical images, while also adapting to the performance requirements of high throughput for log backups and low latency for virtual machine images. The method can reduce system resource consumption, improve online processing efficiency, and adapt to a wide range of scenarios, addressing the difficulty of balancing performance and security and the poor adaptability of fixed rules in existing deduplication schemes.

[0192] Figure 5 is a deployment diagram of a data deduplication architecture provided in an embodiment of this application. Referring to Figure 5, the architecture includes various data sources awaiting backup / archiving, such as backup clients, business hosts, and tenant business systems. For each data source, content-defined chunking (CDC) and L1 weak fingerprint calculation are performed to generate data blocks to be stored and their weak fingerprints.

[0193] The migratable proxy node has built-in L1 weak fingerprint filtering and L2 anchor verification capabilities. It receives multiple data blocks to be stored from a data source, identifies suspected duplicate candidate data blocks through L1 weak fingerprint filtering, and performs L2 anchor feature extraction and multi-anchor collaborative verification to identify and deduplicate most duplicate data. Subsequent L3 strong fingerprint verification is triggered only for the small number of candidate data blocks that fail L2 verification.

[0194] The global central index stores all L3 strong fingerprints and anchor feature values and supports tiered (hot / cold) storage. Migratable proxy nodes send L3 fingerprint queries and synchronization requests to this index to complete the final uniqueness verification of candidate data blocks. They also synchronize the anchor feature values and strong fingerprints of newly added data to it, continuously enriching the global index data.

[0195] The storage node is the final storage location for unique, non-duplicate data blocks. The migratable proxy node writes the unique data blocks that have passed L2 / L3 verification to this node. Duplicate data blocks are only marked for deduplication and are not written to physical storage, so storage space is reused efficiently.

[0196] Figure 6 is a schematic diagram illustrating the architecture of a data deduplication method provided in an embodiment of this application. Referring to Figure 6, the target data stream is processed using content-defined chunking (CDC) to generate multiple data blocks of unequal length to be stored. An L1 weak fingerprint is calculated for each data block and matched against the first-level fingerprint index table.

[0197] If the weak fingerprint is not matched, the data block is directly identified as a non-duplicate data block, and a new block is written and stored on the storage node. If the weak fingerprint is matched, the data block is identified as a suspected duplicate candidate data block and proceeds to the subsequent L2 anchor verification stage.

[0198] The candidate data block is divided into sub-regions. Anchor point adaptation values are calculated based on local entropy, the proportion of continuous repeating segments, and semantic feature types. High-quality sub-regions are selected and target anchor point data is extracted. An anchor point feature value set is generated through lightweight hash calculation.

[0199] The anchor feature set of candidate data blocks is matched against the anchor index table using a multi-anchor collaborative matching process. If the anchor matching fails, the data block is determined to be a non-duplicate data block, and a new block is written. If the anchor matching succeeds, the data block is determined to be a duplicate data block, and deduplication is performed. If further verification is required, the process proceeds to the L3 strong fingerprint verification stage.

[0200] If the anchor point matching verification fails, the re-slicing mechanism can be triggered, that is, the CDC block processing is re-executed on the candidate data block to generate a new data block to be stored and re-enter the weak fingerprint hit judgment process to avoid the failure to judge duplicate data due to block boundary offset.

[0201] Figure 7 is a schematic diagram of a data deduplication device provided in an embodiment of this application. Referring to Figure 7, the data deduplication device 700 includes a chunking module 701, a first determining module 702, a second determining module 703, a first verification module 704, a second verification module 705, and a deduplication marking module 706.

[0202] The chunking module 701 is used to perform chunking processing on the target data stream to obtain multiple data blocks to be stored.

[0203] The first determining module 702 is used to determine candidate data blocks from multiple data blocks to be stored based on the first-level fingerprint index table.

[0204] The second determining module 703 is used to determine the set of anchor point feature values corresponding to the candidate data block.

[0205] The first verification module 704 is used to perform duplicate verification on the candidate data block through the anchor point index table, according to the anchor point feature value set corresponding to the candidate data block, to obtain the first verification result.

[0206] The second verification module 705 is used to perform duplicate verification on the candidate data block through the second-level fingerprint index table if the first verification result indicates that the candidate data block is non-duplicate data, to obtain the second verification result.

[0207] The deduplication marking module 706 is used to mark for deduplication the candidate data blocks whose first verification result is duplicate data or whose second verification result is duplicate data.

[0208] In some possible embodiments, the second determining module 703 is specifically used for:

[0209] Based on the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of each sub-region data corresponding to the candidate data block, the anchor point adaptation value corresponding to each sub-region data is determined. The anchor point adaptation value is used to indicate the degree of adaptation between the sub-region data and the anchor point of the candidate data block.

[0210] Based on the anchor point adaptation values corresponding to the data in each sub-region, multiple target anchor point data corresponding to candidate data blocks are determined in multiple sub-region data, and the multiple target anchor point data are hashed to obtain a set of anchor point feature values.

[0211] In some possible embodiments, for any sub-region data, the second determining module 703 is specifically used for:

[0212] Information entropy processing is performed on the sub-region data to obtain the local entropy value corresponding to the sub-region data;

[0213] The ratio of the total length of consecutive identical bytes in the sub-region data to the total length of the sub-region is determined as the proportion of continuous repeating segments in the sub-region data;

[0214] Based on preset semantic feature rules, the semantic feature type corresponding to the sub-region data is determined. The semantic feature type includes semantic feature regions and non-semantic feature regions;

[0215] Based on the anchor point adaptation rules, the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of the sub-region data are processed to obtain the anchor point adaptation value.
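The three inputs and their combination can be sketched as follows. The application does not give a concrete formula, so the weighted sum in `anchor_adaptation_value`, the default weights, the `min_run` parameter, and the marker list `SEMANTIC_MARKERS` are hypothetical choices made only to illustrate the shape of the computation.

```python
import math
from collections import Counter

# Hypothetical preset semantic feature rule: byte patterns that mark structured content.
SEMANTIC_MARKERS = (b"<?xml", b"%PDF", b'{"')


def local_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return sum(c / n * math.log2(n / c) for c in Counter(data).values())


def repeat_run_ratio(data: bytes, min_run: int = 2) -> float:
    """Total length of runs of identical bytes divided by the sub-region length."""
    if not data:
        return 0.0
    covered, run = 0, 1
    for prev, cur in zip(data, data[1:]):
        if cur == prev:
            run += 1
        else:
            if run >= min_run:
                covered += run
            run = 1
    if run >= min_run:
        covered += run
    return covered / len(data)


def is_semantic_region(data: bytes) -> bool:
    """Classify the sub-region as a semantic or non-semantic feature region."""
    return any(marker in data for marker in SEMANTIC_MARKERS)


def anchor_adaptation_value(data: bytes, weights=(0.5, 0.3, 0.2)) -> float:
    """Assumed scoring: high entropy, few repeats, and semantic content score higher."""
    w_entropy, w_non_repeat, w_semantic = weights
    semantic = 1.0 if is_semantic_region(data) else 0.0
    return (w_entropy * local_entropy(data) / 8.0
            + w_non_repeat * (1.0 - repeat_run_ratio(data))
            + w_semantic * semantic)
```

Under this assumed scoring, a sub-region of all-zero bytes scores 0, while a sub-region with varied, non-repeating content scores close to the sum of the first two weights.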

[0216] In some possible embodiments, the anchor point adaptation rule includes multiple adaptation sub-rules; the second determining module 703 is specifically used for:

[0217] Determine the business type and current load status corresponding to the target data stream;

[0218] Based on the business type and current load status, determine the target adaptation sub-rule from multiple adaptation sub-rules;

[0219] Based on the target adaptation sub-rules, the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of the sub-region data are processed to obtain the anchor point adaptation value.
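A sub-rule selector along these lines might look like the sketch below. The business type labels follow the scenarios named in the description (financial transaction logs, medical images, log backups, virtual machine images), but the weights, the `high_load` cutoff, and the precedence of security-sensitive types over load are assumptions, and every identifier is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptationSubRule:
    name: str
    w_entropy: float
    w_non_repeat: float
    w_semantic: float


# Hypothetical weights for each adaptation sub-rule.
SUB_RULES = {
    "performance": AdaptationSubRule("performance", 0.6, 0.4, 0.0),
    "balanced": AdaptationSubRule("balanced", 0.5, 0.3, 0.2),
    "security": AdaptationSubRule("security", 0.4, 0.2, 0.4),
}

SECURITY_SENSITIVE = frozenset({"financial_transaction_log", "medical_image"})
THROUGHPUT_SENSITIVE = frozenset({"log_backup", "vm_image"})


def select_sub_rule(business_type: str, load: float,
                    high_load: float = 0.85) -> AdaptationSubRule:
    """Pick the target sub-rule from the business type and current load (0.0 to 1.0)."""
    if business_type in SECURITY_SENSITIVE:
        # Assumption: accuracy requirements take precedence over load.
        return SUB_RULES["security"]
    if business_type in THROUGHPUT_SENSITIVE or load >= high_load:
        return SUB_RULES["performance"]
    return SUB_RULES["balanced"]


def adaptation_value(rule: AdaptationSubRule, entropy_bits: float,
                     repeat_ratio: float, is_semantic: bool) -> float:
    """Combine the three sub-region features under the selected sub-rule."""
    return (rule.w_entropy * entropy_bits / 8.0
            + rule.w_non_repeat * (1.0 - repeat_ratio)
            + rule.w_semantic * (1.0 if is_semantic else 0.0))
```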

[0220] In some possible embodiments, the second determining module 703 is specifically used for:

[0221] Based on the anchor point adaptation value corresponding to each sub-region data, determine multiple candidate anchor point data corresponding to the candidate data block. The multiple candidate anchor point data are in different regions of the candidate data block, and the anchor point adaptation value of the candidate anchor point data is greater than the preset adaptation value.

[0222] The number of anchor points for multiple target anchor point data is determined based on the total length of the candidate data blocks.

[0223] Based on the number of anchor points, multiple target anchor point data are obtained from multiple candidate anchor point data.
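The selection steps above can be illustrated with a short sketch. The length-to-count rule (one anchor per 4096 bytes, clamped between 3 and 8), the preset adaptation value of 0.5, and the choice to keep the highest-scoring candidates are hypothetical; the application only states that the count depends on the total block length.

```python
def anchor_count_for_length(block_len: int, bytes_per_anchor: int = 4096,
                            min_anchors: int = 3, max_anchors: int = 8) -> int:
    """Assumed rule: longer blocks get more anchors, within fixed bounds."""
    return max(min_anchors, min(max_anchors, block_len // bytes_per_anchor))


def select_target_anchors(scores, block_len: int, min_score: float = 0.5):
    """Pick target anchor sub-regions from per-sub-region adaptation values.

    scores: iterable of (sub_region_index, adaptation_value) pairs, one per
    sub-region, so the selected anchors lie in different regions of the block.
    Returns the chosen sub-region indices in ascending order.
    """
    # Candidate anchors: adaptation value strictly greater than the preset value.
    candidates = [(index, score) for index, score in scores if score > min_score]
    count = anchor_count_for_length(block_len)
    # Keep the highest-scoring candidates up to the computed anchor count.
    best = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:count]
    return sorted(index for index, _ in best)
```

For example, with five sub-regions scored 0.9, 0.2, 0.7, 0.6, and 0.95 in an 8192-byte block, the count clamps to 3 and sub-regions 0, 2, and 4 are selected.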

[0224] In some possible embodiments, the chunking module 701 is specifically used for:

[0225] Based on the preset sliding window, determine the local byte repetition rate and hash value distribution dispersion of the target data stream;

[0226] Based on the local byte repetition rate and hash value distribution dispersion, the preset sliding window and the preset block boundary threshold are adjusted respectively to obtain the adjusted sliding window and the adjusted block boundary threshold;

[0227] Based on the adjusted sliding window and the adjusted block boundary threshold, the target data stream is segmented to obtain multiple data blocks to be stored.
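The adaptive chunking steps can be sketched as below, under several stated assumptions: the block boundary threshold is modeled as the number of low hash bits that must be zero (`mask_bits`), CRC-32 from `zlib` stands in for the unspecified window hash, the hash is recomputed per position rather than rolled (for clarity, not speed), and the cutoffs `rep_hi` and `disp_lo` are hypothetical.

```python
import statistics
import zlib


def local_repetition_rate(sample: bytes) -> float:
    """Fraction of adjacent byte pairs in the sample that are identical."""
    if len(sample) < 2:
        return 0.0
    same = sum(a == b for a, b in zip(sample, sample[1:]))
    return same / (len(sample) - 1)


def hash_dispersion(sample: bytes, window: int) -> float:
    """Population standard deviation of normalized hashes of consecutive windows."""
    hashes = [zlib.crc32(sample[i:i + window]) / 0xFFFFFFFF
              for i in range(0, len(sample) - window + 1, window)]
    return statistics.pstdev(hashes) if len(hashes) > 1 else 0.0


def adjust_parameters(window: int, mask_bits: int, repetition: float,
                      dispersion: float, rep_hi: float = 0.5, disp_lo: float = 0.1):
    """Return (adjusted window, adjusted mask_bits) for the sampled region."""
    if repetition >= rep_hi or dispersion <= disp_lo:
        # Highly repetitive region: larger window, rarer boundaries, bigger blocks.
        return window * 2, mask_bits + 1
    # Heterogeneous region: smaller window, more boundaries, finer blocks.
    return max(16, window // 2), max(8, mask_bits - 1)


def split_blocks(data: bytes, window: int, mask_bits: int,
                 min_size: int, max_size: int):
    """Content-defined chunking: cut where the window hash has mask_bits zero low bits."""
    mask = (1 << mask_bits) - 1
    blocks, start = [], 0
    for end in range(1, len(data) + 1):
        size = end - start
        if size < min_size:
            continue
        window_hash = zlib.crc32(data[max(start, end - window):end])
        if (window_hash & mask) == 0 or size >= max_size:
            blocks.append(data[start:end])
            start = end
    if start < len(data):
        blocks.append(data[start:])  # trailing remainder, may be shorter than min_size
    return blocks
```

Concatenating the returned blocks always reproduces the input, and no block exceeds `max_size`; the actual boundary positions depend on the data content.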

[0228] In some possible embodiments, the first-level fingerprint index table includes first fingerprint values of multiple stored data blocks; the first determining module 702 is specifically used for:

[0229] Determine the first fingerprint value of each of the multiple data blocks to be stored;

[0230] For any data block to be stored, the first fingerprint value of the data block to be stored is matched with the first fingerprint values of multiple stored data blocks to obtain the matching result corresponding to the data block to be stored.

[0231] Among multiple data blocks to be stored, the data blocks that match successfully are identified as candidate data blocks.
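A minimal sketch of this first-level filter follows. Adler-32 from Python's `zlib` module stands in for the weak fingerprint, which the application does not name, and the index table is modeled as a plain set of fingerprint values; `weak_fingerprint` and `filter_candidates` are hypothetical names.

```python
import zlib


def weak_fingerprint(block: bytes) -> int:
    """Cheap first fingerprint value; Adler-32 is an assumed stand-in."""
    return zlib.adler32(block)


def filter_candidates(blocks, stored_fingerprints):
    """Split blocks into L1 hits (candidates) and L1 misses (new blocks).

    stored_fingerprints: set of first fingerprint values of stored data blocks.
    """
    candidates, new_blocks = [], []
    for block in blocks:
        if weak_fingerprint(block) in stored_fingerprints:
            candidates.append(block)   # suspected duplicate, goes on to L2
        else:
            new_blocks.append(block)   # filtered out as non-duplicate
    return candidates, new_blocks
```

Since a weak fingerprint can collide, an L1 hit only makes a block a candidate; the later anchor and strong fingerprint checks decide whether it is actually a duplicate.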

[0232] The data deduplication device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.

[0233] Figure 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. Referring to Figure 8, the electronic device 800 may include a processor 801 and a memory 802 communicatively connected to the processor 801. Exemplarily, the processor 801 and the memory 802 are interconnected via a bus 803.

[0234] The memory 802 stores computer-executable instructions;

[0235] The processor 801 executes the computer-executable instructions stored in the memory 802, causing the processor 801 to perform the data deduplication method shown in the above method embodiments.

[0236] Accordingly, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the data deduplication method of the above method embodiments.

[0237] Accordingly, embodiments of this application may also provide a computer program product, including a computer program, which, when executed by a processor, can implement the data deduplication method shown in the above method embodiments.

[0238] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0239] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily essential to this application.

[0240] It should be further noted that although the steps in the flowchart are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowchart may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.

[0241] It should be understood that the above-described device embodiments are merely illustrative, and the device of this application can also be implemented in other ways. For example, the division of units / modules in the above embodiments is only a logical functional division, and there may be other division methods in actual implementation. For example, multiple units, modules, or components may be combined, or integrated into another system, or some features may be ignored or not executed.

[0242] Furthermore, unless otherwise specified, the functional units / modules in the various embodiments of this application can be integrated into one unit / module, or each unit / module can exist physically separately, or two or more units / modules can be integrated together. The integrated units / modules described above can be implemented in hardware or as software program modules.

[0243] When integrated units / modules are implemented in hardware, the hardware can be digital circuits, analog circuits, etc. The physical implementation of the hardware structure includes, but is not limited to, transistors, memristors, etc. Unless otherwise specified, the processor can be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, and ASIC, etc. Unless otherwise specified, the storage unit can be any suitable magnetic or magneto-optical storage medium, such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), etc.

[0244] If the integrated unit / module is implemented as a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the technical solution of this application, in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a memory and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard drive, magnetic disk, or optical disk.

[0245] In the above embodiments, the descriptions of each embodiment have their own emphasis. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments. The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as the combination of these technical features does not contradict each other, it should be considered within the scope of this specification.

[0246] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this application are indicated by the following claims.

[0247] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims

1. A data deduplication method, the method comprising: segmenting a target data stream to obtain multiple data blocks to be stored; determining candidate data blocks from the multiple data blocks to be stored based on a first-level fingerprint index table; determining a set of anchor feature values corresponding to the candidate data block, and performing duplicate verification on the candidate data block through an anchor index table according to the set of anchor feature values corresponding to the candidate data block, to obtain a first verification result; if the first verification result indicates that the candidate data block is non-duplicate data, performing duplicate verification on the candidate data block through a second-level fingerprint index table to obtain a second verification result; and marking for deduplication the candidate data blocks for which the first verification result is duplicate data or the second verification result is duplicate data.

2. The method according to claim 1, characterized in that determining the set of anchor feature values corresponding to the candidate data block includes: based on the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of each sub-region data corresponding to the candidate data block, determining the anchor point adaptation value corresponding to each sub-region data, wherein the anchor point adaptation value is used to indicate the degree of adaptation between the sub-region data and the anchor point of the candidate data block; and based on the anchor point adaptation values corresponding to the data in each sub-region, determining multiple target anchor point data corresponding to the candidate data block in the multiple sub-region data, and hashing the multiple target anchor point data to obtain the set of anchor feature values.

3. The method according to claim 2, characterized in that, for any sub-region data, determining the anchor point adaptation value corresponding to the sub-region data based on the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of the sub-region data includes: performing information entropy processing on the sub-region data to obtain the local entropy value corresponding to the sub-region data; determining the ratio of the total length of consecutive identical bytes in the sub-region data to the total length of the sub-region as the proportion of continuous repeating segments in the sub-region data; determining, based on preset semantic feature rules, the semantic feature type corresponding to the sub-region data, wherein the semantic feature type includes semantic feature regions and non-semantic feature regions; and processing, based on the anchor point adaptation rules, the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of the sub-region data to obtain the anchor point adaptation value.

4. The method according to claim 3, characterized in that the anchor point adaptation rule includes multiple adaptation sub-rules, and processing, based on the anchor point adaptation rule, the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of the sub-region data to obtain the anchor point adaptation value includes: determining the business type and current load status corresponding to the target data stream; determining, based on the business type and the current load status, the target adaptation sub-rule from the multiple adaptation sub-rules; and processing, based on the target adaptation sub-rule, the local entropy value, the proportion of continuous repeating segments, and the semantic feature type of the sub-region data to obtain the anchor point adaptation value.

5. The method according to claim 2, characterized in that determining, based on the anchor point adaptation values corresponding to the data in each sub-region, multiple target anchor point data corresponding to the candidate data block from the multiple sub-region data includes: determining, based on the anchor point adaptation value corresponding to each sub-region data, multiple candidate anchor point data corresponding to the candidate data block, wherein the multiple candidate anchor point data are in different regions of the candidate data block, and the anchor point adaptation value of each candidate anchor point data is greater than a preset adaptation value; determining the number of anchor points for the multiple target anchor point data based on the total length of the candidate data block; and obtaining, based on the number of anchor points, the multiple target anchor point data from the multiple candidate anchor point data.

6. The method according to claim 1, characterized in that segmenting the target data stream to obtain multiple data blocks to be stored includes: determining, based on a preset sliding window, the local byte repetition rate and hash value distribution dispersion of the target data stream; adjusting, based on the local byte repetition rate and hash value distribution dispersion, the preset sliding window and a preset block boundary threshold respectively to obtain an adjusted sliding window and an adjusted block boundary threshold; and segmenting, based on the adjusted sliding window and the adjusted block boundary threshold, the target data stream to obtain the multiple data blocks to be stored.

7. The method according to claim 1, characterized in that the first-level fingerprint index table includes the first fingerprint values of multiple stored data blocks, and determining candidate data blocks from the multiple data blocks to be stored based on the first-level fingerprint index table includes: determining the first fingerprint value of each of the multiple data blocks to be stored; for any data block to be stored, matching the first fingerprint value of the data block to be stored with the first fingerprint values of the multiple stored data blocks to obtain the matching result corresponding to the data block to be stored; and determining the data blocks to be stored that are successfully matched, among the multiple data blocks to be stored, as the candidate data blocks.

8. A data deduplication device, characterized in that it includes a chunking module, a first determining module, a second determining module, a first verification module, a second verification module, and a deduplication marking module, wherein: the chunking module is used to perform chunking processing on the target data stream to obtain multiple data blocks to be stored; the first determining module is used to determine candidate data blocks among the multiple data blocks to be stored based on the first-level fingerprint index table; the second determining module is used to determine the set of anchor point feature values corresponding to the candidate data block; the first verification module is used to perform duplicate verification on the candidate data block through the anchor point index table, according to the anchor point feature value set corresponding to the candidate data block, to obtain a first verification result; the second verification module is used to, if the first verification result indicates that the candidate data block is non-duplicate data, perform duplicate verification on the candidate data block through the second-level fingerprint index table to obtain a second verification result; and the deduplication marking module is used to mark for deduplication the candidate data blocks whose first verification result is duplicate data or whose second verification result is duplicate data.

9. An electronic device, characterized in that it includes: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the method described in any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions that, when executed by a processor, are used to implement the method described in any one of claims 1-7.

11. A computer program product, characterized in that it includes a computer program that, when executed by a processor, implements the method described in any one of claims 1-7.