Data processing method and apparatus, and device

By using a hierarchical graph index structure to quickly retrieve and reduce similar data blocks, the problem of excessive resource consumption in deduplication is solved, achieving efficient data reduction and system performance protection.

WO2026129770A1 · PCT designated stage · Publication Date: 2026-06-25 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2025-09-16
Publication Date: 2026-06-25

IPC: G06F3/06

AI Tagging

Application Domain

Input/output to record carriers

Technology Topics

Data pack Algorithm

[Figure: abstract drawing, CN2025121599_25062026_PF_FP_ABST]

Abstract

A data processing method and apparatus, relating to the technical field of data storage. The method comprises: determining feature data of a plurality of data blocks, the feature data comprising gradient fingerprints, and the gradient fingerprints being used to characterize content features of multiple dimensions of the data blocks to which the gradient fingerprints belong; on the basis of the degree of gradient fingerprint matching of data block pairs formed by the plurality of data blocks, generating a hierarchical graph index structure of the plurality of data blocks, the hierarchical graph index structure comprising a plurality of graph layers having different hierarchical levels, the plurality of graph layers being used to distribute the data block pairs, and the degree of gradient fingerprint matching of the data block pairs corresponding to the levels of the graph layers in which the data block pairs are distributed; according to a hierarchical order of the plurality of graph layers, traversing, from the hierarchical graph index structure, second data blocks each forming a data block pair with a first data block, the first data block being one of the plurality of data blocks; and, on the basis of a traversal result, performing data reduction processing on the first data block and the second data blocks.

Description

A data processing method, apparatus, and device

[0001] This application claims priority to Chinese Patent Application No. 2024118979301, filed on December 19, 2024, entitled “A Data Processing Method, Apparatus and Device”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of data storage technology, and in particular to a data processing method, apparatus, and device.

Background Technology

[0003] To manage and utilize storage resources more effectively and improve storage efficiency, data deduplication (DD) is generally performed on data blocks to reduce data redundancy. Data deduplication, also known simply as deduplication, stores identical data only once, thereby saving space. It usually operates at the file or block level: completely identical files or blocks are stored only once. Files or blocks whose content is similar but not identical, however, must be handled by a similar dedup process (SDP). Similar dedup is a data reduction technique that combines similarity search with differential compression. It uses algorithms to find related "similar data" and then applies delta compression or merge compression to further reduce the data size.

[0004] Typically, the similarity deduplication process is as follows. First, a similarity fingerprint (SFP) is calculated for every data block. The similarity fingerprints indicate whether, and to what extent, data blocks are similar. After the similarity fingerprints are compared and filtered, data blocks with similar content are clustered together. Then, a differential compression algorithm (delta compression or merge compression) is applied to the differences between similar data blocks. By extracting the byte-level differences between similar data blocks, the fields they have in common are eliminated, so that only the differing fields of each cluster of similar data blocks need to be stored, which improves the data reduction ratio.
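The conventional flow in paragraph [0004] can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of this application: the `sfp_of` callback is a hypothetical placeholder for a real similarity fingerprint, and delta compression is approximated with zlib's preset dictionary (`zdict`), with the reference block acting as the dictionary. All function names are illustrative.

```python
import zlib
from collections import defaultdict


def delta_compress(reference: bytes, target: bytes) -> bytes:
    """Compress `target` with `reference` as a preset dictionary (stand-in for delta coding)."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(target) + comp.flush()


def delta_decompress(reference: bytes, delta: bytes) -> bytes:
    """Rebuild the target block from the reference block and the stored delta."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(delta) + decomp.flush()


def similarity_dedup(blocks: dict, sfp_of) -> dict:
    """Cluster blocks by SFP; keep the first block of each cluster whole, the rest as deltas."""
    clusters = defaultdict(list)
    for block_id, data in blocks.items():
        clusters[sfp_of(data)].append(block_id)  # same SFP -> same cluster
    stored = {}
    for ids in clusters.values():
        reference = ids[0]  # assumption: the first block seen is the reference block
        stored[reference] = ("full", None, blocks[reference])
        for target in ids[1:]:
            delta = delta_compress(blocks[reference], blocks[target])
            stored[target] = ("delta", reference, delta)
    return stored
```

Every block must be fingerprinted and every cluster kept addressable in memory, which is the overhead that paragraph [0005] goes on to describe.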

[0005] However, calculating the similarity fingerprints of all data blocks (or files) and comparing those fingerprints across all pairs is computationally expensive. Furthermore, the metadata of the data blocks (including the similarity fingerprint hash table) is cached entirely in memory during the calculation, which consumes significant memory resources and noticeably reduces foreground read/write (I/O) speed. To avoid a significant drop in foreground read/write speed caused by background deduplication tasks, it is often necessary to lower the data reduction speed or the data reduction ratio of the background task. Alternatively, more resources could be allocated to the deduplication tasks, but this would further worsen the impact on foreground read/write speed. Therefore, how to balance data reduction ratio, reduction speed, and storage system performance in deduplication scenarios is a problem that urgently needs to be solved.

Summary of the Invention

[0006] This application provides a data processing method, apparatus, computing device, device cluster, computer storage medium, and computer program product that can quickly find similar data blocks, save resources, thereby improving data reduction speed, ensuring data reduction ratio, and reducing adverse effects on system performance.

[0007] In a first aspect, embodiments of this application provide a data processing method. After receiving user data, the method performs similar-block search and reduction processing block by block. In this method, feature data of several data blocks can be determined. The feature data includes gradient fingerprints, which characterize the content features of a data block in multiple dimensions. Thus, the higher the gradient fingerprint matching degree between two data blocks (i.e., a data block pair), the more dimensions in which the contents of the pair are identical or similar, meaning that the pair shares more identical or similar content. Next, a hierarchical graph index structure for the data blocks can be generated based on the gradient fingerprint matching degree of the data block pairs. The hierarchical graph index structure includes multiple layers ordered from high to low, across which the data block pairs are distributed. The gradient fingerprint matching degree of a data block pair corresponds to the level of the layer in which the pair is placed; for example, data block pairs with higher gradient fingerprint matching degrees are placed in higher layers of the hierarchical graph index structure. Thus, when similar-block search is performed on these data blocks, for any data block (i.e., the first data block), the second data blocks that form data block pairs with the first data block are traversed from the hierarchical graph index structure in the hierarchical order of the layers, yielding the corresponding traversal result. Therefore, without any comparison calculation at search time, the second data blocks with a higher degree of similarity to the first data block are traversed first. In other words, the traversal result is a set of similar blocks arranged in order of their degree of similarity. Based on this traversal result, data reduction processing can be performed directly on first and second data blocks that are adjacent in the traversal order.

[0008] In this way, based on the hierarchical graph index structure, similar blocks can be quickly retrieved according to their similarity levels. The traversal result directly reflects the similarity between blocks, so blocks that are adjacent in the traversal order can be reduced directly without additional calculations. This improves both reduction speed and data reduction ratio. Furthermore, since similar-block lookup and reduction operations do not rely on fingerprint calculations, memory and computational overhead are also reduced, which minimizes the impact on foreground I/O.
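The idea in paragraphs [0007] and [0008] can be illustrated with a minimal sketch. Several points here are assumptions made only for illustration: a gradient fingerprint is modeled as a tuple with one feature per dimension, the matching degree is taken to be the number of dimensions on which two fingerprints agree, and the layer level is set equal to that matching degree. The text does not fix any of these details, and the names `matching_degree`, `build_index`, and `traverse` are hypothetical. The sketch also compares all pairs for simplicity, whereas later examples restrict which pairs are compared.

```python
from collections import defaultdict


def matching_degree(gfp_a, gfp_b):
    """Assumed metric: number of GFP dimensions on which two blocks agree."""
    return sum(1 for x, y in zip(gfp_a, gfp_b) if x == y)


def build_index(gfps):
    """Return layers[level][block] -> set of partner blocks, with level = matching degree."""
    layers = defaultdict(lambda: defaultdict(set))
    ids = sorted(gfps)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            level = matching_degree(gfps[a], gfps[b])
            if level > 0:  # pairs with no matching dimension are not indexed
                layers[level][a].add(b)
                layers[level][b].add(a)
    return layers


def traverse(layers, first):
    """Second data blocks paired with `first`, visited from the highest layer down."""
    result = []
    for level in sorted(layers, reverse=True):
        result.extend(sorted(layers[level].get(first, ())))
    return result
```

With fingerprints `a=(1,2,3,4)`, `b=(1,2,3,9)`, and `c=(1,7,8,9)`, block `b` shares three dimensions with `a` and block `c` shares one, so traversing from `a` visits `b` before `c` without any comparison at lookup time.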

[0009] In some possible examples, the feature data also includes similar fingerprints. After determining the feature data of several data blocks, the method may include: sorting the data blocks according to the lexicographical order of their similar fingerprints to obtain a block sequence. This ensures that data blocks with matching similar fingerprints are adjacent to each other. Then, in generating a hierarchical graph index structure for several data blocks based on the gradient fingerprint matching degree of the data block pairs formed by the several data blocks, the method may include: calculating the gradient fingerprint matching degree of each data block pair within the block sequence using a preset sliding window, where the sliding window size is W data blocks, W≥2; and generating a corresponding hierarchical graph index structure based on the gradient fingerprint matching degree of each data block pair.

[0010] In this way, the number of calculations for data blocks can be controlled by using a sliding window, so that a limited number of data blocks are distributed in the hierarchical graph index structure. This helps to save computing and memory resources and reduce the impact on front-end I / O while ensuring the data reduction ratio.
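The sorting and sliding-window step of paragraphs [0009] and [0010] might look like the following sketch. It assumes each similar fingerprint is a tuple that can be compared lexicographically; `window_pairs` is a hypothetical name. Only pairs that fall inside a window of W consecutive blocks are produced, so the number of matching-degree calculations grows with n·(W−1) rather than with n².

```python
def window_pairs(sfps, w):
    """Sort blocks by SFP and return (order, pairs), pairing blocks within a window of w blocks."""
    if w < 2:
        raise ValueError("the window must cover at least two data blocks (W >= 2)")
    # Lexicographic sort places blocks with matching SFPs next to each other.
    order = sorted(sfps, key=lambda block_id: sfps[block_id])
    pairs = []
    for i, a in enumerate(order):
        # Pair `a` with the next w - 1 blocks, i.e. every block sharing a window that starts at `a`.
        for b in order[i + 1:i + w]:
            pairs.append((a, b))
    return order, pairs
```

The gradient fingerprint matching degree would then be computed only for the returned pairs before they are placed in the index.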

[0011] In some possible examples, a hierarchical graph index structure for several data blocks is generated based on the gradient fingerprint matching degree of the data block pairs formed by the several data blocks. This includes: determining a subset of blocks from the several data blocks, in which any data block is paired with at least one other data block, and the data blocks in the data block pairs have the same similar fingerprint. If the similar fingerprints are the same, it means that the contents of the block pair may be the same or similar; and generating a corresponding hierarchical graph index structure based on the gradient fingerprint matching degree of each data block pair in the block subset.

[0012] In this way, before generating a hierarchical graph index structure of several data blocks, block pairs with the same similar fingerprints are filtered so that only potentially similar data block pairs are distributed in the hierarchical graph index structure. This avoids adding data blocks that have no similarity to the index structure, which helps control the size of the index structure, reduces the consumption of memory resources, and improves the retrieval speed.
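A minimal sketch of the filtering in paragraphs [0011] and [0012], assuming the similar fingerprint of each block is a hashable value. The function name `candidate_pairs` is hypothetical. Blocks whose similar fingerprint is unique never enter the subset and so never reach the index.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(sfps):
    """Return (block subset, pairs) where both blocks of every pair have the same SFP."""
    buckets = defaultdict(list)
    for block_id, sfp in sfps.items():
        buckets[sfp].append(block_id)
    subset, pairs = set(), []
    for members in buckets.values():
        if len(members) >= 2:  # a block with a unique SFP has no partner
            subset.update(members)
            pairs.extend(combinations(members, 2))
    return subset, pairs
```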

[0013] In some possible examples, several data blocks are divided into multiple batches, including a first batch and a second batch. The method may include: constructing an initialized hierarchical graph index structure, where the number of layers, the number of data blocks each layer can hold, etc., can be set; then, data block pairs from the first batch are added to the hierarchical graph index structure based on gradient fingerprint matching. Thus, when searching for similar blocks from the first batch, the hierarchical graph index structure can be traversed according to the hierarchical order of the multiple layers, identifying each second data block that is a data block pair with the first data block. After retrieving each second data block, the record of the first data block in the hierarchical graph index structure is cleared; the first data block is one of the data blocks from the first batch. Afterwards, with all records of the first batch cleared from the hierarchical graph index structure, data block pairs from the second batch can be added to the same hierarchical graph index structure based on gradient fingerprint matching.

[0014] In this way, by processing several data blocks in batches, the amount of data targeted for each retrieval and reduction is smaller, resulting in less memory and computational overhead, which helps to reduce the impact on front-end I / O. Furthermore, the reuse of the hierarchical graph index structure between different batches avoids the repeated allocation of memory resources occupied by the hierarchical graph index structure, reducing memory consumption, saving time, and improving reduction speed.
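The batch reuse described in paragraphs [0013] and [0014] can be sketched as below. This is an illustration under assumptions: pairs arrive as hypothetical `(block_a, block_b, level)` triples with the layer level already derived from the matching degree, a pair always consists of two distinct blocks, and the class and method names are invented for the example. The same index object is filled, drained, and refilled, so its layers are allocated only once.

```python
class ReusableIndex:
    """Layered adjacency structure that is emptied by retrieval and reused across batches."""

    def __init__(self, num_layers):
        # layers[level] maps a block to its partners; level 0 is left unused.
        self.layers = [dict() for _ in range(num_layers + 1)]

    def add_pair(self, a, b, level):
        self.layers[level].setdefault(a, set()).add(b)
        self.layers[level].setdefault(b, set()).add(a)

    def take_partners(self, first):
        """Return partners of `first` from the highest layer down, clearing its records."""
        found = []
        for level in range(len(self.layers) - 1, 0, -1):
            layer = self.layers[level]
            for partner in sorted(layer.pop(first, ())):
                layer[partner].discard(first)  # drop the reverse record as well
                if not layer[partner]:
                    del layer[partner]
                found.append(partner)
        return found

    def is_empty(self):
        return all(not layer for layer in self.layers)


def process_batches(batches, num_layers):
    """Process each batch of (a, b, level) pairs through one shared index."""
    index = ReusableIndex(num_layers)
    results = []
    for batch in batches:
        if not index.is_empty():
            raise RuntimeError("records of the previous batch were not cleared")
        blocks = {}
        for a, b, level in batch:
            index.add_pair(a, b, level)
            blocks[a] = blocks[b] = None  # remember first-seen order of the blocks
        for first in blocks:
            results.append((first, index.take_partners(first)))
    return results
```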

[0015] In some possible examples, within each batch, the first element of similar fingerprints of data blocks is the same, and / or the total size of data blocks in each batch does not exceed the capacity of the cache used to process that batch. Thus, where cache allows, this approach aims to process potentially identical or similar blocks in the same batch to improve the data reduction ratio.
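One way to form batches that satisfy both conditions of paragraph [0015] is sketched below. The assumptions are that each block is described by a hypothetical `(block_id, sfp, size)` triple, that the similar fingerprint is a tuple whose first element is compared, and that no single block is larger than the cache.

```python
def make_batches(blocks, cache_capacity):
    """Split blocks into batches that share the first SFP element and fit in the cache."""
    batches = []
    current, current_key, current_size = [], None, 0
    for block_id, sfp, size in sorted(blocks, key=lambda b: b[1]):
        # Start a new batch when the first SFP element changes or the cache would overflow.
        if current and (sfp[0] != current_key or current_size + size > cache_capacity):
            batches.append(current)
            current, current_size = [], 0
        current_key = sfp[0]
        current.append(block_id)
        current_size += size
    if current:
        batches.append(current)
    return batches
```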

[0016] In some possible examples, each data block has a marker array that records the layers in which the data block is distributed within the hierarchical graph index structure. Following the hierarchical order of multiple layers, the process involves traversing each second data block that is a data block pair with the first data block from the hierarchical graph index structure. This includes: selecting a first data block from a set of data blocks; traversing the first data block and its corresponding second data blocks layer by layer according to the hierarchical order of the layers described by the marker array of the first data block, wherein the traversal order of each data block first traversed from the hierarchical graph index structure is recorded using a first queue; and clearing the record for the first data block on any given layer after traversing all second data blocks on that layer. In this way, managing the order and number of traversals of each data block through a queue avoids redundant traversals that would cause repeated reduction operations.

[0017] In some possible examples, traversing, from the hierarchical graph index structure and in the hierarchical order of the multiple layers, each second data block that forms a data block pair with the first data block may include: adding first tuple information to a second queue based on the marker array of the first data block, where the second queue stores the tuple information of the data blocks to be traversed, the tuple information includes the number of the highest layer in which a data block is distributed in the hierarchical graph index structure, and the first tuple information is the tuple information of the first data block; if the first tuple information is at the head of the second queue, traversing, in the highest layer described by the first tuple information, the second data blocks corresponding to the first data block, adding each second data block that is traversed for the first time to the first queue, and adding the tuple information of each traversed second data block to the second queue; clearing the record of the first data block in that highest layer and updating the marker array of the first data block; generating second tuple information based on the updated marker array and adding it to the second queue; and, if the second tuple information is at the head of the second queue, traversing, in the highest layer described by the second tuple information, the second data blocks corresponding to the first data block, until all the second data blocks corresponding to the first data block are obtained.
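The two-queue traversal of paragraphs [0016] and [0017] admits more than one reading. The sketch below shows one possible interpretation, with these assumptions: the second queue is a priority queue ordered by highest layer number, so that the tuple with the highest layer is always at its head; tuple information is modeled as `(layer, block)`; the marker (tag) array is a list of booleans indexed by layer; and stale tuples are skipped when their layer has already been cleared. All names are hypothetical.

```python
import heapq


def index_from_pairs(pairs, num_levels):
    """Build layers[level][block] -> partners and a per-block marker (tag) array."""
    layers = [dict() for _ in range(num_levels)]
    tags = {}
    for a, b, level in pairs:
        for x, y in ((a, b), (b, a)):
            layers[level].setdefault(x, set()).add(y)
            tags.setdefault(x, [False] * num_levels)[level] = True
    return layers, tags


def highest_level(tag_array):
    """Highest layer still marked for a block, or None when nothing is left."""
    for level in range(len(tag_array) - 1, -1, -1):
        if tag_array[level]:
            return level
    return None


def similar_block_order(layers, tags, first):
    """Return the first queue: blocks in the order they are first traversed."""
    first_queue, seen = [first], {first}
    second_queue = []  # min-heap over (-layer, block), so the highest layer is at the head

    def enqueue(block):
        level = highest_level(tags[block])
        if level is not None:
            heapq.heappush(second_queue, (-level, block))

    enqueue(first)
    while second_queue:
        neg_level, block = heapq.heappop(second_queue)
        level = -neg_level
        if not tags[block][level]:
            continue  # stale tuple: this layer was already cleared for the block
        partners = sorted(layers[level].pop(block, ()))  # clear the block's record on this layer
        tags[block][level] = False  # update the marker array
        for partner in partners:
            if partner not in seen:  # only first-time traversals enter the first queue
                seen.add(partner)
                first_queue.append(partner)
            enqueue(partner)
        enqueue(block)  # tuple for the block's next-highest layer, if any
    return first_queue
```

For the pairs `(a, b)` on layer 3, `(b, d)` on layer 2, and `(a, c)` on layer 1, starting from `a` yields the order `a, b, d, c`: each block appears exactly once, and blocks linked through higher layers come first.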

[0018] In some possible examples, based on the traversal results, the first data block and the second data block are subjected to data reduction processing, including: dividing all data blocks in the first queue into groups according to a preset granularity, with at least one group including the first data block and at least one second data block; and reducing the data blocks in each group.

[0019] In this way, since the data blocks in the queue are obtained according to their similarity, they can be directly grouped according to the preset granularity and compressed without additional calculations, thus saving resources.
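The grouping step of paragraphs [0018] and [0019] can be sketched as follows. The preset granularity is modeled as a fixed group size, and each group is reduced by concatenating its blocks and compressing them with zlib. Both choices are assumptions for illustration; the text leaves the grouping rule and the reduction algorithm open.

```python
import zlib


def reduce_in_groups(first_queue, blocks, granularity):
    """Cut the first queue into consecutive groups and compress each group as one unit."""
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    reduced = []
    for start in range(0, len(first_queue), granularity):
        group = first_queue[start:start + granularity]
        # Neighbours in the queue are already similar, so no further comparison is needed.
        payload = b"".join(blocks[block_id] for block_id in group)
        reduced.append((group, zlib.compress(payload, 9)))
    return reduced
```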

[0020] In a second aspect, embodiments of this application provide a data processing apparatus, comprising: a feature data extraction module for determining feature data of several data blocks, the feature data including gradient fingerprints, the gradient fingerprints being used to characterize the content features of the data blocks in multiple dimensions; a hierarchical graph index construction module for generating a hierarchical graph index structure of the several data blocks based on the gradient fingerprint matching degree of the data block pairs formed by the several data blocks, the hierarchical graph index structure including multiple layers ordered from high to low, the multiple layers being used to distribute the data block pairs, and the gradient fingerprint matching degree of a data block pair corresponding to the level of the layer in which the pair is distributed; a hierarchical graph index retrieval module for traversing, from the hierarchical graph index structure and in the hierarchical order of the multiple layers, each second data block that forms a data block pair with a first data block, the first data block being one of the several data blocks; and a data reduction module for performing data reduction processing on the first data block and the second data blocks based on the traversal result.

[0021] In some possible examples, the feature data also includes similar fingerprints. The feature data extraction module is also used to sort the data blocks according to the lexicographical order of similar fingerprints to obtain a block sequence. The hierarchical graph index construction module is used to: calculate the gradient fingerprint matching degree of each data block pair located in the sliding window using a preset sliding window, the size of the sliding window is W data blocks, W≥2; and generate the corresponding hierarchical graph index structure according to the gradient fingerprint matching degree of each data block pair.

[0022] In some possible examples, the hierarchical graph index construction module is used to: determine a subset of blocks from the several data blocks, in which any data block forms a data block pair with at least one other data block, and the data blocks in each data block pair have the same similar fingerprint; and generate a corresponding hierarchical graph index structure based on the gradient fingerprint matching degree of each data block pair in the subset.

[0023] In some possible examples, several data blocks are divided into multiple batches, including a first batch and a second batch. The hierarchical graph index construction module is specifically used to: construct an initialized hierarchical graph index structure; add data block pairs from the first batch of data blocks to the hierarchical graph index structure according to the gradient fingerprint matching degree; the hierarchical graph index retrieval module is specifically used to: traverse each second data block that is a data block pair with the first data block from the hierarchical graph index structure according to the hierarchical order of multiple layers, and clear the record of the first data block in the hierarchical graph index structure after obtaining each second data block, where the first data block is one of the data blocks in the first batch; the system also includes a hierarchical graph index structure reuse module, which is used to add data block pairs from the second batch to the hierarchical graph index structure according to the gradient fingerprint matching degree when all records of the first batch of data blocks are cleared in the hierarchical graph index structure.

[0024] In some possible examples, the first element of the similar fingerprints of the data blocks in each batch is the same, and/or the total size of the data blocks in each batch does not exceed the capacity of the cache used to process that batch.

[0025] In some possible examples, each data block has a marker array that records the layer hierarchy in which the data block is distributed in the hierarchical graph index structure; the hierarchical graph index retrieval module is used to: select a first data block from several data blocks; traverse the first data block and its corresponding second data blocks layer by layer according to the hierarchical order of the layers described by the marker array of the first data block, wherein the traversal order of each data block first traversed from the hierarchical graph index structure is recorded by a first queue; after traversing all second data blocks on any layer, clear the record of the first data block on that layer.

[0026] In some possible examples, the hierarchical graph index retrieval module is used to: add first tuple information to a second queue based on the marker array of the first data block, where the second queue stores the tuple information of the data blocks to be traversed, the tuple information includes the number of the highest layer in which a data block is distributed in the hierarchical graph index structure, and the first tuple information is the tuple information of the first data block; if the first tuple information is at the head of the second queue, traverse, in the highest layer described by the first tuple information, the second data blocks corresponding to the first data block, add each second data block that is traversed for the first time to the first queue, and add the tuple information of each traversed second data block to the second queue; clear the record of the first data block in that highest layer and update the marker array of the first data block; generate second tuple information based on the updated marker array and add it to the second queue; and, if the second tuple information is at the head of the second queue, traverse, in the highest layer described by the second tuple information, the second data blocks corresponding to the first data block, until all the second data blocks corresponding to the first data block are obtained.

[0027] In some possible examples, the data reduction module is specifically used to: divide all data blocks in the first queue into groups according to a preset granularity, with at least one group including a first data block and at least one second data block; and reduce the data blocks in each group.

[0028] In a third aspect, embodiments of this application provide a computing device, including: at least one memory for storing a program; and at least one processor for executing the program stored in the memory; wherein, when the program stored in the memory is executed, the processor is used to execute the method described in the first aspect or any possible implementation of the first aspect.

[0029] In a fourth aspect, embodiments of this application provide a computing device cluster, including at least one computing device, each computing device including a processor and a memory; the processor of the at least one computing device is used to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the method described in the first aspect or any possible implementation of the first aspect.

[0030] In a fifth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when run on a processor, causes the processor to perform the method described in the first aspect or any possible implementation of the first aspect.

[0031] In a sixth aspect, embodiments of this application provide a computer program product, characterized in that, when the computer program product is run on a processor, it causes the processor to execute the method described in the first aspect or any possible implementation of the first aspect.

[0032] In a seventh aspect, embodiments of this application provide a chip, characterized in that it includes at least one processor and an interface; the at least one processor obtains program instructions or data through the interface; and the at least one processor is used to execute the program instructions to implement the method described in the first aspect or any possible implementation of the first aspect.

[0033] It is understood that the beneficial effects of the second to seventh aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.

Attached Figure Description

[0034] Figure 1 is a schematic diagram of the structure of a data processing device provided in an embodiment of this application;

[0035] Figure 2 is a schematic diagram of the data processing device performing data reduction processing in an embodiment of this application;

[0036] Figure 3 is a schematic diagram of the structure of a data processing device provided in an embodiment of this application;

[0037] Figure 4 is a schematic diagram of the data processing device performing data block processing in an embodiment of this application;

[0038] Figure 5 is a schematic diagram of the structure of a data processing device provided in an embodiment of this application;

[0039] Figure 6 is a schematic diagram of the generation of the hierarchical graph index structure in an embodiment of this application;

[0040] Figure 7 is a schematic diagram of similar block search based on hierarchical graph index structure in an embodiment of this application;

[0041] Figure 8 is a flowchart illustrating a data processing method provided in an embodiment of this application;

[0042] Figure 9 is a flowchart illustrating a data processing method provided in an embodiment of this application;

[0043] Figure 10 is a flowchart illustrating a data processing method provided in an embodiment of this application;

[0044] Figure 11 is a schematic diagram of the structure of a computing device provided in an embodiment of this application;

[0045] Figure 12 is a schematic diagram of the structure of a computing device provided in an embodiment of this application;

[0046] Figure 13 is a schematic diagram of the structure of a computing device provided in an embodiment of this application.

Detailed Implementation

[0047] In this document, the term "and / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. The symbol " / " indicates that the related objects are in an "or" relationship; for example, A / B means A or B. In the description of the embodiments in this application, unless otherwise stated, "multiple" means two or more. For example, multiple processing units refer to two or more processing units; multiple components refer to two or more components.

[0048] To facilitate understanding of the technical solution of this application, the technical terms involved in this document will be explained below.

[0049] Similar Dedup Process (SDP): Similarity deduplication is a data reduction process that removes duplicate data from a system and performs Delta compression / merging compression on similar data.

[0050] Fingerprint (FP): Specifically refers to a fingerprint in a deduplication scenario. It is typically calculated using a strong hash algorithm and is used to describe the content of a data block. Two data blocks having the same fingerprint indicate that their data content is identical, and deduplication can be performed on them.

[0051] SFP (Similar Fingerprint): A similarity fingerprint used to describe the similar characteristics of data. Two data blocks having the same similarity fingerprint means that their data content is completely identical or partially the same.

[0052] Gradient fingerprint (GFP): This is a fingerprint used to describe the gradient of similarity between data. By comparing the gradient fingerprints of two similar data blocks, the degree of similarity between the data can be determined.

[0053] Similarity Degree: Calculated based on the gradient fingerprints of two data blocks, it represents the amount of identical content between the two data blocks. The range of similarity values is related to the similarity of the block content.

[0054] Similar data: Data blocks that share a large amount of identical content and therefore compress well under Delta compression or merge compression. Similar data blocks usually have completely identical SFPs.

[0055] Similar group: refers to a set of similar data.

[0056] Delta compression is a compression algorithm that takes two data blocks as input, selects one as a reference block and the other as the target block, and compresses the target block based on the reference block (the more similar the data blocks are, the higher the compression benefit).

[0057] Merge compression refers to concatenating all data blocks within a similar group, using the result as input to a compression algorithm, and storing the compressed output.

[0058] Reference Block: The base data block selected when performing Delta compression on two data blocks; the first data block to be merged and compressed can also be called the reference block.

[0059] The target block is the data block that is compressed when two similar data blocks are subjected to Delta compression.

[0060] Similar Combined Block: The block of compressed, encoded information that needs to be stored after a similar group undergoes Delta compression or merge compression.

[0061] Data Reduction Ratio (DDR): The ratio of the amount of data before deduplication to the amount of data after deduplication is the data reduction ratio.
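As a worked illustration of the definition in paragraph [0061], with symbols and figures that are hypothetical and chosen only for the arithmetic: let $S_{\text{before}}$ be the amount of data before deduplication and $S_{\text{after}}$ the amount after it.

```latex
\[
R = \frac{S_{\text{before}}}{S_{\text{after}}},
\qquad
S_{\text{before}} = 100\ \text{GiB},\quad S_{\text{after}} = 25\ \text{GiB}
\;\Longrightarrow\;
R = \frac{100}{25} = 4 .
\]
```

A larger ratio means that less data remains to be stored for the same input.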

[0062] In some deduplication schemes, the storage system creates two hash tables, one for the fingerprints (FP) of data blocks and one for their similar fingerprints. The values of the hash tables are sets of data block information. When the storage system receives a data block to be stored, if the fingerprint of the data block already exists in the fingerprint hash table, the data block is deleted as a duplicate. If it does not exist, the fingerprint is added to the fingerprint hash table and the similar fingerprint of the data block is calculated. If the calculated similar fingerprint exists in the similar fingerprint hash table, the data block is incrementally compressed against the last data block in the corresponding data block information set; otherwise, the data block cannot be incrementally compressed, the entire data block is stored on disk, and the similar fingerprint hash table is updated accordingly. However, this scheme requires both hash tables to be held in memory, which consumes a large amount of memory and significantly impacts foreground read/write (I/O). Furthermore, when searching for a reference block for incremental compression, only data blocks with the same similar fingerprint are considered, so only reference blocks with a high degree of similarity can be found. If no highly similar reference block exists, incremental compression is not performed, even though data blocks with lower similarity can also be incrementally compressed, which results in poor data reduction. In addition, even among data blocks with the same similar fingerprint, the result of incremental compression differs from one reference block to another. Selecting the last data block with the same similar fingerprint as the reference block is therefore not optimal and also leads to poor data reduction.
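The two-table scheme criticized in paragraph [0062] can be sketched as follows, to make its memory footprint and reference choice concrete. Assumptions: SHA-256 stands in for the strong hash, `sfp_of` is a hypothetical placeholder for a similar fingerprint function, and a delta-compressed block is appended to its similar fingerprint set so that it becomes the next "last" block. The text does not spell out that last detail.

```python
import hashlib


class TwoTableDedup:
    """Prior-art style deduplication with an FP table and an SFP table, both held in memory."""

    def __init__(self, sfp_of):
        self.fp_table = {}   # fingerprint -> list of block ids
        self.sfp_table = {}  # similar fingerprint -> list of block ids
        self.sfp_of = sfp_of

    def ingest(self, block_id, data):
        """Return ('duplicate', ref), ('delta', ref), or ('full', None) for an incoming block."""
        fp = hashlib.sha256(data).digest()
        if fp in self.fp_table:
            return ("duplicate", self.fp_table[fp][0])
        self.fp_table[fp] = [block_id]
        sfp = self.sfp_of(data)
        if sfp in self.sfp_table:
            reference = self.sfp_table[sfp][-1]  # always the last block, optimal or not
            self.sfp_table[sfp].append(block_id)
            return ("delta", reference)
        self.sfp_table[sfp] = [block_id]
        return ("full", None)
```

Both dictionaries grow with the number of stored blocks, and the reference is chosen by arrival order rather than by degree of similarity, which are the two weaknesses the paragraph points out.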

[0063] In some similarity deduplication schemes, similarity fingerprints can be omitted. The storage system directly groups consecutive data blocks and identifies duplicate data blocks based on their fingerprints. It then attempts to incrementally compress the non-duplicate data blocks outside the duplicate blocks. If the reduction ratio reaches 1 / 2, the non-duplicate data blocks are considered not similar, the compression is discarded, and the non-duplicate data blocks are stored. Otherwise, the compression result is retained. Clearly, the incremental compression of data blocks in this scheme incurs significant computational overhead, impacting foreground I / O. Foreground IOPS (read / write operations per second) is a key performance indicator for storage systems. To ensure system performance, it's often necessary to adjust low-priority background tasks like similarity deduplication, such as reducing the data reduction speed or ratio; or allocate more computing and memory resources to perform deduplication tasks. However, this often exacerbates the impact on foreground I / O and is therefore not advisable.

[0064] In summary, how to balance data reduction ratio, data reduction speed, and system performance is an urgent problem to be solved.

[0065] To improve data reduction speed and ratio while minimizing the impact on system performance in scenarios involving similar data deletion, this application provides a data processing method. This method primarily uses a multi-layered hierarchical graph index structure to represent the similarity relationships and degrees of similarity between data blocks. This eliminates the need for comparative calculations; data block pairs with different degrees of similarity can be quickly retrieved based on the hierarchical graph index structure for corresponding reduction operations. This results in high retrieval efficiency, improving reduction speed and ratio. Furthermore, the hierarchical graph index structure consumes relatively few resources, thus reducing the impact on system performance.

[0066] The architecture of a data processing device provided in an embodiment of this application is described below. This data processing device can be deployed in a centralized manner or in a distributed manner.

[0067] For example, Figure 1 shows a schematic diagram of the architecture of a centrally deployed data processing device provided in an embodiment of this application. As shown in Figure 1, the data processing device 100 can be deployed on a computing device 10, which can be a physical computer or a virtualization device, such as a virtual machine or a container.

[0068] In this example, the computing device 10 may include a processor 110, and a memory 120 and a network interface 130 communicatively connected to the processor 110, wherein the processor 110 is the control center of the computing device 10. The processor 110 may be a central processing unit (CPU), graphics processing unit (GPU), data processing unit (DPU), neural processing unit (NPU), etc., but is not limited thereto. The memory 120 may be internal memory, such as random access memory (RAM), used to store instructions or data that the processor 110 may access multiple times. The memory 120 may also be persistent storage, such as a hard disk, optical disk, register, read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), non-volatile RAM (NVRAM), etc. The memory 120 can also be used to store programs related to this embodiment. The network interface 130 may optionally include a standard wired interface, a wireless interface (including but not limited to Wi-Fi, mobile communication interface, etc.), etc., for data transmission and reception under the control of the processor 110.

[0069] The device structure shown in Figure 1 does not constitute a limitation on the computing device 10 and may include more or fewer components than shown, or combine certain components, or have different component arrangements.

[0070] In this embodiment, the computing device 10 can receive data written by the user through the network interface 130, and then the data processing device 100 analyzes and processes it to retrieve corresponding duplicate data and perform deduplication (i.e., reduction) operations. The deduplicated data blocks are then written to the data persistence layer 200 (i.e., the file system or physical storage medium) to reduce data redundancy. The processor 110 and memory 120 provide the computing and memory resources required by the data processing device 100 during the deduplication operation.

[0071] Specifically, the data processing device 100 may include a feature data extraction module 101, a hierarchical graph index construction module 102, a hierarchical graph index retrieval module 103, and a data reduction module 104. Wherein:

[0072] The feature data extraction module 101 can be used to determine the feature data of several data blocks. The feature data includes gradient fingerprints, which are used to characterize the content features of a data block in multiple dimensions.

[0073] The hierarchical graph index construction module 102 is used to generate a hierarchical graph index structure for the several data blocks based on the gradient fingerprint matching degree of the data block pairs formed by the data blocks. The hierarchical graph index structure includes multiple layers at different levels. The multiple layers are used to distribute the data block pairs, and the gradient fingerprint matching degree of each data block pair corresponds to the level of the layer in which the pair is distributed.

[0074] The hierarchical graph index retrieval module 103 is used to traverse, from the hierarchical graph index structure and according to the hierarchical order of the multiple layers, each second data block that forms a data block pair with the first data block. The first data block is one of the several data blocks.

[0075] The data reduction module 104 is used to perform data reduction processing on the first data block and the second data block based on the traversal results.

[0076] It should be noted that in practical applications, the data processing device 100 shown in Figure 1 can be implemented by software, or by hardware or a combination of hardware and software.

[0077] Data processing system 100, as an example of a software functional unit, may include code running on computing instances. These computing instances may include at least one of a host, a virtual machine, and a container. Further, the aforementioned computing instances may be one or more. For example, data processing system 100 may include code running on multiple hosts / virtual machines / containers. It should be noted that the multiple hosts / virtual machines / containers used to run the code may be distributed within the same region or in different regions. Further, the multiple hosts / virtual machines / containers used to run the code may be distributed within the same availability zone (AZ) or in different AZs, each AZ including one or more geographically proximate data centers. Typically, a region may include multiple AZs.

[0078] Similarly, multiple hosts / virtual machines / containers used to run this code can be distributed within the same Virtual Private Cloud (VPC) or across multiple VPCs. Typically, a VPC is set up within a region. Communication between two VPCs within the same region, as well as between VPCs in different regions, requires a communication gateway to be set up within each VPC to enable interconnection between VPCs.

[0079] In some possible implementations, the data processing process of the data processing system 100 can be as shown in Figure 2. Specifically, the data written by the user is received by the data processing system 100 and then enters the feature data extraction module 101. The feature data extraction module 101 can divide the data into several data blocks in step S1, and calculate the gradient fingerprint (GFP) of each data block in step S2. The GFP of each data block is a multi-dimensional vector containing multiple feature values. One feature value can characterize the content feature of one dimension of the data block. These data blocks can be paired up to obtain multiple data block pairs. The more dimensions in which the GFPs of the two data blocks in a pair are consistent, the higher the similarity between the two data blocks.

[0080] Next, in step S3, the hierarchical graph index construction module 102 generates a hierarchical graph index structure (index) for the data blocks based on the GFP matching degree of each data block pair. This hierarchical graph index structure includes multiple layers at different levels. Each layer can distribute multiple data block pairs (i.e., node pairs connected by edges), and the higher the GFP matching degree of a data block pair, the higher the layer in which it is distributed. In this way, when a data block is paired with multiple other data blocks, the hierarchical graph index retrieval module 103 executes step S4: without comparison calculations, it can directly and quickly traverse the hierarchical graph index structure layer by layer, according to the hierarchical order, to find the similar blocks of the data block from high to low similarity. For example, in Figure 2, after searching, data blocks A and B are found to be the most similar to each other, while data block C is slightly less similar to data block B (the data block pair B and C is distributed at a lower level of this index than the data block pair A and B), and data block C is also the most similar to data block D; similarly, data blocks F and G are found to be the most similar to each other, data block H is slightly less similar to data block G, data blocks J and K are found to be the most similar to each other, and so on. It is evident that blocks with higher similarity are closer together in the traversal results of the hierarchical graph index retrieval module 103. Based on this traversal result, the data reduction module 104 directly performs reduction processing such as merging and compressing adjacent blocks, i.e., step S5, to obtain a high data reduction ratio. Therefore, in the process of retrieving and reducing similar blocks, the processing logic complexity is low, the computational overhead is low, and the memory overhead is also small, which helps to reduce the impact on front-end (i.e., user-facing interfaces or services, etc.) read / write (IO).

[0081] In some other possible implementations, the higher the similarity of the data block pairs, the lower the layer of the distribution on the hierarchical graph index structure. In this way, the similarity of the data blocks can be traversed layer by layer from low level to high level, and the traversal results obtained can be reduced to obtain a high reduction ratio.

[0082] Furthermore, in some possible implementations, the data processing system 100 is deployed in a distributed architecture as shown in Figure 3. The modules of the data processing system 100 can be distributed across multiple computing devices 10 of the device cluster 00. These multiple computing devices 10 communicate and interact with each other, jointly providing the computing, storage, and network resources required by the data processing system 100. It should be understood that, similar to the centralized deployment described above, a distributed data processing system 100 can also achieve the aforementioned data processing process, which will not be elaborated further.

[0083] Next, the process of data reduction performed by the data processing apparatus 100 provided in the embodiments of this application will be described.

[0084] In this embodiment, the data processing device 100 receives data written by the user, and then the feature data extraction module 101 divides this data into blocks according to a preset granularity to obtain multiple data blocks of equal size. The preset granularity can be 8 kibibytes (KiB), or a larger or smaller granularity, such as 4 KiB, 64 KiB, etc., but is not limited to these. For example, the feature data extraction module 101 can generate data blocks one by one as data is continuously written, so the generated data blocks can have a sequential order.

[0085] Next, referring to Figure 4, the feature data extraction module 101 executes step S11, sequentially performing hash calculations on the contents of these data blocks to obtain feature data for each data block. This feature data may include similarity fingerprints (SFP) and gradient fingerprints (GFP). The similarity fingerprint (SFP) of a data block can characterize the similarity features of its content. If two data blocks have the same SFP, then the contents of the two data blocks are completely identical or partially identical. As a specific example, a weak hashing algorithm can be used when calculating the SFP of data blocks, thereby relaxing the similarity matching criteria for data blocks. This makes it easier for data blocks to generate the same SFP even when their similarity is low, which helps to improve the data reduction ratio.

[0086] For example, the GFP of a data block can be a multidimensional vector formed by multiple feature values of the data block. These feature values can be calculated from the content of the data block using different hash functions, thus reflecting the feature similarity of data blocks in different dimensions. If two data blocks have the same feature value in the same dimension of their GFPs, the two data blocks are similar in that dimension. For example, if the GFPs of two data blocks each contain 12 feature values, and the feature values of the two GFPs are the same in the 1st, 2nd, and 3rd dimensions, the two data blocks are similar in the 1st, 2nd, and 3rd dimensions. Thus, the more GFP dimensions in which two data blocks share the same feature value, the higher the degree of similarity between them. The degree of similarity can therefore be quantified by calculating the distance between the GFPs of two data blocks, that is, by counting the dimensions in which the two GFPs are identical. For example, if the GFPs of two data blocks have 9 identical dimensions, the similarity between the two data blocks is quantified as 9.
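As an illustrative sketch of the quantification described above, the similarity of two GFPs can be computed by counting matching dimensions. The function name `gfp_similarity` and the integer feature values are assumptions for the example; how the feature values themselves are hashed is not shown.

```python
def gfp_similarity(gfp_a, gfp_b):
    """Count the dimensions in which two GFP vectors hold the same feature value."""
    if len(gfp_a) != len(gfp_b):
        raise ValueError("GFPs must have the same number of dimensions")
    return sum(1 for x, y in zip(gfp_a, gfp_b) if x == y)


# Two hypothetical 12-dimensional GFPs that agree in their first 9 dimensions.
gfp_1 = (11, 22, 33, 44, 55, 66, 77, 88, 99, 1, 2, 3)
gfp_2 = (11, 22, 33, 44, 55, 66, 77, 88, 99, 7, 8, 9)
similarity = gfp_similarity(gfp_1, gfp_2)
```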

[0087] Referring again to Figure 4, in this embodiment, the feature data extraction module 101 can also execute step S12, sorting the data blocks according to the lexicographical order of their similarity fingerprints (SFPs). If the SFPs of data blocks are completely identical, those data blocks are sorted according to the lexicographical order of their data block IDs. Thus, in the data block sequence output by the feature data extraction module 101, data blocks with the same SFP are placed in adjacent positions. For example, feature data is calculated sequentially for several data blocks A, B, C, D, ... The SFP and GFP of each data block are similar to A_SFP and A_GFP of data block A shown in Figure 4, where A_GFP is a multidimensional vector composed of multiple hash values. The feature data extraction module 101 then sorts the data blocks according to their SFPs to obtain a block sequence. For example, if the SFPs of data blocks A through E are completely identical, the data block order after sorting by SFP lexicographical order is still: A, B, C, D, E.
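The ordering in step S12 can be sketched as a sort on a composite key. The `Block` record and its field names are hypothetical and serve only to illustrate the ordering rule (SFP first, block ID as the tie-breaker).

```python
from typing import NamedTuple


class Block(NamedTuple):
    block_id: str  # hypothetical identifier of the data block
    sfp: bytes     # similarity fingerprint


def sort_blocks(blocks):
    """Sort by SFP in lexicographical order; equal SFPs fall back to block ID."""
    return sorted(blocks, key=lambda b: (b.sfp, b.block_id))


same_sfp = [Block(i, b"\x10\x20") for i in "EDCBA"]
ordered_ids = [b.block_id for b in sort_blocks(same_sfp)]
```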

[0088] Next, as shown in step S13 of Figure 4, the data blocks output by the feature data extraction module 101 can be processed by the hierarchical graph index construction module 102. In some possible implementations, all data blocks output by the feature data extraction module 101 can be divided into multiple batches, and each batch can be input into the hierarchical graph index construction module 102 to construct the corresponding hierarchical graph index structure. This facilitates subsequent batch-by-batch analysis of duplicate data and reduction operations, thereby reducing the resource consumption of the deduplication task through this fine-grained data block processing. Of course, in other possible implementations, the hierarchical graph index construction module 102 can also construct a hierarchical graph index structure for all data blocks output by the feature data extraction module 101, with the principle being similar to constructing a hierarchical graph index structure for a single batch of data blocks.

[0089] The following section uses the example of the hierarchical graph index construction module 102 building a hierarchical graph index structure for data blocks in batches to illustrate the construction principle of the index structure.

[0090] For example, as shown in Figure 5, the data processing device 100 may further include an index structure reuse module 105. The index structure reuse module 105 can determine the aggregation conditions of data blocks in the cache based on the data block sequence output by the feature data extraction module 101, so that each batch of data blocks formed according to the aggregation conditions contains as many similar data blocks as possible. The aggregation conditions may include, but are not limited to: the data blocks aggregated into a batch have SFPs with the same first byte, and the total size of a batch of data blocks does not exceed the capacity of the corresponding cache.

[0091] Take as an example the data blocks A, B, C, ... generated by the aforementioned feature data extraction module 101. Data blocks A through E have the same SFP (or the first byte of their SFPs is the same), and each data block is 8 KiB in size. If the buffer size is 5 × 8 KiB, meaning the buffer can hold a maximum of 5 data blocks at a time, the aggregation conditions are set so that the 5 data blocks A through E are processed in the same batch. If the buffer size is 2 × 8 KiB, the aggregation conditions are set so that the 5 data blocks A through E are divided into 3 batches for processing, such as data blocks A and B in one batch, data blocks C and D in another batch, and the remaining data block E in yet another batch. In this way, in large-scale storage scenarios, a large number of data blocks can be divided into multiple batches based on certain similarities (e.g., dividing 1000 data blocks into a batch based on buffer capacity, etc.), resulting in low resource consumption. Furthermore, each batch of data blocks can be entirely cached in memory, reducing the frequency of data swapping in and out of memory during computation and thus reducing memory consumption.
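A minimal sketch of such batching, assuming the two aggregation conditions named above (same first SFP byte, batch no larger than the buffer), is given below. The function name `make_batches` and its parameters are hypothetical; an actual implementation may apply further conditions.

```python
def make_batches(sorted_blocks, buffer_bytes, block_bytes=8 * 1024):
    """Split an SFP-sorted sequence of (block_id, sfp) into batches.

    A new batch starts when the first SFP byte changes or when the
    current batch already fills the buffer.
    """
    per_batch = buffer_bytes // block_bytes
    if per_batch < 1:
        raise ValueError("buffer must hold at least one block")
    batches, current, current_key = [], [], None
    for block_id, sfp in sorted_blocks:
        key = sfp[0]  # first byte of the SFP
        if current and (key != current_key or len(current) == per_batch):
            batches.append(current)
            current = []
        current.append(block_id)
        current_key = key
    if current:
        batches.append(current)
    return batches


blocks = [(c, b"\x07abc") for c in "ABCDE"]  # A..E share the same SFP
one_batch = make_batches(blocks, buffer_bytes=5 * 8 * 1024)
three_batches = make_batches(blocks, buffer_bytes=2 * 8 * 1024)
```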

[0092] In this embodiment, after batching the data block sequence, the hierarchical graph index construction module 102 constructs a corresponding hierarchical graph index structure for each batch of data blocks. Specifically, the hierarchical graph index construction module 102 can first perform matching calculations based on the similarity fingerprints (SFPs) of data blocks in a single batch. If any pair of data blocks in a single batch has matching SFPs, then the pair of data blocks is determined to be likely similar. Thus, one or more pairs of data blocks with matching SFPs can be selected from a single batch, i.e., similar block pairs. Then, the hierarchical graph index construction module 102 constructs a corresponding hierarchical graph index structure for the data block pairs within a single batch, that is, for all data blocks with similar blocks within a single batch. For data blocks in a single batch that have no similar blocks, it is not necessary to construct a hierarchical graph index structure for them. In this way, in large-scale storage scenarios, potentially similar data block pairs can be found through SFP filtering for subsequent duplicate data analysis processing, while data blocks that are unlikely to have similar blocks are abandoned, avoiding unnecessary calculations that waste resources and improving the reduction speed.

[0093] It should be noted that in other possible implementations, when resources are sufficient or the reduction speed requirement is low, SFP filtering can be omitted, and a hierarchical graph index structure can be directly built for all data blocks in a single batch, which will not be described in detail here.

[0094] In this embodiment, after filtering out data blocks with similar characteristics, the hierarchical graph index construction module 102 can construct a corresponding hierarchical graph index structure for each batch of filtered data blocks. This hierarchical graph index structure is an undirected graph structure with multiple levels. Each level is an undirected, unweighted graph, and each level can store several nodes, each representing a data block. Two connected nodes on a level are neighbors, representing a pair of data blocks with a similar relationship. In other words, the edges of a level only indicate that two nodes are related and do not contain other information. The higher the level of the layer, the higher the similarity of the data block pairs on that layer. For example, the hierarchical graph index construction module 102 can first create an initialized hierarchical graph index structure and determine the number of levels and the number of nodes it can accommodate during initialization. Thus, when the hierarchical graph index construction module 102 generates the corresponding hierarchical graph index structure for a single batch of data blocks, it can add each data block pair within the single batch to the corresponding layer based on their similarity. The following description still uses the hierarchical graph index structure construction process for a single batch of data block pairs as an example.

[0095] For example, the hierarchical graph index construction module 102 calculates the GFP distance between each data block and its similar blocks selected within a single batch to determine the degree of similarity. Data block pairs whose similarity reaches a preset threshold are then added to the corresponding level of the hierarchical graph index structure, together with the edges representing their similarity relationship, and become neighbors. The preset threshold can be a value greater than 0, but is not limited to this. Furthermore, the similarity of a data block pair can be quantified by how many dimensions of the GFPs of the two data blocks are identical. Thus, provided the similarity of the data block pair reaches the preset threshold, the level at which the data block pair should be distributed in the hierarchical graph index structure can be determined based on the similarity of the data block pair and a preset hierarchy coefficient, where the hierarchy coefficient is a constant.

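[0096] The level of a data block pair in the hierarchical graph index structure can be calculated using the following equation (Equation 1):

$$\text{layer number} = \left\lceil \frac{\text{similarity}}{\text{hierarchy coefficient}} \right\rceil$$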

[0097] In this example, since any data block in a single batch may be similar to multiple other blocks, any data block can be added to different layers of the hierarchical graph index structure along with different similar blocks. For example, if the GFPs of data block pair (A, B) are identical in 11 dimensions (i.e., the similarity is 11) and the hierarchy coefficient is 2, then according to Equation 1, the layer number of data blocks A and B is 11 / 2 rounded up, thus they are distributed in the 6th layer of the hierarchical graph index structure and are neighbors. If the similarity of data block pair (A, C) is 6, then according to Equation 1, the layer number of data blocks A and C is 6 / 2 rounded up, thus they are distributed in the 3rd layer of the hierarchical graph index structure and are neighbors. Therefore, data block A is located in the 6th and 3rd layers of the hierarchical graph index structure, and the maximum layer number (or highest layer number) of data block A is 6.
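The layer numbers in the example above follow from dividing the similarity by the hierarchy coefficient and rounding up. A short sketch of that arithmetic is shown below; the function name `layer_of` is hypothetical.

```python
def layer_of(similarity: int, coefficient: int = 2) -> int:
    """Layer number = similarity / hierarchy coefficient, rounded up.

    Integer arithmetic is used for the rounding; a result of 0 means the
    pair is not placed in any layer.
    """
    if coefficient <= 0:
        raise ValueError("hierarchy coefficient must be positive")
    return -(-similarity // coefficient)


layer_ab = layer_of(11)  # pair (A, B): similarity 11
layer_ac = layer_of(6)   # pair (A, C): similarity 6
```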

[0098] For example, the hierarchical graph index construction module 102 can generate a corresponding tag array for each data block to record all levels at which the data block is located and the maximum level among them. For example, if data block A is located in level 6 and level 3, its tag array can be represented as: {occurrence levels: 6, 3; maximum level: 6}.
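One possible in-memory shape for such a tag array is sketched below. The class `TagArray` and its method names are assumptions for illustration; the maximum level is derived from the recorded levels and falls back to 0 when the block no longer appears at any level.

```python
from dataclasses import dataclass, field


@dataclass
class TagArray:
    levels: set = field(default_factory=set)  # levels at which the block occurs

    @property
    def max_level(self) -> int:
        return max(self.levels, default=0)

    def add(self, level: int) -> None:
        self.levels.add(level)

    def clear(self, level: int) -> None:
        """Drop the record for one level, e.g. after its neighbors were traversed."""
        self.levels.discard(level)


tag_a = TagArray()
tag_a.add(6)
tag_a.add(3)
```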

[0099] Furthermore, if the similarity between a pair of data blocks is insufficient for the pair to be added to any layer of the hierarchical graph, for example, if the similarity between the two data blocks is 0, then the pair of data blocks is determined to be dissimilar and is not added to the index structure. This avoids adding low-value data block pairs to the hierarchical graph index structure, which would slow down subsequent similar data retrieval.

[0100] In some possible implementations, to improve the efficiency of constructing the hierarchical graph index structure for a single batch of data blocks, the hierarchical graph index construction module 102 can utilize a pre-configured sliding window to calculate the similarity between a data block and subsequent data blocks within the sliding window when calculating the similarity between data blocks, thereby limiting the number of calculations for each data block. It is understandable that although the hierarchical graph index construction module 102 calculates the similarity of data blocks obtained after the initial screening of SFPs, these data blocks are sorted lexicographically by SFP, meaning that the blocks near a data block are those most likely to be similar or identical to it. Therefore, the calculation range set by the sliding window can basically cover the similar blocks of the data block.

[0101] Therefore, specifically, the sliding window size is pre-configured to W (W≥2) data blocks. Starting from the first data block in a single batch, the similarity between it and subsequent data blocks within the sliding window W is calculated to obtain the similarity of corresponding data block pairs. In this way, for any given data block, only its similarity with the following W data blocks is calculated (equivalent to calculating the similarity of each data block (excluding the first and last data blocks) with its preceding W data blocks and its following W data blocks), thereby controlling the number of times the similarity is calculated for each data block, thus reducing computational and memory overhead. For example, as shown in Figure 6, taking a sliding window W=2 and calculating the similarity of several data blocks A, B, C, ... as an example, first, the similarity between data block A and data blocks B and C located within the sliding window W is calculated. If the hierarchy coefficient is 2, the GFP matching degree (similarity) between data block A and data block B is 4, and the GFP matching degree (similarity) between data block A and data block C is 0. Then, data block A and data block B are added to the second layer of the hierarchical graph index structure and connected. The similarity between data block A and data block C does not reach the threshold, so they are not added to any layer. Similarly, the sliding window calculates the GFP matching degree between data block B and data blocks C and D, then calculates the GFP matching degree between data block C and data blocks D and E, and so on.
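The sliding-window construction can be sketched as follows, under stated assumptions: the index is modeled as a nested dictionary `{level: {block_id: set_of_neighbors}}`, the threshold is expressed as a hypothetical `min_similarity` parameter (a pair is kept when its similarity reaches it), and the function names are illustrative only.

```python
from collections import defaultdict


def gfp_similarity(gfp_a, gfp_b):
    """Number of dimensions in which two GFPs are identical."""
    return sum(1 for x, y in zip(gfp_a, gfp_b) if x == y)


def build_index(blocks, window=2, coefficient=2, min_similarity=1):
    """blocks: list of (block_id, gfp), already sorted by SFP.

    Each block is compared only with the `window` blocks that follow it.
    """
    index = defaultdict(lambda: defaultdict(set))
    for i, (id_a, gfp_a) in enumerate(blocks):
        for id_b, gfp_b in blocks[i + 1 : i + 1 + window]:
            sim = gfp_similarity(gfp_a, gfp_b)
            if sim < min_similarity:
                continue  # dissimilar pair: not added to any layer
            level = -(-sim // coefficient)  # similarity / coefficient, rounded up
            index[level][id_a].add(id_b)
            index[level][id_b].add(id_a)
    return {lvl: dict(nodes) for lvl, nodes in index.items()}


# Mirrors the Figure 6 example: similarity(A, B) = 4, similarity(A, C) = 0.
gfp_a = tuple(range(12))
gfp_b = (0, 1, 2, 3) + tuple(range(100, 108))
gfp_c = tuple(range(200, 212))
index = build_index([("A", gfp_a), ("B", gfp_b), ("C", gfp_c)], window=2)
```

With a hierarchy coefficient of 2, the pair (A, B) lands in layer 2 and the pair (A, C) is left out, matching the description above.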

[0102] In this way, the number of neighbors of a node in the hierarchical graph index structure can be controlled by the sliding window, which in turn controls the size of the hierarchical graph index structure. This saves memory resources, improves data reduction speed, and consequently reduces the impact on front-end business operations.

[0103] In this embodiment, after the hierarchical graph index construction module 102 adds data block pairs with a similarity level reaching a preset threshold within a single batch to the hierarchical graph index structure, the hierarchical graph index retrieval module 103 can perform duplicate data retrieval analysis on the data blocks in a single batch. It can traverse the hierarchical graph index structure to find similar blocks with varying similarity levels from high to low without comparison calculations. Since each data block may be a neighbor of other data blocks at any level, for any data block, a depth-first search strategy can be performed between levels and a breadth-first search strategy within a level. That is, when analyzing the similarity between data blocks, the higher-level data block and all its neighbors can be analyzed first. After visiting its higher-level neighbors, all its lower-level neighbors are visited. If a lower-level data block also exists at a higher level, the search jumps back to that higher level to quickly find the most valuable similar blocks for reduction, thus improving the reduction effect.

[0104] The following section uses retrieval on a hierarchical graph index structure built from the data blocks of a single batch as an example to illustrate the retrieval process.

[0105] Specifically, data blocks within a single batch are sequentially selected and entered into the hierarchical graph index retrieval module 103. Once a selected data block completes the subsequent retrieval process, it is marked. If a data block is already marked when it is selected, the selection is abandoned and the next data block is selected instead. This avoids repeatedly selecting and analyzing the same data blocks, which would waste computing and memory resources and lower data reduction efficiency. Successfully selected data blocks are the next data blocks to be accessed. In this example, successfully selected data blocks that are being accessed for the first time are added to the traversal queue (also referred to herein as the first queue), and a priority queue (also referred to herein as the second queue) is used to store information about the next data blocks to be accessed. The priority queue stores data block information in the form of tuples, which can be represented as <highest level of data block, data block id>. The priority queue is ordered by the first element of the tuple, that is, sorted from the highest level to the lowest. If data blocks are at the same level, they are sorted according to the lexicographical order of their data block ids. The head of the priority queue is the candidate node for the next access. For example, if the data blocks A to E are selected sequentially and the first selected data block A has a highest level of 2, the resulting tuple can be represented as <2, A>. This tuple is added to the head of the priority queue, and at this time, data block A becomes the candidate data block for the next access.
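The ordering of the priority queue described above can be sketched with Python's `heapq` module, which implements a min-heap; storing the negated level makes the highest level come out first, with the block id as the tie-breaker. The wrapper class `LevelQueue` is hypothetical.

```python
import heapq


class LevelQueue:
    """Priority queue of <highest level, block id> tuples.

    Highest level first; equal levels are ordered by block id.
    """

    def __init__(self):
        self._heap = []

    def push(self, level: int, block_id: str) -> None:
        heapq.heappush(self._heap, (-level, block_id))

    def pop(self):
        neg_level, block_id = heapq.heappop(self._heap)
        return (-neg_level, block_id)

    def __len__(self):
        return len(self._heap)


queue = LevelQueue()
queue.push(1, "C")
queue.push(2, "D")
queue.push(2, "B")
popped = [queue.pop() for _ in range(3)]
```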

[0106] Next, using a depth-first strategy between levels and a breadth-first strategy within a level, and based on the tuple of the candidate data block, all neighbors of the candidate data block at its highest level in the hierarchical graph index structure are traversed, and the highest level of each of these neighbors is determined. The tuples of these neighbors are then added to the priority queue. Simultaneously, each data block traversed for the first time (including candidate data blocks and their neighbors) is added to the traversal queue in the order in which it was traversed. This traversal queue will serve as the grouping result for subsequent data block merging and compression. During this process, when any candidate data block has traversed all its neighbors at a certain level, the relevant records for that data block at that level are cleared. Thus, after a single batch of data blocks has been traversed, all data block records in the hierarchical graph index structure are empty.

[0107] Let's take the five data blocks A to E mentioned above as an example to illustrate the retrieval process for a single batch of data blocks. Referring to Figure 7, when the hierarchical graph index retrieval module 103 first accesses data blocks A to E, the priority queue is empty, so it traverses sequentially starting from data block A. When data block A is accessed for the first time, data block A is added to the traversal sequence, and the tuple information <2, A> of data block A is added to the priority queue. In this way, the priority queue is not empty, and the tuple <2, A> can be dequeued from the priority queue. At this time, data block A is a candidate data block.

[0108] In this example, the depth-first search strategy is first applied, as shown in Figure 7(7a), starting from the highest level of data block A in the hierarchical graph index structure. Based on the tuple <2, A>, the highest level of data block A is level 2. Therefore, the hierarchical graph index retrieval module 103 first searches at level 2 and then, according to the breadth-first search strategy, traverses all neighbors of candidate data block A within the scope of level 2. Specifically, the neighbor of candidate data block A at level 2 is data block B. Since data block B is being traversed for the first time, it is added to the traversal sequence after data block A. Furthermore, based on the current tag array of data block B, the tuple <2, B> of data block B is obtained and added to the priority queue.

[0109] Next, since data block A has no other neighbors at level 2, the access to data block A at level 2 is now complete. The records related to data block A at level 2 are cleared, meaning that logically, data block A no longer exists at level 2. The tag array for data block A is then modified to {occurrence level: 0; maximum level: 0}, as shown in Figure 7(7b). At this point, attempting to obtain the maximum level information for data block A yields 0, and no more tuples can be obtained. Therefore, data block A is no longer added to the priority queue. The contents of the priority queue are in the order of (<2, B>), and the traversal queue is in the order of (A, B). Then, the first tuple <2, B> of the priority queue is dequeued, and data block B becomes a candidate data block. Data block B's neighbor at level 2 was originally data block A, but since the records related to data block A have been cleared at level 2, data block B has no other neighbors at level 2, and its access at level 2 is considered complete.

[0110] Then, the record of data block B at level 2 is cleared, and the tag array of data block B is modified accordingly to {occurrence level: 1; maximum level: 1}, as shown in Figure 7(7c). The maximum level of data block B is therefore 1, and the tuple <1, B> is added to the priority queue. Next, since the priority queue is not empty, the head tuple <1, B> is dequeued, and the neighbors of data block B at level 1 are traversed. Data block C is found to be a neighbor of data block B at level 1, so data block C is added to the traversal sequence. Based on the highest level recorded in the tag array of data block C, the tuple <2, C> is added to the priority queue.

[0111] At this point, data block B has no other neighbors at level 1. The tag array for data block B is modified to {occurrence level: 0; maximum level: 0}, as shown in Figure 7(7d). Data block B at level 1 becomes a one-way neighbor of data block C. Logically, data block B no longer exists at this level, so the current data block traversal sequence is (A, B, C) and the priority queue contains (<2, C>). Similarly, the first tuple <2, C> of the priority queue is dequeued. Data block C is then used as a candidate data block, and its neighbor at level 2, data block D, is added to the traversal sequence. At the same time, the tuple <2, D> of data block D is added to the priority queue. Data block C has now completed traversal at level 2. As shown in Figure 7(7d), the occurrence level of data block C becomes 1, the maximum level is synchronously updated to 1, and the tuple <1, C> of data block C is added to the priority queue. At this point, data block C at level 2 becomes a one-way neighbor of data block D, the data block traversal sequence is (A, B, C, D), and the priority queue contains (<2, D>, <1, C>) in order.

[0112] Similarly, the first tuple <2, D> is then dequeued from the priority queue. Data block D was already added to the traversal sequence when it was traversed as a neighbor of data block C, so it is not added again. Data block C, the neighbor of data block D at level 2, is logically considered not to exist at that level. Therefore, data block D has no neighbors at level 2, and the tag array of data block D is modified to {occurrence level: 0; maximum level: 0}, as shown in Figure 7(7e). At this point, data blocks C and D are logically considered not to exist at level 2, the data block traversal sequence is (A, B, C, D), and the priority queue contains (<1, C>). Next, the first tuple <1, C> is dequeued from the priority queue. The neighbors of data block C at level 1 are data blocks B and E. Data block B is logically considered not to exist, so only data block E is added to the traversal sequence. The highest level of data block E is 1, giving the tuple <1, E>, which is added to the priority queue. At this point, all the level-1 neighbors of data block C have been traversed. The occurrence level of data block C is updated to 0, and the maximum level is synchronously updated to 0, as shown in Figure 7(7f). Data block C at level 1 becomes a one-way neighbor of data block E, the current data block traversal sequence is (A, B, C, D, E), and the priority queue contains (<1, E>). The tuple <1, E> at the head of the priority queue is then dequeued. The neighbor of data block E at level 1 is data block C. Data block C is already in the traversal sequence and its highest level is 0, making it a one-way neighbor of data block E, so it does not need to be added to the priority queue. The occurrence level of data block E becomes 0, and the maximum level is synchronously updated to 0.
At this point, the hierarchical graph index structure is logically cleared and does not contain any data blocks. Its current data block traversal sequence is (A, B, C, D, E); the priority queue is empty.
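As an illustration and not a limitation, the retrieval process walked through above can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function name `traverse_batch`, the dictionary layout of the layers, and the handling of stale tuples are all hypothetical. The sketch also removes a cleared block from both ends of each edge, which is assumed to be equivalent to the "one-way neighbor" bookkeeping described in the text.

```python
import heapq


def traverse_batch(blocks, layers):
    """Return the traversal queue for one batch.

    blocks: block IDs in batch order.
    layers: {level: {block_id: set of neighbor IDs}}, one undirected graph per level.
    """
    # Copy the index so that clearing records does not modify the caller's data.
    layers = {lvl: {b: set(nbs) for b, nbs in adj.items()}
              for lvl, adj in layers.items()}

    def max_level(block):
        # Highest level that still holds a record of the block, or 0 if none.
        return max((lvl for lvl, adj in layers.items() if block in adj), default=0)

    traversal, seen, heap = [], set(), []

    def push(block):
        lvl = max_level(block)
        if lvl:
            # heapq is a min-heap: negate the level so higher levels come first;
            # equal levels fall back to the lexicographic order of the IDs.
            heapq.heappush(heap, (-lvl, block))

    for start in blocks:
        if start in seen or max_level(start) == 0:
            continue  # already traversed, or not present in the index
        seen.add(start)
        traversal.append(start)
        push(start)
        while heap:
            neg_lvl, cand = heapq.heappop(heap)
            adj = layers[-neg_lvl]
            if cand not in adj:
                continue  # stale tuple (assumption): this record was already cleared
            for nb in sorted(adj[cand]):
                if nb not in seen:
                    seen.add(nb)
                    traversal.append(nb)
                push(nb)
            # Clear the candidate's record at this level on both sides of each edge.
            for nb in adj.pop(cand):
                adj[nb].discard(cand)
            push(cand)  # re-queue at the next lower level, if any remains
    return traversal


# Blocks A to E as in the example: A-B and C-D at level 2, B-C and C-E at level 1.
# F and G are not in the index and are therefore skipped.
example_layers = {
    2: {"A": {"B"}, "B": {"A"}, "C": {"D"}, "D": {"C"}},
    1: {"B": {"C"}, "C": {"B", "E"}, "E": {"C"}},
}
print(traverse_batch(list("ABCDEFG"), example_layers))
```

Under these assumptions the sketch prints ['A', 'B', 'C', 'D', 'E'], which matches the traversal sequence obtained in the example.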

[0113] For example, the tag arrays of all data blocks are checked sequentially to determine whether any data block can still be selected (i.e., its highest level is not zero). After the check is complete, the data blocks that have not yet been selected are selected in sequence for analysis and retrieval. For instance, suppose data blocks A through G form a batch, the relationships among data blocks A through E are as shown above, and data blocks F and G have no similarity relationship with data blocks A through E. Since data blocks A through E have already been selected, data blocks F and G are selected in sequence and undergo the same analysis and retrieval process as data blocks A through E. If data blocks F and G have similar blocks, a corresponding traversal queue is obtained. If no similar blocks are found, they can be written directly to the persistent storage layer 200.

[0114] In this embodiment, after the hierarchical graph index retrieval module 103 completes the access and traversal of a batch of data blocks, the records for that batch of data blocks in the hierarchical graph index structure are empty. Therefore, the hierarchical graph index construction module 102 can retrieve the next batch of data blocks and reuse the hierarchical graph index structure to represent the similarity relationships and similarity levels between the next batch of data blocks. In this way, different batches of data blocks can reuse the memory resources occupied by the same hierarchical graph index structure during the entire data reduction process, reducing the repeated allocation of memory resources by different batch reduction tasks. That is, only one allocation of memory resources is needed for the hierarchical graph index structure, which can then be used to construct the index structures for multiple batches of data blocks. This reduces memory consumption and improves data reduction speed.

[0115] In this embodiment, since the hierarchical graph index structure intuitively represents the similarity between a data block and its various similar blocks, it is possible to directly traverse the similar blocks from high to low similarity without comparison calculation. This not only has low search overhead and high efficiency, but also reflects these similarity relationships through the traversal queue. Therefore, subsequent grouping and reduction of the traversal queue can be performed directly without additional calculation, resulting in a high reduction ratio.

[0116] Specifically, by traversing the hierarchical graph index structure, neighboring data blocks with high similarity in the traversal queue are adjacent to each other. Therefore, the data reduction module 104 does not need additional calculations and can directly group the data blocks in the traversal queue in order and perform data reduction on the groups to obtain a better data reduction ratio. As an example and not a limitation, the grouping granularity can be no less than 3 data blocks, and the remaining data blocks can also be grouped together if there are not enough data blocks left at the end.

[0117] For example, as shown in the figure, suppose the data block traversal sequence ABCDE obtained from the hierarchical graph index retrieval module 103 is grouped. With a group size of 5 data blocks, data blocks ABCDE form one group for merging and compression. With a group size of 3 data blocks, data blocks ABC form one group and data blocks DE form another. Among these, data blocks A and B are the most similar block pair, and data block B is also the most similar block to data block C. In addition, data block D is a block with slightly lower similarity to data block C, but it is the most similar block to data block E. It can be seen that by directly grouping at a preset granularity, highly similar blocks can still be grouped together, thereby achieving a better compression ratio in subsequent reduction processing.
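As an illustration, grouping the traversal queue at a preset granularity amounts to slicing it in order. The following is a minimal sketch; the function name `group_traversal` and the parameter `granularity` are hypothetical names introduced here.

```python
def group_traversal(queue, granularity=3):
    """Split the traversal queue into consecutive groups of `granularity` blocks.

    The last group simply holds whatever blocks remain.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    return [queue[i:i + granularity] for i in range(0, len(queue), granularity)]


print(group_traversal(list("ABCDE"), 3))
print(group_traversal(list("ABCDE"), 5))
```

Because adjacent entries of the traversal queue are assumed to be similar, no similarity computation is needed at this stage.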

[0118] In this example, the merge compression module 104 can copy the content of the data blocks within a single group to a buffer and then call the merge compression interface to merge and compress the content in the buffer. If the merge compression is successful, the reduction process for the entire batch of data is complete. If the merge compression fails, the module checks which groups failed, copies the data blocks of each failed group from the traversal sequence again, and retries the merge compression.

[0119] For example, because of constraints or performance trade-offs in the data processing device, an upper limit on the number of groups in the traversal queue or on the length of the compressed data blocks can be predefined to avoid placing an excessive burden on the system.

[0120] Next, based on the content described above, a data processing method provided by an embodiment of this application will be introduced. It is understood that this method is proposed based on the content described above, and some or all of the content of this method can be found in the description above.

[0121] For example, Figure 8 shows a schematic flowchart of a data processing method provided in an embodiment of this application. It should be understood that this method can be executed on any device, platform, equipment, or cluster of devices with computing power, such as the devices / device clusters shown in Figures 1, 3, and 5, but is not limited thereto. The following description uses execution on a computing device as an example. As shown in Figure 8, the method may include:

[0122] S801, determine the characteristic data of several data blocks.

[0123] In this embodiment, the computing device or cluster can receive data written by the user and divide it into data blocks. These data blocks can be characterized by their respective feature data, so that whether the content of the data blocks is similar, and / or how similar it is, can be determined from the feature data of the data blocks.

[0124] For example, the feature data of a data block may include a gradient fingerprint (GFP), which is a multi-dimensional vector formed by multiple feature values of the data block. These feature values can be calculated by applying different hash functions to the content of the data block, so that they reflect the content characteristics of the data block in different dimensions. If two data blocks have the same feature value in the same dimension of the GFP, the two data blocks are similar in that dimension. The more GFP dimensions in which two data blocks have the same feature value, the higher the degree of content similarity between them.
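As an illustration, one simple way to express the matching degree of two gradient fingerprints is to count the dimensions in which their feature values are equal. This is a sketch under that assumption; the embodiment may define the matching degree differently, and the name `gfp_matching_degree` is hypothetical.

```python
def gfp_matching_degree(gfp_a, gfp_b):
    """Count the dimensions in which two gradient fingerprints agree.

    Each fingerprint is a sequence of feature values, one per dimension.
    """
    if len(gfp_a) != len(gfp_b):
        raise ValueError("fingerprints must have the same number of dimensions")
    return sum(1 for x, y in zip(gfp_a, gfp_b) if x == y)


# Two 4-dimensional fingerprints that agree in dimensions 0 and 2.
print(gfp_matching_degree((17, 42, 8, 99), (17, 5, 8, 3)))
```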

[0125] S802, based on the gradient fingerprint matching degree of the data block pairs formed by several data blocks, generate a hierarchical graph index structure for several data blocks.

[0126] In this embodiment, these data blocks can form multiple data block pairs. For any data block pair, the similarity of the data block pair can be calculated by calculating the matching degree of its gradient fingerprint, thereby generating a corresponding hierarchical graph index structure.

[0127] For example, this hierarchical graph index structure includes multiple layers, each of which is an undirected, unweighted graph. Each layer can hold multiple data blocks, and two data blocks (i.e., data block pairs) that are similar to each other on each layer can be connected by edges. It should be noted that the edges between similar blocks are only used to indicate that they are similar. Furthermore, a data block can be similar to multiple other data blocks belonging to the same layer, and a data block can also be similar to multiple data blocks on different layers.

[0128] For example, the layer hierarchy of a hierarchical graph index structure can be used to characterize the similarity of pairs of data blocks contained in a layer. The higher the similarity of any pair of data blocks, the higher the layer hierarchy is distributed within the index structure. Of course, in some possible implementations, the higher the similarity of any pair of data blocks, the lower the layer hierarchy is.

[0129] In this way, by calculating the degree of matching of gradient fingerprints of multiple data block pairs in several data blocks, these data block pairs can be distributed to the corresponding levels of the hierarchical graph index structure. The level to which the data block pairs are distributed can be calculated using Equation 1 above, and will not be elaborated further here.

[0130] S803, following the hierarchical order of the multiple layers, traverse, from the hierarchical graph index structure, each second data block that forms a data block pair with the first data block.

[0131] In this embodiment, after the user-written data blocks are distributed to the corresponding layers of the hierarchical graph index structure according to the similarity of the block pairs, any data block (i.e., the first data block) among the above-mentioned data blocks can be used as a candidate block to search the hierarchical graph index structure for each similar block (i.e., each second data block) of the first data block, in order of similarity from high to low.

[0132] For example, during retrieval, similar blocks to the first data block can be searched layer by layer within the hierarchical graph index structure. Specifically, layers representing high similarity can be traversed first, followed by layers representing low similarity. This eliminates the need for GFP matching calculations between data block pairs, allowing for rapid identification of highly similar blocks. This high efficiency in similar block retrieval accelerates data reduction. Furthermore, since traversing this index structure also yields data block pairs with low similarity, and these pairs can undergo incremental compression, the retrieved data block pairs also contribute to improving the data reduction ratio.

[0133] S804, based on the traversal results, perform data reduction processing on the first and second data blocks.

[0134] In this embodiment, since the second data blocks are obtained by traversing in descending order of similarity, this traversal order can be recorded in the traversal results. This ensures that the second data blocks with higher similarity in this traversal order are closer to the first data block. Therefore, without additional calculations, grouping can be performed directly based on the traversal results. Adjacent data blocks with higher similarity are grouped together for reduction processing. Reduction processing includes, but is not limited to, Delta compression and / or merge compression, achieving a high reduction ratio. Specifically, a Delta compression algorithm can be used, with the first data block as the reference block and at least one second data block as the target block. The second data block is compressed based on the first data block; the greater the similarity between the reference block and the target block, the higher the compression ratio. Alternatively, a merge compression algorithm can be used, concatenating the first data block with at least one second data block before compression.
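As an illustration, the two reduction options can be approximated with Python's standard `zlib` module. This is a sketch only: the embodiment does not name a compression library, and using the reference block as a preset dictionary (`zdict`) is a stand-in for a Delta compression algorithm, not the algorithm itself. The function names are hypothetical.

```python
import random
import zlib


def merge_compress(blocks):
    """Merge compression: concatenate the blocks, then compress once."""
    return zlib.compress(b"".join(blocks))


def delta_compress(reference, target):
    """Compress `target` using `reference` as a preset dictionary."""
    comp = zlib.compressobj(zdict=reference)
    return comp.compress(target) + comp.flush()


def delta_decompress(reference, payload):
    """Restore the target block; the same reference block is required."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(payload) + decomp.flush()


# A pseudo-random reference block and a target that differs by a small insertion.
rng = random.Random(0)
reference = bytes(rng.getrandbits(8) for _ in range(4096))
target = reference[:2000] + b"changed" + reference[2000:]

delta = delta_compress(reference, target)
print(len(zlib.compress(target)), len(delta))
```

Because the target shares almost all of its content with the reference, the dictionary-based payload is much smaller than the target compressed on its own.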

[0135] In this embodiment, the computational and time overhead is small during the entire process of searching and reducing similar blocks, and the memory occupied by the hierarchical graph index structure is also small, which helps to reduce the impact on front-end I / O.

[0136] In some possible implementations, the data processing method provided in this application embodiment can be implemented as shown in Figure 9. Specifically, the method may include:

[0137] S900, divide the data written by the user into several data blocks.

[0138] In this step, the user-written data can be divided into several data blocks of equal size according to a preset granularity. The preset granularity can be 8 kibibytes (KiB), or a larger or smaller granularity, such as 4 KiB, 64 KiB, etc., but is not limited to these.
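As an illustration, fixed-size division can be sketched as follows. The function name `split_into_blocks` is hypothetical, and the sketch assumes that a final block shorter than the granularity is kept as it is, since the text does not say how a remainder is handled.

```python
def split_into_blocks(data, block_size=8 * 1024):
    """Divide a byte string into consecutive blocks of `block_size` bytes."""
    if block_size < 1:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


data = bytes(20000)
blocks = split_into_blocks(data)
print([len(b) for b in blocks])
```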

[0139] S901, determine the feature data of the data blocks, including similarity fingerprints and gradient fingerprints.

[0140] In this step, a hash function can be used to calculate the similarity fingerprint and gradient fingerprint of each data block.

[0141] For example, the similarity fingerprint (SFP) of a data block can characterize the similarity features of the data block's content. If two data blocks have the same SFP, then the content of the two data blocks is completely identical or partially identical. The similarity fingerprint of each data block can be calculated using a weak hashing algorithm, which makes it easier for data blocks to produce the same similarity fingerprint even when they share little duplicate content. Furthermore, the gradient fingerprint (GFP) is used to characterize the similarity features of the data block's content in different dimensions, and will not be elaborated further.

[0142] S902, sort the data blocks according to the lexicographical order of their similarity fingerprints to obtain the block sequence.

[0143] In this step, data blocks can be sorted lexicographically according to their similarity fingerprints (SFPs) to obtain a data block sequence. If data blocks have identical similarity fingerprints (SFPs), they are sorted lexicographically. In this way, in the resulting block sequence, data blocks with the same similarity fingerprint (SFP) are placed in adjacent positions, meaning that potentially similar data blocks are adjacent to each other.

[0144] S903, from the block sequence, determine a block subset in which every data block forms a data block pair with at least one other data block, and the data blocks in each data block pair have the same similarity fingerprint.

[0145] In this step, a block sequence can be treated as a batch, so that a corresponding hierarchical graph index structure can be constructed for the entire block sequence. Before the index structure is constructed, similarity fingerprint (SFP) matching can be performed between the data blocks in the block sequence to select the data blocks that can form a data block pair with at least one other data block. This selects every data block in the block sequence that shares a similarity fingerprint with another data block, resulting in a subsequence, or block subset. For example, if the block sequence includes data blocks A through G, where data blocks A through E have identical similarity fingerprints, and data blocks F and G do not match any other block in the sequence, then data blocks A through E form a block subset. The order of data blocks A through E in the block subset is the same as their order in the block sequence.
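As an illustration, steps S902 and S903 can be sketched together. The function name `select_block_subset` and the fingerprint values are hypothetical, and breaking ties between equal fingerprints by block ID is an assumption made for this sketch.

```python
from collections import Counter


def select_block_subset(sfp_by_block):
    """Sort blocks by similarity fingerprint and keep those whose SFP is shared.

    sfp_by_block: {block_id: similarity fingerprint}.
    Returns (block sequence, block subset), both as lists of block IDs.
    """
    # Equal fingerprints end up adjacent; ties are broken by block ID (assumption).
    sequence = sorted(sfp_by_block, key=lambda b: (sfp_by_block[b], b))
    counts = Counter(sfp_by_block.values())
    # A block stays only if at least one other block has the same fingerprint.
    subset = [b for b in sequence if counts[sfp_by_block[b]] > 1]
    return sequence, subset


sfps = {"A": "s1", "B": "s1", "C": "s1", "D": "s1", "E": "s1", "F": "s2", "G": "s3"}
sequence, subset = select_block_subset(sfps)
print(sequence, subset)
```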

[0146] In this way, in large-scale storage scenarios, SFP filtering can be used to select the data blocks that have similar blocks for the subsequent duplicate data analysis based on the hierarchical graph index structure, while data blocks that are unlikely to have similar blocks are excluded from the analysis. This helps to reduce the number of data blocks in the corresponding index structure, avoids unnecessary calculations that waste resources, and also helps to improve the reduction speed.

[0147] For example, since the similarity fingerprints of data blocks in this example are weak hash fingerprints, the same similarity fingerprint is easily generated even when data blocks share little duplicate content. Therefore, in this step, some data block pairs with low similarity can also be selected by the filter, which helps to improve the subsequent data reduction ratio.

[0148] S904, use a preset sliding window to calculate the gradient fingerprint matching degree between each data block in the block subset and the subsequent data blocks located within the sliding window.

[0149] In this step, the sliding window size can be pre-configured to W (W≥2) data blocks. Starting from the first data block in the subset, the similarity between each block and the subsequent data blocks within the sliding window is calculated, thus obtaining the similarity of corresponding data block pairs. In this way, for any given data block, only its similarity with the next W data blocks is calculated (equivalent to calculating the similarity of each data block (excluding the first and last data blocks) with its previous W and next W data blocks), thereby controlling the number of similarity calculations for each data block and reducing computational and memory overhead.
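As an illustration, the pairs produced by the sliding window can be enumerated as follows. The function name `window_pairs` is hypothetical; the sketch only lists which pairs would be compared and leaves the matching degree calculation out.

```python
def window_pairs(subset, window):
    """Pair each block with at most `window` blocks that follow it in the subset."""
    pairs = []
    for i, block in enumerate(subset):
        for other in subset[i + 1:i + 1 + window]:
            pairs.append((block, other))
    return pairs


pairs = window_pairs(list("ABCDE"), 2)
print(pairs)
```

With n blocks and a window of W, fewer than n × W pairs are compared, instead of the n × (n − 1) / 2 pairs of an all-pairs comparison.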

[0150] S905, generate a corresponding hierarchical graph index structure based on the gradient fingerprint matching degree of each data block pair.

[0151] In this step, an initialized hierarchical graph index structure can be pre-constructed. This structure has layers with preset hierarchical levels, each layer has an upper limit on the number of data blocks it can hold, and a hierarchy coefficient is set for the initialized hierarchical graph index structure. Then, using Equation 1 above, the layer to which each data block pair should be distributed can be calculated from the gradient fingerprint matching degree of the data block pair and the hierarchy coefficient. The data blocks of each pair are then added to the corresponding layer, and an edge is used to connect the two data blocks of a data block pair, indicating that the two data blocks are neighbors.

[0152] Furthermore, for example, a tag array can be generated for each data block in the hierarchical graph index structure. The tag array records all the layer levels at which the data block appears in the hierarchical graph index structure, together with the highest of those layer levels. For example, if data block A is located at layer 6 and layer 3 in the hierarchical graph index structure, its tag array can be represented as {occurrence level: 6, 3; maximum level: 6}.
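As an illustration, a tag array can be modeled as a small record that is rebuilt whenever a level is cleared. The names `build_tag_array` and `clear_level` and the dictionary keys are hypothetical. The sketch represents "no remaining level" as an empty list with a maximum level of 0, whereas the text writes it as occurrence level 0.

```python
def build_tag_array(levels):
    """Record every level at which a block appears, plus the highest one."""
    occurrence = sorted(set(levels), reverse=True)
    return {
        "occurrence_levels": occurrence,
        "maximum_level": occurrence[0] if occurrence else 0,
    }


def clear_level(tag_array, level):
    """Return a new tag array after the block's record at `level` is cleared."""
    remaining = [lvl for lvl in tag_array["occurrence_levels"] if lvl != level]
    return build_tag_array(remaining)


tag_a = build_tag_array([3, 6])
print(tag_a)
print(clear_level(tag_a, 6))
```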

[0153] In this embodiment, after obtaining the hierarchical graph index structure of the block subset, similar blocks of each data block in the block subset can be searched based on the hierarchical graph index structure. Specifically, this method further includes the following steps S906 to S908 to realize the query for similar blocks.

[0154] S906: Select the first data block from several data blocks and add it to the first queue.

[0155] In this step, after obtaining the hierarchical graph index structure of the block subset, data blocks within the block subset can be selected sequentially as candidate data blocks (i.e., the first data block) to search for all similar blocks of the first data block from high to low similarity in the hierarchical graph index structure.

[0156] For example, for a subset of blocks, a data block can be selected sequentially for the subsequent similar block retrieval process. After completing the similar block retrieval process for this data block, it is marked, and then the next data block is selected sequentially to perform the similar block retrieval process. In this way, if a data block has already been marked when it is selected, it is skipped and the next data block is selected instead, thereby avoiding repeated retrieval of data blocks.

[0157] For example, the first candidate data block accessed can be added to the traversal queue (i.e., the first queue), and the data blocks in the first queue are sorted according to the order in which they were added. It should be understood that adding a candidate data block to the first queue can actually be done by adding the candidate data block's ID to the queue.

[0158] In this embodiment, after identifying candidate data blocks, a search can be performed using a strategy of depth-first search between levels and breadth-first search within levels. That is, all neighbors of the candidate data block at higher levels are analyzed first, and neighbors at lower levels are searched only after all higher-level neighbors have been visited. Furthermore, if a candidate data block at a lower level also exists at a higher level, the search jumps back to that higher level to quickly find the most valuable similar blocks for reduction, thus improving the reduction effect.

[0159] For example, after step S906, this method may specifically include:

[0160] S907, based on the tag array of the first data block, add the first tuple information to the second queue.

[0161] In this example, for the first data block as a candidate data block, its tag array can be obtained first, and the tuple information of the first data block (i.e., the first tuple information) can be generated from this array. This tuple information is then added to the priority queue (i.e., the second queue). The second queue stores the tuple information of the data blocks to be accessed subsequently. The tuple information includes the highest layer level of the data block in the hierarchical graph index structure.

[0162] Furthermore, as a concrete example, the priority queue is arranged according to the first element of the tuple, that is, sorted from high to low hierarchy of data blocks. If data blocks are at the same hierarchy, they are sorted lexicographically by their data block IDs. The head of the priority queue contains candidate nodes for the next access.
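As an illustration, the ordering rule of the second queue can be sketched with Python's standard `heapq` module. The helper names `enqueue` and `dequeue` are hypothetical. Since `heapq` is a min-heap, the level is negated so that higher levels are dequeued first, and equal levels fall back to the lexicographic order of the block IDs.

```python
import heapq


def enqueue(queue, level, block_id):
    """Add the tuple <level, block_id> to the priority queue."""
    heapq.heappush(queue, (-level, block_id))


def dequeue(queue):
    """Remove and return the head tuple as (level, block_id)."""
    neg_level, block_id = heapq.heappop(queue)
    return -neg_level, block_id


second_queue = []
for level, block_id in [(1, "C"), (2, "D"), (2, "B")]:
    enqueue(second_queue, level, block_id)

order = [dequeue(second_queue) for _ in range(3)]
print(order)
```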

[0163] S908, if the first tuple information is located at the head of the second queue, then in the highest layer described by the first tuple information, the second data block corresponding to the first data block is traversed, and the second data block traversed for the first time is added to the first queue, and the tuple information of each traversed second data block is added to the second queue.

[0164] In this example, if the tuple information of the first data block is at the head of the second queue, the tuple information is dequeued, the first data block is taken as a candidate data block, and the similar blocks of the first data block are traversed in the highest layer described by the tuple information, that is, the corresponding second data blocks are traversed.

[0165] Furthermore, in the highest layer of the current traversal, each second data block encountered for the first time is added to the first queue in traversal order. In addition, for each second data block that is traversed, its own tuple information is generated from its tag array and added to the second queue, which is then sorted according to the sorting method of the second queue.

[0166] For example, as shown in Figure 7(7a), if data block A is a candidate data block, then in its highest layer of distribution, i.e., the second layer, similar blocks that are neighbors are queried. At this time, the neighboring data block B is traversed, and since data block B is being visited for the first time, it is added to the first queue, and the tuple <2, B> of data block B is added to the second queue.

[0167] S909, clear the records of the first data block in the highest layer, and update the tag array of the first data block.

[0168] In this example, after traversing all similar blocks in the current highest layer of the first data block, the record of the first data block in that highest layer is cleared, making it logically non-existent in that layer. The tag array of the first data block is then updated. For example, as shown in Figure 7(7b), data block A only appears in the second layer of the index structure. After clearing its record in the second layer, the tag array of data block A is then modified to {occurrence level: 0; maximum level: 0}.

[0169] S910, based on the updated tag array, generate the second tuple information and add it to the second queue.

[0170] In this example, the tag array of the first data block is updated. If the highest layer level in the updated tag array is not 0, new tuple information (also referred to as second tuple information in this article) is generated and added to the second queue for sorting.

[0171] For example, if the highest layer level in the updated tag array of the first data block is 0, then new tuple information cannot be obtained, or in other words, the tuple information is unavailable. In this case, it is considered that the traversal of similar blocks of the first data block has been completed.

[0172] For example, as shown in Figure 7(7b), when the updated tag array of candidate data block A is {occurrence level: 0; maximum level: 0}, no tuple information can be generated. However, as shown in Figure 7(7c), when the candidate data block is data block B, after its highest level has been traversed, the tag array is updated to {occurrence level: 1; maximum level: 1}. Since the highest level is not 0, new tuple information <1, B> is generated and added to the priority queue for sorting.

[0173] S911, if the second tuple information is located at the head of the second queue, traverse, in the highest layer described by the second tuple information, the second data blocks corresponding to the first data block, until all the second data blocks corresponding to the first data block are obtained.

[0174] In this example, if the second tuple information of the first data block is at the head of the second queue, then the second tuple information is dequeued, and the first data block continues as a candidate data block, thus traversing similar blocks in the highest layer described by the second tuple information. Similarly, the second data block accessed for the first time in this layer is added to the first queue, and the tuple information of the second data block is added to the second queue. After completing the traversal of the current highest level, the record of the first data block on this layer is cleared.

[0175] It should be noted that if the second tuple information is not found at the head of the second queue, the head tuple will be dequeued, and the data block to which the dequeued tuple belongs will be used as a candidate data block for the retrieval process as described in S909 to S911 above.

[0176] Furthermore, after completing the traversal of similar blocks of the first data block, if the second queue is empty, the next data block is selected sequentially from the block subset to perform the retrieval process as described in S909 to S911 above.

[0177] In this way, after finding similar blocks for all data blocks in the block subset from the hierarchical graph index structure, a first queue is obtained to record the access order of each data block that is accessed for the first time. Since the entire similar block retrieval stage is carried out according to the principle of depth-first between hierarchies and then breadth-first within hierarchies, the data blocks with higher similarity in the first queue are closer to each other.

[0178] Therefore, in this embodiment, instead of relying entirely on similar fingerprints to query similar blocks, it uses weak hash fingerprints like SFP to identify potentially similar data block pairs. These pairs are then placed in a hierarchical graph index structure for similarity analysis, allowing more data block pairs with low similarity to participate in the analysis. The analysis is performed by directly traversing the hierarchical graph index structure. This avoids caching all data block metadata in memory, enabling high concurrency or memory reuse to save resources. Furthermore, it minimizes computational complexity, obtaining the similarity relationships between similar block pairs and reducing the difficulty of finding similar blocks. This not only improves search efficiency but also enhances the data reduction ratio without incurring excessive computational overhead, thus having a minimal impact on the main I / O process.

[0179] S912, divide all data blocks in the first queue into groups according to a preset granularity.

[0180] In this step, since data blocks with higher similarity are closer to each other in the first queue, the data blocks in the first queue can be grouped sequentially without any additional calculation, and data reduction can be performed group by group to obtain a better data reduction ratio. By way of example and not limitation, the grouping granularity can be no less than 3 data blocks. For example, as shown in the figure, after data blocks A to E are searched from the hierarchical graph index structure, the first queue includes (A, B, C, D, E). According to the preset granularity, A, B, and C can be placed in one group, and D and E can be placed in another group. Here, data blocks A and B are the most similar pair, data block C is slightly less similar to data block B, and data block C is also the most similar to data block D, with a low similarity to data block E. It can be seen that grouping directly by the preset granularity still places highly similar blocks in the same group, thereby obtaining a better compression ratio in the subsequent reduction process.
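The sequential grouping in S912 can be illustrated with a minimal sketch. The function name `group_by_granularity` is hypothetical, and the sketch assumes the preset granularity is simply the maximum number of blocks per group, with the last group allowed to be smaller.

```python
def group_by_granularity(first_queue, granularity):
    """Split the first queue into consecutive groups of at most `granularity` blocks."""
    if granularity < 1:
        raise ValueError("granularity must be a positive integer")
    return [
        first_queue[i:i + granularity]
        for i in range(0, len(first_queue), granularity)
    ]


# First queue from the example in the text, with a granularity of 3 blocks.
print(group_by_granularity(["A", "B", "C", "D", "E"], 3))
# -> [['A', 'B', 'C'], ['D', 'E']]
```

Because the queue order already reflects similarity, the grouping itself needs no similarity computation; it is a plain slice of the sequence.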

[0181] S913, perform data reduction on the data blocks in each group.

[0182] In this step, the contents of the data blocks within a single group can be copied to a buffer. Then, the merge and compression interface is called to merge and compress the contents of the data blocks in the buffer. If the merge and compression succeeds, the reduction process for the entire batch of data is completed. If the merge and compression fails, the groups that failed are identified, and the data blocks of each failed group are copied again from the traversal sequence for another merge and compression attempt, thereby deleting redundant data.
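The buffer copy and merged compression in S913 can be sketched as below. The text does not specify the merge and compression interface, so Python's standard `zlib.compress` is used here purely as a stand-in; the function name `merge_and_compress` and the `max_attempts` parameter are hypothetical.

```python
import random
import zlib


def merge_and_compress(group_blocks, max_attempts=2):
    """Copy the blocks of one group into a buffer and compress the buffer as a whole.

    zlib is only a stand-in for the unspecified merge and compression interface.
    """
    for _ in range(max_attempts):
        buffer = bytearray()
        for content in group_blocks:  # copy each block's content into the buffer
            buffer += content
        try:
            return zlib.compress(bytes(buffer))
        except zlib.error:
            continue  # copy the group again and retry
    raise RuntimeError("merge and compression failed for this group")


# Two similar 4 KiB blocks: the second differs from the first in one byte.
rng = random.Random(0)
block_a = bytes(rng.getrandbits(8) for _ in range(4096))
block_b = block_a[:100] + b"x" + block_a[101:]

merged = merge_and_compress([block_a, block_b])
separate = len(zlib.compress(block_a)) + len(zlib.compress(block_b))
print(len(merged) < separate)
```

Compressing similar blocks together lets the compressor encode the second block mostly as references to the first, which is why grouping similar blocks before compression tends to give a better ratio than compressing each block alone.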

[0183] In some possible implementations, in large-scale storage scenarios, this embodiment can further divide the user-written data into blocks and then perform finer-grained batch processing. This allows for the construction of an index structure and the performance of retrieval analysis and reduction processing only on a single batch of data blocks each time, resulting in low resource consumption. Therefore, specifically, as shown in Figure 10, this method can include:

[0184] S1001, divide several data blocks into multiple batches.

[0185] In this step, similarity fingerprints and gradient fingerprints can be calculated for several data blocks, and these data blocks are sorted lexicographically according to the similarity fingerprints to obtain a block sequence. Then, aggregation conditions are determined for this block sequence. The aggregation conditions are used to select, from the block sequence, blocks that meet the conditions to form a smaller-granularity subsequence, which is treated as a batch. As an example, the aggregation conditions may include, but are not limited to: the data blocks in a batch have the same first byte of the similarity fingerprint (SFP), and the total size of the data blocks in a single batch does not exceed the capacity of the corresponding buffer.

[0186] For example, if the first byte of the similarity fingerprints of data blocks A to E in the block sequence is the same, and the buffer can hold a maximum of 2 data blocks, then according to the aggregation conditions, data blocks A and B form one batch, data blocks C and D form another batch, and the remaining data block E forms a batch by itself.
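The batch division in S1001 can be sketched as follows. The `Block` type, its field names, and the function `divide_into_batches` are hypothetical; the sketch assumes both example aggregation conditions apply together (same first SFP byte and total size within the buffer capacity) and that a block larger than the buffer is placed in a batch of its own.

```python
from typing import NamedTuple


class Block(NamedTuple):
    name: str
    sfp: bytes  # similarity fingerprint
    size: int   # block size, in the same unit as the buffer capacity


def divide_into_batches(blocks, buffer_capacity):
    """Sort blocks by SFP, then cut the sequence into batches.

    A batch is closed when the next block has a different first SFP byte
    or would push the batch past the buffer capacity.
    """
    batches, current, used = [], [], 0
    for block in sorted(blocks, key=lambda b: b.sfp):
        same_prefix = not current or current[-1].sfp[:1] == block.sfp[:1]
        fits = used + block.size <= buffer_capacity
        if current and not (same_prefix and fits):
            batches.append(current)
            current, used = [], 0
        current.append(block)
        used += block.size
    if current:
        batches.append(current)
    return batches


# Blocks A to E share the first SFP byte; the buffer holds two unit-size blocks.
blocks = [Block(n, bytes([1, i]), 1) for i, n in enumerate("ABCDE")]
for batch in divide_into_batches(blocks, buffer_capacity=2):
    print([b.name for b in batch])
```

For the example in the text, this yields the batches (A, B), (C, D), and (E).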

[0187] S1002, add the data block pairs in the first batch of data blocks to the hierarchical graph index structure according to the gradient fingerprint matching degree.

[0188] In this step, an initialized hierarchical graph index structure is generated in advance. After the batch division of the data block sequence is completed, data blocks are acquired batch by batch for processing. Specifically, for any single batch of data blocks acquired (i.e., the first batch), the gradient fingerprint matching degree of each data block pair in that batch is calculated, and each data block pair in that batch is added to the hierarchical graph index structure according to Equation 1 above. The specific process can also be found in the description of steps S602 or S705 above, and will not be repeated here.

[0189] S1003, according to the hierarchical order of the multiple layers, traverse each second data block that forms a data block pair with the first data block from the hierarchical graph index structure, and clear the record of the first data block in the hierarchical graph index structure after obtaining each second data block.

[0190] In this step, after constructing the hierarchical graph index structure corresponding to a batch of data blocks, blocks are selected sequentially for that batch of data blocks, and a similar block search is performed from the hierarchical graph index structure. The principle of this step is similar to the description in S603 or S706 to S711 above, and will not be repeated here.

[0191] S1004, after clearing all data block records of the first batch in the hierarchical graph index structure, add the data block pairs of the second batch to the hierarchical graph index structure according to the gradient fingerprint matching degree.

[0192] In this step, since the record of each candidate data block is cleared after all of its similar blocks have been found in the hierarchical graph index structure, the hierarchical graph index structure is empty once all similar blocks of a batch of data blocks have been traversed through the index structure. Then, the subsequent reduction processing flow begins based on the traversal results of that batch, and the data block pairs of the next batch (i.e., the second batch) can be added to the hierarchical graph index structure according to the gradient fingerprint matching degree, thus reusing the index structure.
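The reuse of a single index structure across batches can be sketched as follows. The class `HierarchicalGraphIndex`, its methods, and `process_batches` are hypothetical names, and the layer number of each pair is passed in directly because the mapping from matching degree to layer (Equation 1) is not reproduced here.

```python
class HierarchicalGraphIndex:
    """A fixed number of layers, allocated once and reused for every batch."""

    def __init__(self, num_layers):
        self.layers = [dict() for _ in range(num_layers)]

    def add_pair(self, a, b, level):
        """Record the pair (a, b) on the given layer."""
        self.layers[level].setdefault(a, set()).add(b)
        self.layers[level].setdefault(b, set()).add(a)

    def take_similar(self, block):
        """Return the partners of `block`, highest layer first, and clear its records."""
        found = []
        for adjacency in reversed(self.layers):
            for other in sorted(adjacency.pop(block, ())):
                adjacency[other].discard(block)
                if not adjacency[other]:
                    del adjacency[other]
                if other not in found:
                    found.append(other)
        return found

    def is_empty(self):
        return all(not adjacency for adjacency in self.layers)


def process_batches(index, batches):
    """Each batch is a list of (block_a, block_b, level) pairs."""
    results = []
    for pairs in batches:
        if not index.is_empty():
            raise RuntimeError("records of the previous batch were not cleared")
        for a, b, level in pairs:
            index.add_pair(a, b, level)
        blocks = sorted({x for a, b, _ in pairs for x in (a, b)})
        results.append({blk: index.take_similar(blk) for blk in blocks})
    return results


index = HierarchicalGraphIndex(num_layers=3)
print(process_batches(index, [[("A", "B", 2), ("B", "C", 1)], [("D", "E", 0)]]))
```

Since every block's records are removed as soon as its partners have been collected, the index is empty at the end of each batch, so the same object can take the next batch without a new allocation.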

[0193] In this way, since the number of layers and the number of blocks that the hierarchical graph index structure can accommodate are fixed, memory needs to be allocated for the index structure only once during the search for similar blocks across all batches of data blocks. This avoids repeated allocation of memory resources, thereby reducing the impact on front-end services and improving data reduction speed. Accordingly, this application provides the following experimental data. Table 1 shows the results of experiments in which front-end I/O of an all-flash storage system (Dorado) was used to generate multiple datasets of different sizes for verification. As shown in Table 1, for datasets of 50 GB, 100 GB, and even 1 TB, if the all-flash storage system Demo (Demonstration) is used to process these datasets directly, the reduction time of the storage system Demo increases from 53 s to 1057 s as the dataset size increases, which is quite time-consuming. By using the hierarchical graph traversal scheme described in the embodiments of this application, the dataset reduction time can be significantly reduced compared to the storage system Demo. The data reduction ratio can be improved by 7.54% on datasets of different sizes, and the data reduction speed is improved by 50.9% on average. When the generated dataset has a single, fixed pattern, the scheme shows a relatively stable reduction ratio and reduction ratio improvement rate.

[0194] Table 1

[0195] This application also provides the experimental results shown in Table 2 below. The experiment used five open-source datasets, each representing a different data reduction application scenario: Taxi represents user scenarios, Vsi represents virtual machine office scenarios, SOF represents database backup scenarios, Linux represents source code storage scenarios, and VM represents user virtual machine storage scenarios. For these five datasets, the reduction time of the aforementioned storage system Demo fluctuated significantly and was long in all scenarios, because a single reduction scheme typically cannot adapt to all scenarios. By contrast, the hierarchical graph traversal-based scheme described in the embodiments of this application still shows a significant improvement in reduction time on these five datasets compared to the storage system Demo. Furthermore, its reduction ratio decreases relative to the storage system Demo only in a very few scenarios (such as Linux); in most scenarios, the reduction ratio is significantly improved.

[0196] Table 2

[0197] As can be seen from Tables 1 and 2 above, the hierarchical graph traversal-based scheme described in this application embodiment has superior reduction speed and reduction ratio in processing different scenarios and datasets of different sizes, and its processing performance is stable.

[0198] This application also provides a computing device 10. As shown in Figure 11, the computing device 10 includes a bus 102, a processor 110, a memory 120, and a communication interface 130a. The processor 110, the memory 120, and the communication interface 130a communicate with each other via the bus 102. The computing device 10 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 10.

[0199] Bus 102 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, only one line is used in Figure 11, but this does not imply that there is only one bus or one type of bus. A bus can include pathways for transmitting information between various components of the computing device 10 (e.g., memory 120, processor 110, communication interface 130a).

[0200] The processor 110 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).

[0201] The memory 120 may include volatile memory, such as random access memory (RAM). The memory 120 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid state drive (SSD).

[0202] The memory 120 stores executable program code, which the processor 110 executes to implement the functions of the aforementioned feature data extraction module 101, hierarchical graph index construction module 102, hierarchical graph index retrieval module 103, and data reduction module 104, thereby realizing the aforementioned data processing method. In other words, the memory 120 stores instructions for executing this data processing method.

[0203] The communication interface 130a uses transceiver modules such as, but not limited to, network interface cards and transceivers to enable communication between the computing device 10 and other devices or communication networks.

[0204] This application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.

[0205] As shown in Figure 12, the computing device cluster includes at least one computing device 10. The memory 120 of one or more computing devices 10 in the computing device cluster may store the same instructions for executing the data processing method.

[0206] In some possible implementations, the memory 120 of one or more computing devices 10 in the computing device cluster may each store only part of the instructions for executing the data processing method. In other words, a combination of one or more computing devices 10 can jointly execute the instructions for executing the data processing method.

[0207] In some possible implementations, one or more computing devices in a computing device cluster can be connected via a network. This network can be a wide area network (WAN) or a local area network (LAN), etc. Figure 13 illustrates one possible implementation. As shown in Figure 13, two computing devices 10A and 10B are connected via a network. Specifically, they are connected to the network through communication interfaces in each computing device. In this type of possible implementation, the memory 120 in computing device 10A stores instructions for executing the functions of the feature data extraction module 101, the hierarchical graph index construction module 102, and the hierarchical graph index retrieval module 103. Simultaneously, the memory 120 in computing device 10B stores instructions for executing the functions of the data reduction module 104.

[0208] It should be understood that the functions of computing device 10A shown in Figure 13 can also be performed by multiple computing devices 10. Similarly, the functions of computing device 10B can also be performed by multiple computing devices 10.

[0209] This application also provides a computer program product containing instructions. The computer program product may be a software or program product containing instructions, capable of running on a computing device or stored on any usable medium. When the computer program product is run on at least one computing device, it causes the at least one computing device to perform a data processing method.

[0210] This application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium that a computing device can store, or a data storage device such as a data center containing one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that instruct the computing device to perform a data processing method.

[0211] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the protection scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data processing method, characterized in that the method comprises: determining feature data of several data blocks, the feature data including gradient fingerprints, the gradient fingerprints being used to characterize the content features of the data blocks in multiple dimensions; generating, based on the gradient fingerprint matching degree of the data block pairs formed by the data blocks, a hierarchical graph index structure of the data blocks, wherein the hierarchical graph index structure includes multiple layers of different levels, the multiple layers are used to distribute the data block pairs, and the gradient fingerprint matching degree of each data block pair corresponds to the level of the layer in which the data block pair is distributed; traversing, according to the hierarchical order of the multiple layers, each second data block that forms a data block pair with a first data block from the hierarchical graph index structure, wherein the first data block is one of the several data blocks; and performing, based on the traversal results, data reduction processing on the first data block and the second data blocks.

2. The method according to claim 1, characterized in that: The feature data also includes similar fingerprints. After determining the characteristic data of several data blocks, the method includes: The data blocks are sorted lexicographically according to the similar fingerprints to obtain a block sequence; The step of generating a hierarchical graph index structure for the data blocks based on the gradient fingerprint matching degree of the data block pairs formed by the data blocks includes: The block sequence is calculated using a preset sliding window to obtain the gradient fingerprint matching degree of each data block pair located within the sliding window. The size of the sliding window is W data blocks, where W≥2. Based on the gradient fingerprint matching degree of each data block pair, a corresponding hierarchical graph index structure is generated.

3. The method according to claim 1 or 2, characterized in that: The step of generating a hierarchical graph index structure for the data blocks based on the gradient fingerprint matching degree of the data block pairs formed by the data blocks includes: From the plurality of data blocks, a subset of blocks is determined, wherein any one of the data blocks in the subset of blocks is a data block pair with at least one other data block, and the data blocks in the data block pairs have the same similar fingerprint; Based on the gradient fingerprint matching degree of each data block pair in the block subset, a corresponding hierarchical graph index structure is generated.

4. The method according to any one of claims 1-3, characterized in that: The data blocks are divided into multiple batches, including a first batch and a second batch. The method includes: Construct the initialized hierarchical graph index structure; The data block pairs in the first batch of data blocks are added to the hierarchical graph index structure according to the gradient fingerprint matching degree; According to the hierarchical order of the multiple layers, each second data block that is a data block pair with the first data block is traversed from the hierarchical graph index structure, and after obtaining each second data block, the record of the first data block in the hierarchical graph index structure is cleared. The first data block is one of the data blocks in the first batch. If all data block records of the first batch are cleared in the hierarchical graph index structure, the data block pairs of the second batch are added to the hierarchical graph index structure according to the gradient fingerprint matching degree.

5. The method according to claim 4, characterized in that: In each batch, the first element of the similar fingerprints of the data blocks is the same, and / or the total size of the data blocks in each batch does not exceed the capacity of the buffer used to process that batch.

6. The method according to any one of claims 1-5, characterized in that, Each data block has a tag array that records the various layer levels to which the data block belongs in the hierarchical graph index structure; The step of traversing each second data block that is a data block pair with the first data block from the hierarchical graph index structure according to the hierarchical order of the multiple layers includes: Select the first data block from the plurality of data blocks; According to the hierarchical order of each layer described by the tag array of the first data block, the first data block and each corresponding second data block are traversed layer by layer, wherein the traversal order of each data block first traversed from the hierarchical graph index structure is recorded by the first queue. After traversing all the second data blocks on any of the layers, clear the records on that layer related to the first data block.

7. The method according to claim 6, characterized in that, The step of traversing each second data block that is a data block pair with the first data block from the hierarchical graph index structure according to the hierarchical order of the multiple layers includes: According to the tag array of the first data block, the first tuple information is added to the second queue. The second queue is used to store the tuple information of the data block to be traversed. The tuple information includes the number information of the highest layer level of the data block in the hierarchical graph index structure. The first tuple information is the tuple information of the first data block. If the first tuple information is located at the head of the second queue, then in the highest layer described by the first tuple information, the second data block corresponding to the first data block is traversed, and the second data block traversed for the first time is added to the first queue, and the tuple information of each second data block traversed is added to the second queue. Clear the records of the first data block in the highest layer and update the tag array of the first data block; Based on the updated tag array, generate second tuple information and add it to the second queue; If the second tuple information is located at the head of the second queue, then the highest layer described by the second tuple information is traversed to obtain the second data blocks corresponding to the first data block until all the second data blocks corresponding to the first data block are obtained.

8. The method according to claim 6 or 7, characterized in that, The step of performing data reduction processing on the first data block and the second data block based on the traversal results includes: All data blocks in the first queue are divided into groups according to a preset granularity, and at least one group includes the first data block and at least one second data block; The data blocks in each group are reduced in size.

9. A data processing apparatus, characterized in that the apparatus comprises: a feature data extraction module, used to determine the feature data of several data blocks, the feature data including gradient fingerprints, which are used to characterize the content features of the data blocks in multiple dimensions; a hierarchical graph index construction module, used to generate a hierarchical graph index structure for the data blocks based on the gradient fingerprint matching degree of the data block pairs formed by the data blocks, wherein the hierarchical graph index structure includes multiple layers of different levels, the multiple layers are used to distribute the data block pairs, and the gradient fingerprint matching degree of each data block pair corresponds to the level of the layer in which the data block pair is distributed; a hierarchical graph index retrieval module, used to traverse, according to the hierarchical order of the multiple layers, each second data block that forms a data block pair with a first data block from the hierarchical graph index structure, wherein the first data block is one of the several data blocks; and a data reduction module, used to perform data reduction processing on the first data block and the second data blocks based on the traversal results.

10. The apparatus according to claim 9, characterized in that: The feature data also includes similar fingerprints. The feature data extraction module is further configured to sort each of the data blocks according to the lexicographical order of the similar fingerprints to obtain a block sequence; The hierarchical graph index construction module is used for: The block sequence is calculated using a preset sliding window to obtain the gradient fingerprint matching degree of each data block pair located within the sliding window. The size of the sliding window is W data blocks, where W≥2. Based on the gradient fingerprint matching degree of each data block pair, a corresponding hierarchical graph index structure is generated.

11. The apparatus according to claim 9 or 10, characterized in that: The hierarchical graph index construction module is used for: From the plurality of data blocks, a subset of blocks is determined, wherein any one of the data blocks in the subset of blocks is a data block pair with at least one other data block, and the data blocks in the data block pairs have the same similar fingerprint; Based on the gradient fingerprint matching degree of each data block pair in the block subset, a corresponding hierarchical graph index structure is generated.

12. The apparatus according to any one of claims 9-11, characterized in that: The data blocks are divided into multiple batches, including a first batch and a second batch. The hierarchical graph index construction module is used for: Construct the initialized hierarchical graph index structure; The data block pairs in the first batch of data blocks are added to the hierarchical graph index structure according to the gradient fingerprint matching degree; The hierarchical graph index retrieval module is used for: According to the hierarchical order of the multiple layers, each second data block that is a data block pair with the first data block is traversed from the hierarchical graph index structure, and after obtaining each second data block, the record of the first data block in the hierarchical graph index structure is cleared. The first data block is one of the data blocks in the first batch. The apparatus also includes a hierarchical graph index structure reuse module, used to add data block pairs from the second batch to the hierarchical graph index structure according to the gradient fingerprint matching degree when all data block records of the first batch are cleared in the hierarchical graph index structure.

13. The apparatus according to claim 12, characterized in that: In each batch, the first element of the similar fingerprints of the data blocks is the same, and / or the total size of the data blocks in each batch does not exceed the capacity of the buffer used to process that batch.

14. The apparatus according to any one of claims 9-13, characterized in that, Each data block has a tag array that records the various layer levels to which the data block belongs in the hierarchical graph index structure; The hierarchical graph index retrieval module is used for: Select the first data block from the plurality of data blocks; According to the hierarchical order of each layer described by the tag array of the first data block, the first data block and each corresponding second data block are traversed layer by layer, wherein the traversal order of each data block first traversed from the hierarchical graph index structure is recorded by the first queue. After traversing all the second data blocks on any of the layers, clear the records on that layer related to the first data block.

15. The apparatus according to claim 14, characterized in that, The hierarchical graph index retrieval module is used for: According to the tag array of the first data block, the first tuple information is added to the second queue. The second queue is used to store the tuple information of the data block to be traversed. The tuple information includes the number information of the highest layer level of the data block in the hierarchical graph index structure. The first tuple information is the tuple information of the first data block. If the first tuple information is located at the head of the second queue, then in the highest layer described by the first tuple information, the second data block corresponding to the first data block is traversed, and the second data block traversed for the first time is added to the first queue, and the tuple information of each second data block traversed is added to the second queue. Clear the records of the first data block in the highest layer and update the tag array of the first data block; Based on the updated tag array, generate second tuple information and add it to the second queue; If the second tuple information is located at the head of the second queue, then the highest layer described by the second tuple information is traversed to obtain the second data blocks corresponding to the first data block until all the second data blocks corresponding to the first data block are obtained.

16. The apparatus according to claim 14 or 15, characterized in that, The data reduction module is used for: All data blocks in the first queue are divided into groups according to a preset granularity, and at least one group includes the first data block and at least one second data block; The data blocks in each group are reduced in size.

17. A computing device, characterized in that the computing device includes a processor and a memory; the processor is configured to execute instructions stored in the memory to cause the processor to perform the method as described in any one of claims 1-8.

18. A computing device cluster, characterized in that the computing device cluster includes at least one computing device, each computing device including a processor and a memory; the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the computing device cluster to perform the method as described in any one of claims 1-8.

19. A computer program product containing instructions, characterized in that, when the instructions are executed by a computing device cluster, the computing device cluster is caused to perform the method as described in any one of claims 1-8.