Fingerprint generation method, compression method and electronic device for file data
By dividing file data into data strings and using bitwise operations with a mask set to generate fingerprints, the fingerprint generation process is simplified, achieving low computational overhead and fast delta compression. This solves the problem of high computational overhead in existing technologies and improves file data compression efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUAZHONG UNIV OF SCI & TECH
- Filing Date
- 2025-09-11
- Publication Date
- 2026-06-26
Abstract
Description
Technical Field
[0001] This invention belongs to the field of data compression technology and, more specifically, relates to a method for generating fingerprints of file data, a compression method, and an electronic device.
Background Technology
[0002] Data compression techniques reduce storage costs and overhead by managing the same logical data with less physical storage resources. Delta compression is a data compression technique that, when compressing file data, finds similar files in the existing file data and then performs delta compression by referencing these similar files.
[0003] Existing delta compression methods for file data typically involve two main steps: fingerprint generation and delta encoding. Fingerprint generation, using a rolling hash algorithm to generate fingerprints for identifying similar data blocks, forms the basis for subsequent delta encoding, and its performance directly impacts the performance of the latter. Therefore, researching a fingerprint generation method for identifying similar file data is of great significance.
[0004] Existing methods for generating fingerprints of file data typically generate a hash code for each data string in the file data, perform a linear transformation on the hash code, and generate the file data fingerprint from the linearly transformed hash code. However, the linear transformation in such methods involves complex multiplication operations, which leads to high computational overhead.
Summary of the Invention
[0005] In view of the above-mentioned defects or improvement needs of the prior art, the present invention provides a method for generating fingerprints of file data, a compression method, and an electronic device to solve the technical problem of high computational overhead in the prior art.
[0006] To achieve the above objectives, in a first aspect, the present invention provides a method for generating fingerprints of file data, comprising:
[0007] A1. Divide the file data into n data strings and calculate the hash code of each data string; where the length of the hash code of each data string is L; n and L are both positive integers;
[0008] A2. For the hash code of each data string in the file data, the hash code features are matched with the hash code features in the preset seed feature set to classify the hash code of each data string into the corresponding hash code category, resulting in a total of K hash code sets; wherein, the preset seed feature set includes: K different hash code features that correspond one-to-one with the K hash code categories; K is the number of bits in the preset file data fingerprint;
[0009] A3. Perform a bitwise operation between each hash code in the k-th hash code set and the k-th mask in the preset mask set. If all of the bitwise operation results are zero, set the k-th bit of the file data fingerprint to the first value; otherwise, set the k-th bit of the file data fingerprint to the second value. The preset mask set includes K different randomly generated masks, each of length L; the first value and the second value are each 0 or 1 and are complementary: the second value is 1 when the first value is 0, and 0 when the first value is 1.
[0010] More preferably, the hash code of the i-th data string in the file data is classified into the $(h_i \bmod K)$-th hash code category, where $h_i$ is the hash code of the i-th data string converted to decimal and $i = 1, 2, \ldots, n$.
[0011] More preferably, the bitwise operation is any one of the AND operation, OR operation, XOR operation, shift-AND operation, shift-OR operation, shift-XOR operation, NOT-AND operation, NOT-OR operation, and NOT-XOR operation;
[0012] Specifically, the shift-AND operation denotes shifting the hash code by a preset number of bits and then performing an AND operation with the mask; the shift-OR operation denotes shifting the hash code by a preset number of bits and then performing an OR operation with the mask; the shift-XOR operation denotes shifting the hash code by a preset number of bits and then performing an XOR operation with the mask; the NOT-AND operation denotes inverting the hash code bit by bit and then performing an AND operation with the mask; the NOT-OR operation denotes inverting the hash code bit by bit and then performing an OR operation with the mask; and the NOT-XOR operation denotes inverting the hash code bit by bit and then performing an XOR operation with the mask. More preferably, the hash code of the data string is calculated using a rolling hash algorithm.
[0013] Secondly, the present invention provides a method for compressing file data, comprising:
[0014] B1. Using the fingerprint generation method provided in the first aspect of this invention, generate the fingerprint of the file data A to be compressed;
[0015] B2. Obtain the fingerprint hash table, which stores historically processed file data together with their fingerprints and fingerprint features; determine whether the fingerprint hash table is empty. If it is, proceed to B9; otherwise, proceed to B3.
[0016] B3. Query whether the fingerprint feature $F_A$ of file data A exists in the fingerprint hash table. If it does, retrieve from the fingerprint hash table the file data B corresponding to $F_A$ and its fingerprint, and proceed to B4; otherwise, proceed to B9.
[0017] B4. Divide the file data B into n data strings, calculate the hash code of each data string in the file data B, and insert the hash code of each data string and its position in the file data B into a temporary hash table; n is a positive integer;
[0018] B5. Divide file data A into n data strings and calculate the hash code of each data string in file data A; for the hash code of the i-th data string in file data A, match its hash code feature against the hash code features in the preset seed feature set to determine the corresponding hash code category; wherein the preset seed feature set is the preset seed feature set used in the fingerprint generation method provided in the first aspect of the present invention;
[0019] B6. Determine whether the fingerprint of file data A and the fingerprint of file data B have the same value at the bit position corresponding to that hash code category. If the bits are the same, proceed to B7; otherwise, proceed to B8.
[0020] B7. Search the temporary hash table for the hash code of the i-th data string in file data A. If it exists, retrieve from the temporary hash table the position in file data B of the corresponding data string, use that position as the encoding result of the i-th data string in file data A, and end the operation; otherwise, proceed to B8.
[0021] B8. Use the i-th data string in file data A directly as the encoding result of the i-th data string in file data A, and end the operation.
[0022] B9. Insert file data A, its fingerprint, and its fingerprint feature $F_A$ together into the fingerprint hash table, and end the operation.
[0023] More preferably, the fingerprint feature of the file data is a preset number of bits at a preset position in the fingerprint of the file data.
[0024] More preferably, the fingerprint feature of the file data is the bits in the latter half of the fingerprint of the file data.
[0025] More preferably, the hash code category corresponding to the hash code of the i-th data string in file data A is $h_i \bmod K$, where $h_i$ is that hash code converted to decimal and K is the number of hash code features in the preset seed feature set.
[0026] Thirdly, the present invention provides an electronic device, comprising: a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the method provided in the first or second aspect of the present invention.
[0027] Fourthly, the present invention also provides a computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed by a processor, controls the device in which the storage medium is located to perform the method provided in the first or second aspect of the present invention.
[0028] Fifthly, the invention also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the method provided in the first or second aspect of the invention.
[0029] In summary, the above-described technical solutions conceived in this invention can achieve the following beneficial effects:
[0030] 1. This invention provides a method for generating fingerprints of file data. The hash code of each data string is used as the content information of the file data. The dimensionality is reduced by masking without changing the content information. The fingerprint obtained can accurately represent the file data and does not involve complex multiplication operations, resulting in low computational overhead.
[0031] 2. Furthermore, in the method for generating fingerprints of file data provided by this invention, the hash code of the i-th data string in the file data is classified into the hash code category given by its decimal value modulo K. Compared with other classification methods, this is simpler and requires less computation.
[0032] 3. Furthermore, the fingerprint generation method for file data provided by the present invention calculates the hash code of the data string using a rolling hash algorithm, which further improves the calculation speed.
[0033] 4. This invention provides a method for compressing file data. First, the fingerprint generation method provided in the first aspect of this invention is used to generate a fingerprint of the file data to be compressed with low computational overhead. Then, similar file data is searched for based on the fingerprint. The fingerprint contains the content information of the file data, and two files with the same fingerprint are similar files. Subsequently, delta encoding is performed against the similar file data that was found, to achieve file data compression. During encoding, the differences between the fingerprints of the two files are used to skip the lookup for data strings that differ, which speeds up compression encoding. This invention solves the problem of high computational overhead and slow speed of fingerprint recognition, and the fingerprints can additionally assist encoding to achieve fast compression encoding.
[0034] 5. The file data compression method provided by this invention establishes a connection between fingerprint generation in similarity detection and identical string recognition in encoding during compression, thereby achieving fast fingerprint generation and encoding. This technology can solve the problems of high overhead, slow speed, and ineffective deployment in storage systems associated with existing delta compression.
Brief Description of the Drawings
[0035] Figure 1 is a flowchart of the method for generating fingerprints of file data provided in an embodiment of the present invention;
[0036] Figure 2 is a schematic diagram of the framework of the file data compression method provided in an embodiment of the present invention;
[0037] Figure 3 is a schematic diagram comparing the fingerprint generation throughput of different methods, provided in an embodiment of the present invention;
[0038] Figure 4 is a schematic diagram comparing the delta compression encoding throughput of different methods, provided in an embodiment of the present invention.
Detailed Implementation
[0039] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.
[0040] To achieve the above objectives, in a first aspect, the present invention provides a method for generating fingerprints of file data which, as shown in Figure 1, comprises:
[0041] A1. Divide the file data into n data strings and calculate the hash code of each data string; where the length of the hash code of each data string is L; n and L are both positive integers;
[0042] A2. For the hash code of each data string in the file data, the hash code features are matched with the hash code features in the preset seed feature set to classify the hash code of each data string into the corresponding hash code category, resulting in a total of K hash code sets; wherein, the preset seed feature set includes: K different hash code features that correspond one-to-one with the K hash code categories; K is the number of bits in the preset file data fingerprint;
[0043] A3. Perform a bitwise operation between each hash code in the k-th hash code set and the k-th mask in the preset mask set. If all of the bitwise operation results are zero, set the k-th bit of the file data fingerprint to the first value; otherwise, set the k-th bit of the file data fingerprint to the second value. The preset mask set includes K different randomly generated masks, each of length L; the first value and the second value are each 0 or 1 and are complementary: the second value is 1 when the first value is 0, and 0 when the first value is 1.
[0044] It should be noted that the hash code of the data string can be calculated using any existing hash algorithm, such as a rolling hash, Rabin-Karp, MD5, SHA-1, CRC32, Jenkins Hash, CityHash, SipHash, or Adler-32, without limitation. Preferably, in one optional implementation, the hash code of the data string is calculated using a rolling hash algorithm, which further improves calculation speed and is better suited to the scenario of calculating a hash code for every data string of the file data.
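To illustrate why a rolling hash suits this per-string computation, the following is a minimal sketch of a polynomial (Rabin-Karp style) rolling hash, in which each new hash code is derived from the previous one in constant time. The base `BASE`, the modulus `MOD`, the default window size, and the function name `rolling_hashes` are assumptions chosen for illustration; the invention does not prescribe a particular rolling hash.

```python
BASE = 31          # assumed polynomial base (illustrative)
MOD = 1 << 8       # assumed modulus: yields hash codes of length L = 8 bits


def rolling_hashes(data: bytes, window: int = 8) -> list[int]:
    """Return one hash code per data string (sliding window, step of 1 byte)."""
    if len(data) < window:
        return []
    # Weight of the byte that leaves the window: BASE^(window-1) mod MOD.
    top = pow(BASE, window - 1, MOD)
    h = 0
    for b in data[:window]:
        h = (h * BASE + b) % MOD
    hashes = [h]
    for i in range(window, len(data)):
        # Remove the outgoing byte, shift the window, add the incoming byte.
        h = ((h - data[i - window] * top) * BASE + data[i]) % MOD
        hashes.append(h)
    return hashes
```

For a 64-byte input and an 8-byte window, this sketch returns 57 hash codes, one per window position.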
[0045] It should be noted that the hash code feature can be obtained in various ways, such as by modulo, bitwise operation, truncation, logarithmic range, square-root range, numerical range, or hash function mapping. No limitation is made here, provided that the number of hash code feature categories is K. Preferably, in one optional implementation, the hash code of the i-th data string in the file data is classified into the $(h_i \bmod K)$-th hash code category, where $h_i$ is the hash code of the i-th data string converted to decimal. In this embodiment, the hash code feature is obtained by taking the hash code of the data string modulo K; it has K possible values in total, $0, 1, \ldots, K-1$, which correspond one-to-one with the K hash code categories. This design was obtained by statistically analyzing the hash codes of historical data strings, and it is simpler than other classification methods.
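The modulo-based classification can be sketched as follows. The function names `hash_category` and `classify` are hypothetical, and hash codes are assumed to be held as non-negative integers.

```python
def hash_category(hash_code: int, k_bits: int) -> int:
    """Map a hash code to one of K categories by taking it modulo K."""
    return hash_code % k_bits


def classify(hash_codes: list[int], k_bits: int) -> list[list[int]]:
    """Group hash codes into K hash code sets, indexed by category."""
    sets: list[list[int]] = [[] for _ in range(k_bits)]
    for h in hash_codes:
        sets[hash_category(h, k_bits)].append(h)
    return sets
```

With K = 8, the hash code 0110 1110 (decimal 110) falls into category 6, since 110 mod 8 = 6.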
[0046] In one optional implementation, the bitwise operation described above is any one of the AND operation, OR operation, XOR operation, shift-AND operation, shift-OR operation, shift-XOR operation, NOT-AND operation, NOT-OR operation, and NOT-XOR operation;
[0047] The shift-AND operation denotes shifting the hash code by a preset number of bits and then performing an AND operation with the mask; the shift-OR operation denotes shifting the hash code by a preset number of bits and then performing an OR operation with the mask; the shift-XOR operation denotes shifting the hash code by a preset number of bits and then performing an XOR operation with the mask; the NOT-AND operation denotes inverting the hash code bit by bit and then performing an AND operation with the mask; the NOT-OR operation denotes inverting the hash code bit by bit and then performing an OR operation with the mask; and the NOT-XOR operation denotes inverting the hash code bit by bit and then performing an XOR operation with the mask.
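The nine operation variants can be sketched in one helper. This is only an illustration: the function name `apply_op`, the operation labels, the left shift direction, and the truncation of results to L = 8 bits are all assumptions, since the text specifies only "shifting by a preset number of bits" and "inverting bit by bit".

```python
L_BITS = 8                      # assumed hash code length L
FULL = (1 << L_BITS) - 1        # keeps intermediate values within L bits


def apply_op(name: str, h: int, mask: int, shift: int = 1) -> int:
    """Apply one of the nine operation variants to hash code h and a mask."""
    if name.startswith("shift-"):
        h = (h << shift) & FULL   # assumed: left shift, truncated to L bits
        name = name[len("shift-"):]
    elif name.startswith("not-"):
        h = ~h & FULL             # bitwise inversion within L bits
        name = name[len("not-"):]
    if name == "and":
        return h & mask
    if name == "or":
        return h | mask
    if name == "xor":
        return h ^ mask
    raise ValueError(f"unknown operation: {name}")
```

For example, `apply_op("not-and", 0b01101110, 0b01000000)` inverts the hash code before masking, so a bit that was set in the hash code yields a zero result.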
[0048] Secondly, the present invention provides a method for compressing file data, comprising:
[0049] B1. Using the fingerprint generation method provided in the first aspect of this invention, generate the fingerprint of the file data A to be compressed. The related technical solutions are the same as those of the fingerprint generation method provided in the first aspect of this invention and will not be described in detail here.
[0050] B2. Obtain the fingerprint hash table, which stores historically processed file data together with their fingerprints and fingerprint features; determine whether the fingerprint hash table is empty. If it is, proceed to B9; otherwise, proceed to B3.
[0051] B3. Query whether the fingerprint feature $F_A$ of file data A exists in the fingerprint hash table. If it does, retrieve from the fingerprint hash table the file data B corresponding to $F_A$ and its fingerprint, and proceed to B4; otherwise, proceed to B9. It should be noted that when multiple file data corresponding to the fingerprint feature $F_A$ of file data A are found in the fingerprint hash table, file data B may be any one of the file data corresponding to $F_A$, or the file data most similar to file data A (for example, the Hamming distance between the fingerprints of file data A and file data B can be used to measure the similarity between the two), or the file data closest to file data A in the file data set; no limitation is made here.
[0052] B4. Divide the file data B into n data strings, calculate the hash code of each data string in the file data B, and insert the hash code of each data string and its position in the file data B into a temporary hash table; n is a positive integer;
[0053] B5. Divide file data A into n data strings and calculate the hash code of each data string in file data A; for the hash code of the i-th data string in file data A, match its hash code feature against the hash code features in the preset seed feature set to determine the corresponding hash code category; wherein the preset seed feature set is the preset seed feature set used in the fingerprint generation method provided in the first aspect of the present invention;
[0054] B6. Determine whether the fingerprint of file data A and the fingerprint of file data B have the same value at the bit position corresponding to that hash code category. If the bits are the same, proceed to B7; otherwise, proceed to B8.
[0055] B7. Search the temporary hash table for the hash code of the i-th data string in file data A. If it exists, retrieve from the temporary hash table the position in file data B of the corresponding data string, use that position as the encoding result of the i-th data string in file data A, and end the operation; at this point, the compression of the file data A to be compressed is complete, and the compression of the next file data can continue. Otherwise, proceed to B8.
[0056] B8. Use the i-th data string in file data A directly as the encoding result of the i-th data string in file data A, and end the operation. At this point, the compression of the file data A to be compressed is complete, and the compression of the next file data can continue.
[0057] B9. Insert file data A, its fingerprint, and its fingerprint feature $F_A$ together into the fingerprint hash table, and end the operation. At this point, delta compression of file data A has failed because no similar file data was found, and the compression of the next file data can continue.
[0058] It should be noted that there are various ways to obtain the fingerprint features of file data, such as extracting a preset number of bits at preset positions in the fingerprint, extracting multiple individual bits at specific positions in the fingerprint, extracting multiple bits within a preset range in the fingerprint, or sorting the extracted multiple bits according to a preset algorithm or performing hash calculations, etc., which are not limited here. In one optional implementation, the fingerprint feature of the file data is a preset number of bits at preset positions in the fingerprint of the file data. In this invention, the fingerprint of the file data contains K bits, and the fingerprint feature of the file data can be the first K / 2 bits of the fingerprint of the file data, or the last K / 2 bits of the fingerprint of the file data, or a preset number of bits at other preset positions in the fingerprint of the file data, which are not limited here. Preferably, the fingerprint feature of the file data is the bits in the latter half of the fingerprint of the file data (i.e., the last K / 2 bits).
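As a minimal sketch of the preferred option, the fingerprint feature can be taken as the last K / 2 bits of the fingerprint. The function name `fingerprint_feature` is hypothetical, and the sketch assumes the fingerprint is stored as an integer whose "latter half" is its low-order K / 2 bits; the text does not fix the bit ordering.

```python
def fingerprint_feature(fingerprint: int, k_bits: int) -> int:
    """Return the last K/2 bits of a K-bit fingerprint (assumed: low-order half)."""
    half = k_bits // 2
    return fingerprint & ((1 << half) - 1)
```

Under this assumption, the 8-bit fingerprint 1011 0110 has the fingerprint feature 0110.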
[0059] In one optional implementation, the method for determining the hash code category corresponding to the hash code of the i-th data string in file data A is the same as the method for determining the hash code category in the fingerprint generation method provided in the first aspect of the present invention, and will not be described in detail here; preferably, the corresponding hash code category is $h_i \bmod K$, where $h_i$ is the hash code of the i-th data string in file data A converted to decimal and K is the number of hash code features in the preset seed feature set.
[0060] In summary, addressing the shortcomings and improvement needs of current delta compression technology, this invention provides a method and technology for rapid fingerprint generation and delta encoding in delta compression. This addresses the problems of slow fingerprint generation and delta encoding calculation speeds and the lack of connection between the two processes in existing solutions. It effectively improves the speed of fingerprint generation and delta encoding calculations by mapping the fingerprint calculated in fingerprint generation to the fingerprint of the file data, establishing a connection between fingerprint generation and delta encoding, and accelerating delta encoding processing.
[0061] To further illustrate the fingerprint generation and compression methods for file data provided by the present invention, a specific embodiment is described in detail below:
[0062] This embodiment establishes a connection between the fingerprint generation and delta encoding stages of delta compression by using the same rolling hash algorithm to calculate fingerprints in both stages. By storing the fingerprint information obtained during fingerprint generation (i.e., classifying the hash codes of the data string) in a fingerprint format, the fingerprint possesses the ability to find similar data blocks for the target data block, and the obtained similar data blocks can be processed using the fingerprint for fast delta encoding. Content-based rolling hashing is characterized by its fast computation speed, thus enabling rapid fingerprint generation. Similarly, delta encoding implemented using fingerprints also boasts fast computation capabilities. Combining these optimizations, fast delta compression is essentially achieved.
[0063] Delta compression is a computational process between two similar data blocks. Therefore, before performing delta compression, it is necessary to find a similar file for the file data to be compressed (denoted as the target file data) to achieve delta compression. Thus, delta compression first requires calculating a fingerprint for the object to be compressed to retrieve similar data blocks; this process is called fingerprint generation. The process of finding similar data blocks using fingerprints is called similarity lookup. Similar data blocks are found through similarity lookup and the target data block is stored in a fingerprint hash table. Having determined the file data to be compressed and its similar files, delta compression encodes the file data to be compressed using delta encoding. In this invention, fingerprints can be used to assist delta encoding and reduce overhead. The above describes the process of performing delta compression on a single file. Performing delta compression on multiple files simply requires repeating the above process.
[0064] Below, a practical example illustrates how fast fingerprint generation and delta encoding are achieved in delta compression. The process is shown in Figure 2. For ease of understanding, this example makes the following assumptions: the dataset to be compressed is 8 KiB (8192 B), each file is 64 B in size, and there are 128 files in total. In this embodiment, the 64th file is to be delta compressed; the 1st to 63rd files have already been compressed. The content-based rolling hash algorithm has a rolling window of 8 B and slides 1 B at a time, producing an 8-bit hash code each time. In this embodiment, the fingerprint size is 8 bits, i.e., K = 8. Eight binary masks are defined, one for each of the eight hash code categories. Specifically, eight masks are pre-generated: 1000 0000, 0100 0000, 0010 0000, 0001 0000, 0000 1000, 0000 0100, 0000 0010, and 0000 0001, which constitute the preset mask set.
[0065] A fingerprint is calculated for the file data to be compressed. Specifically, a content-based rolling hash is applied to the 64th file data. Since the rolling window is 8 bytes and slides 1 byte at a time, there are 64 - 8 + 1 = 57 data strings in total, which yield 57 hash codes. In this embodiment, the hash code of the first data string of the file data to be compressed is 0110 1110. Converting 0110 1110 to decimal gives 110, and 110 modulo K (K = 8) is 6, so the hash code is reduced in dimension and mapped to the 6th bit of the fingerprint (i.e., it is classified into the 6th hash code category), counting from the lowest bit. An AND operation is then performed between this hash code and the 6th mask; the result is not 0, so the 6th bit of the fingerprint is set to 1. The AND operation is used here only for illustration; other suitable operations, such as OR, XOR, the shift-based AND, OR, and XOR operations, or the inverted (NOT) AND, OR, and XOR operations, can also be used. In this embodiment, the hash code of the 16th data string of the file data to be compressed is 0110 1110. After the same operations, it is likewise mapped to the 6th bit of the fingerprint. The corresponding calculation yields 0, which does not change the fact that the 6th bit of the fingerprint is 1. That is, once a bit of the fingerprint has been changed from 0 to 1, it is not changed again. Performing the above operation on the hash codes of all 57 data strings completes the fingerprint generation for the 64th file data. In this embodiment, the fingerprint generated for the file data is 1011 0110. In this embodiment, the hash code of a data string is mapped to 1 bit of the fingerprint, but it may also be mapped to more than 1 bit.
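The arithmetic in this example can be checked with a few lines. The variable names below are illustrative only.

```python
FILE_SIZE = 64   # bytes per file in the example
WINDOW = 8       # rolling window size in bytes
K = 8            # fingerprint length in bits

num_strings = FILE_SIZE - WINDOW + 1      # one data string per window position
hash_code = 0b0110_1110                   # hash code of the first data string
category = hash_code % K                  # modulo classification

print(num_strings, hash_code, category)   # prints: 57 110 6
```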
[0066] The specific process described above is as follows:
[0067] First, a sliding window is used to divide the file data into data strings; the sliding window moves one byte toward the end of the file data each time. The 64th file data is denoted file data A. After division into data strings, file data A can be represented as $A = \{s_1, s_2, \ldots, s_n\}$, where $s_i$ denotes the i-th data string in file data A.
[0068] Then, a hash code is calculated for each data string of file data A using a content-based rolling hash function g, where the hash code of the i-th data string is represented as $h_i = g(s_i)$. This hash code represents the content information of the data string $s_i$; using content-based rolling hashing accelerates fingerprint generation;
[0069] Next, since the hash code $h_i$ cannot be used directly as the fingerprint of file data A, $h_i$ is reduced in dimension and classified. If $h_i$ belongs to the k-th hash code category after classification, let the mask of the k-th hash code category be denoted $m_k$; a bitwise operation is performed on $h_i$ and $m_k$, and the result is mapped directly to the k-th bit of the fingerprint $f_A$ of file data A, whose value is denoted $f_A[k]$.
[0070] In this embodiment, the first value is 0 and the second value is 1. First, the fingerprint $f_A$ of file data A is initialized to a string of K zero bits. When the bitwise operations between $m_k$ and every hash code $h_i$ in the k-th hash code category all result in 0, $f_A[k]$ remains unchanged, i.e., 0; when any of these bitwise operations results in a non-zero value, $f_A[k]$ is set to 1.
[0071] Finally, the above calculations are performed on all data strings of file data A to obtain the fingerprint $f_A$ of file data A. Each element of the fingerprint represents content information of data strings in file data A, so similar file data can be searched for using fingerprints.
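The steps above can be sketched as a single function. This is a hedged illustration rather than the patented implementation: it assumes the AND operation, a first value of 0 and a second value of 1, integer hash codes, zero-based category indices, and that category k selects both `masks[k]` and bit k of the fingerprint counted from the lowest bit. The name `generate_fingerprint` is hypothetical.

```python
def generate_fingerprint(hash_codes: list[int], masks: list[int]) -> int:
    """Build a K-bit fingerprint from hash codes and K masks (K = len(masks))."""
    k_bits = len(masks)
    fingerprint = 0                       # initialised to K zero bits
    for h in hash_codes:
        k = h % k_bits                    # hash code category (modulo K)
        if h & masks[k] != 0:             # a non-zero AND result sets bit k
            fingerprint |= 1 << k         # a bit that is set is never cleared
    return fingerprint
```

Because only a modulo, an AND, and an OR are applied per hash code, no multiplication-based linear transformation is needed in this step.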
[0072] Each bit of a fingerprint contains information about the file data, and each bit corresponds to a data string within the file data. The position of the fingerprint can be used to accelerate the encoding of the corresponding data string within the file data. Fingerprints also have the ability to find similar file data; file data with the same fingerprint are considered similar and can be used to perform delta encoding for data compression.
[0073] This step differs from existing solutions in that, after obtaining the hash codes using a content-based rolling hash algorithm, it does not go on to generate fingerprints through sampling. Instead, it reduces the dimensionality of these hash codes and saves them into the fingerprint, in which most hash codes are represented. Existing solutions typically select only one hash code, transform it, and then generate the fingerprint. The advantage of this step is that, through simple dimensionality reduction and mask calculation, fingerprint generation is very fast, effectively improving throughput. Furthermore, the fingerprint retains some data string information from the file data, making it more suitable for similarity matching. This data string information can also be used to assist in faster delta encoding. Another advantage of the fingerprints generated by this invention is that the number of identical bits at corresponding positions in the fingerprints of two file data can also be used to determine the similarity between the two files.
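Counting identical bits is equivalent to using the Hamming distance between two fingerprints, where a smaller distance means more identical bits. A minimal sketch follows, assuming fingerprints are integers and that candidates are supplied as a hypothetical mapping from file ID to fingerprint.

```python
def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def most_similar(fp_a: int, candidates: dict[str, int]) -> str:
    """Return the ID of the candidate whose fingerprint is closest to fp_a."""
    return min(candidates,
               key=lambda file_id: hamming_distance(fp_a, candidates[file_id]))
```

This is one possible way to choose among several candidate files; the text leaves the selection rule open.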
[0074] In this embodiment, the delta compression process is as follows:
[0075] Look up whether the fingerprint feature $F_A$ of file data A exists in the fingerprint hash table, and thereby find its similar file data:
[0076] Specifically, the process of finding similar file data using the fingerprint hash table is as follows:
[0077] If no hash table entry corresponding to the fingerprint feature of the file data to be delta compressed is found, no file data similar to it can be found. In this case, the fingerprint feature of the file data to be delta compressed is used as the "key", the ID and fingerprint of that file data are used as the "value", and the pair is stored in the fingerprint hash table. The next file data is then processed.
[0078] When a hash table entry corresponding to the fingerprint feature of the file data to be delta compressed is found, one file data is taken from one of the found hash table entries as similar file data to the file data to be delta compressed. The file data to be delta compressed will be delta encoded with reference to this similar file data, which is also called the base data block.
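The two cases above can be sketched with an ordinary dictionary keyed by fingerprint feature. The type alias `FingerprintTable`, the function name `find_or_register`, and the choice of returning the first matching entry are assumptions for illustration; the text allows any matching entry to be used.

```python
from typing import Optional

# fingerprint feature -> list of (file ID, fingerprint) entries
FingerprintTable = dict[int, list[tuple[str, int]]]


def find_or_register(table: FingerprintTable, file_id: str, fingerprint: int,
                     feature: int) -> Optional[tuple[str, int]]:
    """Return a similar file's (ID, fingerprint), or register this file and return None."""
    entries = table.get(feature)
    if entries:
        return entries[0]                 # any matching entry may serve as the base
    # No match: store the feature as key, and the ID and fingerprint as value.
    table.setdefault(feature, []).append((file_id, fingerprint))
    return None
```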
[0079] When a similar file B is found to file A, the fingerprint is used to accelerate delta encoding. A hash code is calculated for each data string in file B. The hash code is used as the "key" and the offset of the data string in file B is used as the "value" and stored in a temporary hash table.
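A minimal sketch of building the temporary hash table follows. The function name `build_temp_table` and the `hash_fn` parameter are hypothetical; any hash function over a data string can be passed in, and the window slides 1 byte at a time as in the example above.

```python
from typing import Callable


def build_temp_table(base: bytes, window: int,
                     hash_fn: Callable[[bytes], int]) -> dict[int, int]:
    """Map the hash code of each data string in file B to its byte offset."""
    table: dict[int, int] = {}
    for offset in range(len(base) - window + 1):
        # A later string with the same hash code overwrites the earlier offset.
        table[hash_fn(base[offset:offset + window])] = offset
    return table
```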
[0080] Divide file data A into n data strings, and calculate the hash code of each data string in file data A; divide file data A into n data strings ... i hash code of a data string The hash code features are matched with the hash code features in the preset seed feature set to determine... The corresponding hash code category Compare the fingerprints of file data A and file data B. Are the values at the same position the same? If they are different, directly check the first position of file data A. i Encode the first data string to speed up the delta encoding process; otherwise, check if the file data A's first data string exists in the temporary hash table. i The hash code of the data string is used to find the entry in the table. Based on its value, the hash code of the first data string in file A is used to retrieve the data. i The first data string is encoded as a "copy" instruction. If it is not found, the first data string in file A will be copied. i Each data string is encoded as an "insert" instruction.
[0081] Preferably, hash collisions may occur when the hash codes of data strings in similar file data are stored in a temporary hash table. In this case, the temporary hash table only retains the offset of the latest data string.
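The "keep the latest offset" rule can be shown with a short sketch. The helper `toy_hash` and the 8-byte window are assumptions for illustration and stand in for whichever rolling hash and data string length an implementation uses.

```python
def toy_hash(window):
    """Hypothetical stand-in for the rolling hash, truncated to 16 bits."""
    h = 0
    for byte in window:
        h = (h * 31 + byte) & 0xFFFF
    return h


def build_temp_table(base, window=8):
    """Map each data string's hash code to its offset in the base file data.

    Offsets are visited in increasing order, so when two data strings share
    a hash code the later offset overwrites the earlier one.
    """
    table = {}
    for offset in range(len(base) - window + 1):
        table[toy_hash(base[offset:offset + window])] = offset
    return table
```

For the input `b"abcdefgh" * 2`, the string `abcdefgh` occurs at offsets 0 and 8, and the table keeps only offset 8.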
[0082] In the above process, fingerprints are used to speed up the delta encoding of file data A. Specifically, the following steps are involved:
[0083] For each data string in the target file data A, its hash code is calculated; the fingerprints can then be used to determine whether the data string is included in similar file data B.
[0084] Determine the hash code category corresponding to the hash code of the data string. The value of the bit at the position of that category in the fingerprint of target file data A is compared with the value of the bit at the same position in the fingerprint of similar file data B to decide whether to skip querying the temporary hash table.
[0085] The comparison has one of the following two outcomes:
[0086] When the two bit values differ: the data string is not included in similar file data B. It is a completely new data string and must be encoded anew. It is therefore encoded in the delta file as an "insert" instruction that carries the data content of the data string, i.e., it is emitted as entirely new data.
[0087] When the two bit values are the same: it cannot be determined whether the data string is included in similar file data B, and the temporary hash table must be queried for further processing. Specifically, the temporary hash table is looked up using the hash code of the data string of target file data A, which leads to the following two scenarios:
[0088] Scenario 1: An entry for the hash code is found in the temporary hash table. The corresponding "value" is read, which gives the offset in similar file data B. In the delta file, the data string is encoded as a "copy" instruction, i.e., it is encoded as the offset that the temporary hash table associates with its hash code.
[0089] Scenario 2: No entry for the hash code is found in the temporary hash table. The data string is a completely new data string that is not included in similar file data B, and it is encoded in the delta file as an "insert" instruction that carries its data content, i.e., the data string itself is used directly as its encoding result in target file data A.
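The decision flow of paragraphs [0083] to [0089] can be sketched as below. This is an illustration under stated assumptions rather than the invention's implementation: the classification rule `h % K`, the 8-bit fingerprint width, the 0-indexed bit positions, and the function names are all hypothetical, and the fingerprints and temporary hash table are taken as already computed.

```python
K = 8  # fingerprint width in bits (assumed for illustration)


def category(h):
    """Assumed classification rule; stands in for seed-feature matching."""
    return h % K


def bit(fp, k):
    """Value of the 0-indexed bit k of a fingerprint."""
    return (fp >> k) & 1


def delta_encode(strings_a, hashes_a, fp_a, fp_b, temp_table):
    """Encode each data string of A as a ("copy", offset) or ("insert", data).

    Returns the instruction list and the number of temporary hash table
    lookups that were skipped thanks to the fingerprint comparison.
    """
    instructions, skipped = [], 0
    for data, h in zip(strings_a, hashes_a):
        k = category(h)
        if bit(fp_a, k) != bit(fp_b, k):
            # Bits differ: treat the string as new and skip the lookup.
            skipped += 1
            instructions.append(("insert", data))
        elif h in temp_table:
            # Scenario 1: hash code found, emit a copy of the base offset.
            instructions.append(("copy", temp_table[h]))
        else:
            # Scenario 2: hash code not found, emit the data itself.
            instructions.append(("insert", data))
    return instructions, skipped
```

In this sketch the saving comes from the first branch: every data string whose category bit differs between the two fingerprints is emitted without touching the temporary hash table.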
[0090] Specifically, the following is a detailed process of this embodiment:
[0091] The information of the 64th file is to be stored and its similar files searched for. Specifically, the similarity information is stored in a fingerprint hash table, and similar files are found by searching this table. The fingerprint of the 64th file is 1011 0110. The lower 4 bits of the fingerprint are taken as the fingerprint feature, which also serves as the "key" for the search; that is, the "key" of the 64th file in the fingerprint hash table is 0110. Searching the fingerprint hash table with this "key" involves two cases:
[0092] If a corresponding hash table entry is found and the corresponding "values" are 24 and 32, the 24th and 32nd file data are each similar to the 64th file data, and the 64th file data can be delta encoded with reference to either of them. For consistency, this embodiment uses the file data closest to the 64th file data in the reference file data set, namely the 32nd file data. The information of the 64th file data is then stored in the hash table entry corresponding to the "key" 0110, the lower 4 bits of its fingerprint.
[0093] If no corresponding hash table entry is found, meaning there is no entry for "key" 0110 in the fingerprint hash table, it indicates that there are no files similar to the 64th file among the first to 63rd files in the reference file dataset. In this case, 0110 is used as the "key" of the fingerprint hash table, and 64 (representing the 64th file) and its fingerprint are stored as the "value" of the hash table entry corresponding to "key" 0110 in the fingerprint hash table. Then, the 65th file is processed step by step, that is, the fingerprint is calculated for the 65th file.
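The numbers of this example can be traced with a few lines of Python. The sketch simplifies the stored "value" to the file ID alone (the embodiment also stores the fingerprint), and the function name is hypothetical.

```python
def lookup_and_store(table, file_id, fp, feature_bits=4):
    """Return (key, base_id) and record the file under its key."""
    key = fp & ((1 << feature_bits) - 1)   # lower bits of the fingerprint
    candidates = table.get(key, [])
    # Files are processed in order, so the largest stored ID is the closest.
    base = max(candidates) if candidates else None
    table.setdefault(key, []).append(file_id)
    return key, base


table = {0b0110: [24, 32]}
key, base = lookup_and_store(table, 64, 0b1011_0110)
```

With the table holding files 24 and 32 under key 0110, the call yields key 0110 and base file 32, and file 64 is appended to the same entry. Starting from an empty table instead, the base is `None` and a new entry for key 0110 is created.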
[0094] The above steps find that the file data similar to the 64th file data is the 32nd file data, which is therefore referenced when delta encoding the 64th file data. First, the 32nd file data is read from the file dataset. Based on its content, one hash code is calculated for each of its 57 data strings, giving 57 hash codes. These hash codes are used as "keys", and the corresponding offsets are stored as "values" in a temporary hash table. In this embodiment, when two hash codes are the same, the value with the larger data string sequence number is stored. For example, if the hash codes of the 12th and 48th data strings are both 1100 0010, then in the temporary hash table the "value" corresponding to the "key" 1100 0010 is 48, which identifies the 8 consecutive bytes that make up the 48th data string of the 32nd file data.
[0095] This step is basically consistent with the process of existing solutions and technologies. It calculates hash codes on data strings in similar file data to prepare for delta encoding of the file data to be compressed. The file data to be compressed uses the hash codes of data strings in similar file data to determine whether the similar file data contains data strings from the file data to be compressed.
[0096] The delta encoding of the 64th file data is accelerated using the fingerprints of the 64th and 32nd file data. Specifically, hash codes are calculated in sequence for all 57 data strings of the 64th file data. The hash code of its first data string is 0110 1110. After dimensionality reduction mapping, it is mapped to the 6th bit of the fingerprint (i.e., classified into the 6th hash code category). The 6th bits, counted from low to high, of the fingerprints of the 64th and 32nd file data are then checked. The 6th bit of the fingerprint of the 64th file data is "1". For the 6th bit of the fingerprint of the 32nd file data, there are two possibilities:
[0097] The 6th bit, counted from low to high, of the fingerprint of the 32nd file data is "1". It cannot be determined whether the first data string of the 64th file data exists in the 32nd file data, so the temporary hash table must be queried for further processing. This process is the same as in existing delta compression encoding algorithms; it is also described in the specific embodiment below and is not repeated here.
[0098] The 6th bit, counted from low to high, of the fingerprint of the 32nd file data is "0", indicating that the first data string of the 64th file data cannot exist in the 32nd file data. This data string is therefore encoded as an "insert" instruction in the delta file data, that is, "insert the first data string of the 64th file data", meaning that the first data string of the 64th file data is itself used as its encoding result.
[0099] If the hash code corresponding to the 8th data string of the 64th file data is mapped to the 7th bit of the fingerprint (i.e., classified into the 7th hash code category) after dimensionality reduction mapping, the 7th bit of the fingerprint of the 64th file data is "0" from low to high. Regardless of the value of the 7th bit of the fingerprint of the 32nd file data from low to high, it is impossible to determine whether the 8th data string exists in the 32nd file data.
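The bit checks of this example can be reproduced with a small sketch. Bit positions are counted from the low end starting at 1, as in the text. The function names are hypothetical, and the rule `hash_code % 8` used to obtain category 6 is only an assumption that happens to agree with the example; the source does not confirm it. The `decide` function follows this example literally: it returns "insert" only when the target bit is "1" and the base bit is "0", and treats a "0" in the target fingerprint as undetermined, which is narrower than skipping the lookup on every bit difference as stated in paragraph [0086].

```python
def bit_from_low(fp, position):
    """Value of the bit at a 1-indexed position counted from the low end."""
    return (fp >> (position - 1)) & 1


def decide(fp_target, fp_base, position):
    """Decision for one data string, following the worked example."""
    target_bit = bit_from_low(fp_target, position)
    base_bit = bit_from_low(fp_base, position)
    if target_bit == 1 and base_bit == 0:
        return "insert"
    return "query temporary hash table"


FP_64 = 0b1011_0110                   # fingerprint of the 64th file data
ASSUMED_CATEGORY = 0b0110_1110 % 8    # assumption: yields 6 for this hash code
```

For the first data string, `decide(FP_64, fp_32, 6)` gives "insert" when the 6th bit of `fp_32` is 0 and "query temporary hash table" when it is 1. For the 8th data string, mapped to the 7th bit, the result is always "query temporary hash table" because the 7th bit of `FP_64` is 0.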
[0100] Based on the above process, it can be seen that the generated fingerprint can assist in delta encoding, avoiding the need to look up temporary hash tables for data strings not included in similar file data, thus preventing high query overhead. At the same time, it reuses the computation during fingerprint generation, avoiding wasted computational resources.
[0101] For cases where it cannot be determined whether a certain data string in the 64th file exists in the 32nd file:
[0102] One hash code is calculated for each of the 57 data strings in the 64th file. Suppose the 20th hash code is 1100 0010 and that, after comparing the fingerprints of the 64th and 32nd files, it cannot be determined whether the 20th data string of the 64th file exists in the 32nd file. The temporary hash table is then queried for the hash code 1100 0010 of the 20th data string of the 64th file. If it exists, the offset stored for hash code 1100 0010 is 48 (i.e., its position in the 32nd file). In the delta file, the 20th data string of the 64th file is encoded as a "copy" instruction, i.e., "copy the 48th data string of the 32nd file"; this offset is used as the encoding result of the 20th data string of the 64th file. If it does not exist, the 20th data string is a completely new data string and is encoded as an "insert" instruction in the delta file, that is, "insert the 20th data string of the 64th file data", meaning that the 20th data string of the 64th file data is itself used as its encoding result.
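The lookup in this example, together with the collision rule of the embodiment (the larger data string sequence number is kept), can be traced as follows. The function names are hypothetical, only the hash codes mentioned in the text are filled in, the entry for the 13th data string is an invented placeholder, and the "insert" instruction carries the sequence number instead of the actual bytes to keep the sketch short.

```python
def build_temp_table(hash_by_sequence):
    """Map hash code -> data string sequence number in the base file."""
    table = {}
    for seq in sorted(hash_by_sequence):
        table[hash_by_sequence[seq]] = seq   # larger sequence number wins
    return table


def encode(seq, hash_code, table):
    """Encode one data string of the target file against the base table."""
    if hash_code in table:
        return ("copy", table[hash_code])
    return ("insert", seq)


# 12th and 48th data strings of the 32nd file share hash code 1100 0010;
# the 13th entry is a placeholder value chosen for illustration.
temp_table = build_temp_table({12: 0b1100_0010, 13: 0b0001_0001, 48: 0b1100_0010})
instruction = encode(20, 0b1100_0010, temp_table)
```

Here `instruction` is `("copy", 48)`, matching "copy the 48th data string of the 32nd file"; a hash code that is absent from the table would produce an "insert" instruction instead.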
[0103] In this embodiment, the hash codes of the data strings of the 64th file are calculated in the order of the rolling window of the rolling hash. The data strings of the 64th file are therefore also encoded sequentially; that is, the order of the data strings in the delta file is the same as their original order in the 64th file.
[0104] This step is basically the same as in existing solutions: delta encoding references the base data block and encodes the data with fewer bits, following the order of the data strings in the target data block. The difference is that, in the solution of this invention, some data strings can be encoded on the basis of the fingerprints alone, without consulting the similar file data.
[0105] This embodiment first calculates a fingerprint for the file data to be compressed: it calculates a hash code for each data string of the file data to be compressed and maps it to the corresponding position in the fingerprint. It then searches for similar file data according to the fingerprint; two files with the same fingerprint are considered similar files. Subsequently, delta compression encoding is performed with reference to the similar file data that was found. During delta compression encoding, the fingerprint differences between the two files are used to skip the checks for data strings that differ, which speeds up the delta compression encoding process. Finally, the encoded file data is stored to complete delta compression. This invention solves the problem of high computational overhead and slow speed in similarity recognition, and its fingerprints can be used to assist delta compression encoding and make it fast. This invention establishes for the first time a connection between fingerprint generation in similarity detection and identical string recognition in delta compression encoding, achieving both fast fingerprint generation and fast delta compression encoding. This technology can solve the problems of high overhead, slow speed, and ineffective deployment in storage systems that are associated with delta compression.
[0106] In summary, this invention establishes a link between fingerprint generation and delta encoding processing, accelerating the delta encoding process by utilizing the generated fingerprint. This technology maps file data content information to the fingerprint, effectively improving fingerprint generation speed while using fingerprints to determine file data similarity. This invention also ensures the delta compression reduction rate while effectively improving the efficiency of delta compression.
[0107] This invention was implemented in code and tested. The experimental environment was as follows: an Intel(R) Xeon(R) E5-2620 v4 CPU with 8 cores running at 2.10 GHz, 32 GB of DDR4 DRAM at 2133 MHz, and a 1 TiB solid-state drive. The experiment was run on a Linux system, specifically CentOS 7.9, and implemented in C++ (C++11 standard). The method and system parameters involved in this invention are configured as follows: 256 predefined 64-bit hash codes are used to classify the hash codes generated from the data strings by rolling hashing. Then, 64 predefined 64-bit masks are used to reduce the dimensionality of the data string hash codes and map them to the fingerprint.
[0108] To illustrate the superiority and high performance of this invention in delta-compressed similarity recognition, the invention was tested on six different datasets. Table 1 describes these six datasets, and the test results include the final reduction rate, the throughput of similarity fingerprint generation, and the throughput of delta encoding.
[0109] Table 1. Description of the dataset used in the test
[0110]
[0111] The comparative scheme used in this invention is as follows:
[0112] Odess: A traditional similarity recognition scheme that uses content-based Gear hashing to optimize similarity fingerprint generation.
[0113] Xdelta3: A classic delta compression coding algorithm that calculates an adler32 fingerprint for each data string, which has a high computational cost.
[0114] Gdelta: An optimized delta compression encoding algorithm that uses content-based Gear hashing to calculate a fingerprint for each data string. This scheme cannot utilize fingerprint acceleration.
[0115] During the testing of this invention, the above schemes were combined as follows: SS-G, using the fingerprint generation method provided by this invention and Gdelta as the delta compression coding algorithm; SS-GS, using the fingerprint generation method of this invention together with the fingerprint-assisted, Gdelta-based delta compression coding method provided by this invention; SS-X, using the fingerprint generation method of this invention and Xdelta3 as the delta compression coding algorithm; OD-G, using Odess fingerprint generation and Gdelta as the delta compression coding algorithm; OD-X, using Odess fingerprint generation and Xdelta3 as the delta compression coding algorithm.
[0116] Table 2 shows the test results comparing the final data reduction rates. Figure 3 shows the test comparison of the throughput of similarity fingerprint generation. Figure 4 shows the test comparison of delta compression encoding throughput.
[0117] Table 2 Final Data Reduction Rate Results
[0118]
[0119] In data reduction systems, the data reduction rate is always the most important metric and should be given priority. As shown in Table 2, comparing SS-G with OD-G and SS-X with OD-X, the fingerprint generated by this invention achieves a delta compression reduction rate comparable to that of existing fingerprint generation algorithms. Comparing SS-G with SS-GS shows that introducing fingerprint-assisted delta compression coding does not significantly reduce the data reduction rate.
[0120] This invention can improve the throughput of fingerprint generation in delta compression; the results in Figure 3 confirm this. With fingerprint assistance, this invention can also improve the throughput of delta encoding in delta compression; the results in Figure 4 confirm this.
[0121] In summary, this invention does not significantly affect the reduction rate, but it achieves good results in terms of overall throughput of delta compression.
[0122] Thirdly, the present invention provides an electronic device, comprising: a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the method provided in the first or second aspect of the present invention.
[0123] The relevant technical solutions are the same as those provided in the first and second aspects of this invention, and will not be described in detail here.
[0124] Fourthly, the present invention also provides a computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed by a processor, controls the device in which the storage medium is located to perform the method provided in the first or second aspect of the present invention.
[0125] The relevant technical solutions are the same as those provided in the first and second aspects of this invention, and will not be described in detail here.
[0126] Fifthly, the invention also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the method provided in the first or second aspect of the invention.
[0127] The relevant technical solutions are the same as those provided in the first and second aspects of this invention, and will not be described in detail here.
[0128] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for generating fingerprints of file data, characterized in that it comprises:
A1. Dividing the file data into n data strings and calculating the hash code of each data string, where the length of the hash code of each data string is L, and n and L are both positive integers;
A2. For the hash code of each data string in the file data, matching its hash code feature against the hash code features in a preset seed feature set to classify the hash code of each data string into the corresponding hash code category, yielding K hash code sets in total; the preset seed feature set includes K different hash code features in one-to-one correspondence with the K hash code categories; K is the number of bits in the preset file data fingerprint;
A3. Performing a bitwise operation on each hash code in the k-th hash code set with the k-th mask in a preset mask set; when the results of the bitwise operations are all zero, setting the k-th bit of the file data fingerprint to a first value; otherwise, setting the k-th bit of the file data fingerprint to a second value; the preset mask set includes K different randomly generated masks, each of length L; the first value is 0 or 1 and the second value is 0 or 1, such that when the first value is 0 the second value is 1, and when the first value is 1 the second value is 0;
wherein the hash code of the i-th data string in the file data is classified into the hash code category whose index is determined from the hash code of the i-th data string converted to decimal;
the bitwise operation is any one of the following: AND, OR, XOR, AND after shift, OR after shift, XOR after shift, AND after bitwise NOT, OR after bitwise NOT, and XOR after bitwise NOT;
wherein AND after shift means shifting the hash code by a preset number of bits and then performing an AND operation with the mask; OR after shift means shifting the hash code by a preset number of bits and then performing an OR operation with the mask; XOR after shift means shifting the hash code by a preset number of bits and then performing an XOR operation with the mask; AND after bitwise NOT means inverting the bits of the hash code and then performing an AND operation with the mask; OR after bitwise NOT means inverting the bits of the hash code and then performing an OR operation with the mask; XOR after bitwise NOT means inverting the bits of the hash code and then performing an XOR operation with the mask.
2. The fingerprint generation method according to claim 1, characterized in that the hash code of the data string is calculated using a rolling hash algorithm.
3. A method for compressing file data, characterized in that it comprises:
B1. Using the fingerprint generation method according to any one of claims 1-2 to generate the fingerprint of the file data A to be compressed;
B2. Obtaining the fingerprint hash table, which is used to store historically processed file data together with its fingerprints and fingerprint features; determining whether the fingerprint hash table is empty; if so, proceeding to B9; otherwise, proceeding to B3;
B3. Querying whether the fingerprint feature $F_A$ of file data A exists in the fingerprint hash table; if so, retrieving from the fingerprint hash table the file data B corresponding to $F_A$ and its fingerprint, and proceeding to B4; otherwise, proceeding to B9;
B4. Dividing the file data B into n data strings, calculating the hash code of each data string in the file data B, and inserting the hash code of each data string and its position in the file data B into a temporary hash table; n is a positive integer;
B5. Dividing the file data A into n data strings and calculating the hash code of each data string in the file data A; for the hash code of the i-th data string of the file data A, matching its hash code feature against the hash code features in the preset seed feature set to determine the corresponding hash code category; wherein the preset seed feature set is the preset seed feature set used in the fingerprint generation method according to any one of claims 1-2;
B6. Determining whether the fingerprint of the file data A and the fingerprint of the file data B have the same value at the bit corresponding to that hash code category; if so, proceeding to B7; otherwise, proceeding to B8;
B7. Querying the temporary hash table for the hash code of the i-th data string of the file data A; if it exists, retrieving from the temporary hash table the position in the file data B of the corresponding data string, using it as the encoding result of the i-th data string of the file data A, and ending the operation; otherwise, proceeding to B8;
B8. Using the i-th data string of the file data A directly as the encoding result of the i-th data string of the file data A, and ending the operation;
B9. Inserting the file data A together with its fingerprint and fingerprint feature $F_A$ into the fingerprint hash table, and ending the operation.
4. The file data compression method according to claim 3, characterized in that the fingerprint feature of file data is a preset number of bits at preset positions in the fingerprint of the file data.
5. The file data compression method according to claim 4, characterized in that the fingerprint feature of file data is the bits in the latter half of the fingerprint of the file data.
6. The file data compression method according to any one of claims 3-5, characterized in that the hash code category corresponding to the hash code of the i-th data string is determined from the result of converting that hash code to decimal and from K, where K is the number of hash code features in the preset seed feature set.
7. An electronic device, characterized in that it comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the method according to any one of claims 1-6.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored computer program, wherein the computer program, when executed by a processor, controls the device in which the storage medium is located to perform the method according to any one of claims 1-6.