A file deduplication storage method based on a hash algorithm

By employing a three-level hash verification mechanism and hierarchical hash index management, the problems of high hash collision risk and rapid memory expansion in existing file deduplication storage technologies are solved, achieving efficient file deduplication storage and improving the system's scalability and deduplication rate.

CN122240576A · Pending · Publication Date: 2026-06-19 · QUANZHOU INST OF INFORMATION ENG

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: QUANZHOU INST OF INFORMATION ENG
Filing Date: 2026-05-25
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing file deduplication storage technologies based on hash algorithms suffer from problems such as high hash collision risk, high computational cost, rapid increase in memory usage, limited system scalability, decreased deduplication rate across versions, and lack of atomicity guarantee in reference count management.

Method used

A three-level hash verification mechanism is combined with adaptive content-based block division and hierarchical hash index management. Fast hashing and a Bloom filter quickly screen out file blocks that are definitely not duplicates, strong hash values provide accurate verification, the hierarchical hash index controls memory usage, and a write-ahead log mechanism ensures the atomicity of reference count changes.

Benefits of technology

It significantly reduces the memory footprint of hash indexes, improves deduplication rate, ensures data accuracy and system scalability, reduces computational overhead, and reduces the interference of deletion operations on read and write operations through a delayed recycling strategy.




Abstract

This invention discloses a file deduplication storage method based on a hash algorithm, comprising: selecting block parameters for the file to be stored according to file type, and performing content-based variable-length block segmentation; performing three-level hash verification on each file block in sequence, pre-screening with a fast hash query Bloom filter, performing strong hash precise verification on suspected duplicate blocks, and performing byte-level sampling comparison as a fallback on strong hash hit blocks, generating three types of verification conclusion signals: new block, duplicate block, or collision; and performing new block writing or deduplication reference according to the verification conclusion signal, updating the hierarchical hash index and triggering dynamic migration of hot, warm, and cold index intervals. This invention effectively controls the risk of hash collisions while reducing the memory usage of the hash index and improving the file deduplication rate.


Description

Technical Field

[0001] This invention relates to the field of computer storage technology, specifically to a file deduplication storage method based on a hash algorithm.

Background Technology

[0002] Existing file deduplication technologies based on hash algorithms typically use a single hash function to calculate hash values for files or file blocks, identifying duplicate data by comparing these hash values. The above approach has the following shortcomings: First, using a weak hash function carries a high risk of hash collisions, which may lead to file blocks with different content being misjudged as duplicates, resulting in data corruption. Using a strong hash function incurs high computational costs, creating significant performance bottlenecks in large-scale file writing scenarios.

[0003] Secondly, hash indexes are typically loaded entirely into memory. As storage scale increases, the memory footprint of the index expands dramatically, limiting system scalability.

[0004] Third, existing block partitioning schemes mostly use fixed-length blocks. When local content of a file changes, all subsequent block boundaries shift, resulting in a significant decrease in the cross-version deduplication rate. Fourth, reference count management lacks atomicity guarantees when deleting files, which can easily lead to orphaned blocks or accidental deletion when the system fails.

[0005] To address the aforementioned issues, it is necessary to propose a file deduplication storage method based on a hash algorithm.

Summary of the Invention

[0006] The purpose of this invention is to solve the problems existing in the background art and to propose a file deduplication storage method based on hash algorithm.

[0007] The objective of this invention can be achieved through the following technical solutions: This invention provides a file deduplication storage method based on a hash algorithm, comprising the following steps: Step S1: The processor receives the file stream of the file to be stored, reads the identifier byte in the file header, compares the identifier byte with the file type feature table pre-stored in the memory, and determines the file type of the file to be stored. The processor reads the block parameter group corresponding to the file type from the file type feature table. The block parameter group includes the minimum block length lower limit of the sliding window, the maximum block length upper limit, and the boundary trigger modulus of the rolling hash. The processor writes the block parameter group into the register of the block processing unit. The block processing unit slides the file stream byte by byte using a sliding window of a preset byte length, continuously calculates the Rabin rolling fingerprint value, and determines whether the result of taking the modulo of the Rabin rolling fingerprint value with the boundary trigger modulo value is equal to the preset boundary trigger value and whether the number of bytes processed is greater than the minimum block length lower limit. 
If both conditions hold, that is, the Rabin rolling fingerprint value modulo the boundary trigger modulo value equals the preset boundary trigger value and the number of bytes processed is greater than the minimum block length lower limit, the current position is marked as a block boundary and a file block is formed. If the number of bytes processed consecutively reaches the maximum block length upper limit without both conditions holding, the file stream is forcibly truncated at the current position to form a file block. The above process continues until the file stream is fully processed, outputting a sequence of file blocks and a list of file block descriptors. The list of file block descriptors records the offset position and byte length of each file block in the file to be stored.

[0008] Step S2: The processor sequentially retrieves each file block and its descriptor from the file block descriptor list, and performs a three-level hash check on each file block sequentially. In the first stage, the processor calls a fast hash function to calculate a fast hash value for the byte sequence of the file block, queries the Bloom filter with the fast hash value, and if the Bloom filter returns a non-existent signal, the processor generates a verification conclusion signal as a new block signal and writes the file block, the offset position of the file block, the byte length of the file block, the fast hash value, and the new block signal into the verification conclusion queue; if the Bloom filter returns an exist signal, the process proceeds to the second stage. In the second stage, the processor calls a strong hash function to calculate a strong hash value for the byte sequence of the file block, and uses the strong hash value to query the precise index table. If there is no record corresponding to the strong hash value in the precise index table, the processor generates a verification conclusion signal as a new block signal. If it exists, it proceeds to the third stage. In the third stage, the processor reads the physical address of the stored original block from the precise index table, reads the byte sequence of the original block from the storage medium, and performs a sampling position byte comparison between the byte sequence of the file block and the original block at a preset sampling interval. If all sampling position bytes are consistent, the processor generates a verification conclusion signal as a duplicate block signal and writes the physical address of the original block into the verification conclusion queue. If there are inconsistent sampling position bytes, the processor generates a verification conclusion signal as a collision signal and processes the file block according to the new block signal branch. 
The processor writes the verification conclusion signal, the fast hash value, the strong hash value, the offset position, the byte length of the file block, and the physical address of the original block corresponding to the duplicate block signal entry into the verification conclusion queue. The new block signal entry directly determined by the first level does not carry the strong hash value. In step S3, the strong hash value is calculated for this type of entry when performing new block writing.

[0009] Step S3: The processor sequentially retrieves each entry from the verification conclusion queue and performs branch processing according to the type of the verification conclusion signal. For a new block signal branch, the processor calculates the strong hash value for any new block signal entry determined directly by the first level and writes the byte sequence of the file block to the storage medium, and the storage controller returns the physical address of the file block. The processor then inserts the fast hash value into the Bloom filter and creates a new record in the precise index table with the strong hash value as the key; the value of the new record includes the physical address of the file block, the byte length of the file block, the initial value of the reference count, the initial value of the access frequency count, and the access timestamp at the current moment. The processor also creates a record in the file block mapping table containing the offset position of the file block, the physical address of the file block, the byte length of the file block, and the strong hash value, and writes these four items as a storage mapping record to the output queue. For a duplicate block signal branch, the processor looks up the record corresponding to the strong hash value in the precise index table, increments its reference count by one, updates the access frequency count and the access timestamp, creates a record in the file block mapping table containing the offset position of the file block, the physical address of the original block, the byte length of the file block, and the strong hash value, and writes these four items as a storage mapping record to the output queue. For a collision signal branch, the processor executes the full new block signal branch and writes the collision event into the collision event log.

[0010] After each branch has been processed, the processor performs inter-level migration of the records in the exact index table according to the following migration rules: The precise index table is divided into a hot index area stored in DRAM, a warm index area stored in NVMe SSD, and a cold index area stored in HDD. If the access timestamp of a record in the hot index area is more than a preset cooling threshold from the current time, the record is migrated to the warm index area. If the access timestamp of a record in the warm index area is more than a preset archiving threshold from the current time, the record is migrated to the cold index area. If the access frequency count of a record in the warm index area or the cold index area exceeds a preset hot spot threshold, the record is promoted to the hot index area.
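The three migration rules above can be sketched as a single decision function. This is a minimal Python illustration under stated assumptions: the record fields, the function name next_tier, the threshold defaults, and the choice to test promotion before demotion are all hypothetical and are not specified by the source.

```python
from dataclasses import dataclass


@dataclass
class IndexRecord:
    tier: str                 # "hot" (DRAM), "warm" (NVMe SSD) or "cold" (HDD)
    access_freq: int          # cumulative access frequency count
    last_access_time: float   # Unix timestamp of the last access


def next_tier(rec: IndexRecord, now: float,
              cooling: float = 3600.0,      # assumed cooling threshold (s)
              archiving: float = 86400.0,   # assumed archiving threshold (s)
              hot_spot: int = 100) -> str:  # assumed hot spot threshold
    """Return the tier a record should occupy after one migration pass."""
    idle = now - rec.last_access_time
    # Promotion is checked first here; the source does not state a rule order.
    if rec.tier in ("warm", "cold") and rec.access_freq > hot_spot:
        return "hot"
    if rec.tier == "hot" and idle > cooling:
        return "warm"
    if rec.tier == "warm" and idle > archiving:
        return "cold"
    return rec.tier
```

Applied literally, a record whose cumulative count already exceeds the hot spot threshold is promoted again on the next pass even when it is idle. The source does not say how this interaction is resolved, so the ordering above should be read as one possible interpretation.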

[0011] Step S4: The processor arranges all storage mapping records in the output queue in ascending order by offset position, constructs the file block mapping table, generates a file unique identifier by hashing the file path and creation timestamp, concatenates the strong hash values of each file block in order and generates a file-level strong hash value by hashing again, and constructs a file metadata record by combining the file unique identifier, the total number of bytes in the file, the total number of file blocks, the storage address of the file block mapping table, and the file-level strong hash value. The file metadata record is written into the file metadata index table in the metadata storage area with the file unique identifier as the primary key. For file deletion requests, the processor reads the file metadata record from the file metadata index table, traverses each record in the file block mapping table, retrieves the strong hash value in each record, searches for the corresponding record in the precise index table, and decrements its reference count by one. If the reference count is zero after decrementing, the corresponding physical address is written into the delayed reclamation queue and the record in the precise index table is marked as pending reclamation. Before each reference count change is executed, the processor appends a log entry containing the strong hash value, the reference count change, and the operation time to the write-ahead log file and forces it to disk, and then executes the reference count update of the exact index table. When the system I / O load is lower than a preset load threshold, the background garbage collection process reads the physical addresses in the delayed collection queue in batches, performs the deletion of physical blocks and the reclamation of storage space, deletes the corresponding records to be reclaimed from the precise index table, and rebuilds the Bloom filter.
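The log-then-update ordering for reference count changes in step S4 might look like the following minimal Python sketch. The JSON-lines log format, the dictionary-based index, and the names log_ref_change and release_block are assumptions made for illustration; only the ordering (append and force the log entry to disk, then change the count, then queue the address when the count reaches zero) comes from the text.

```python
import json
import os
import time
from typing import Dict, List


def log_ref_change(wal_path: str, strong_hash: str, delta: int) -> None:
    """Append one write-ahead log entry and force it to disk."""
    entry = {"hash": strong_hash, "delta": delta, "time": time.time()}
    with open(wal_path, "a", encoding="utf-8") as wal:
        wal.write(json.dumps(entry) + "\n")
        wal.flush()
        os.fsync(wal.fileno())   # the entry is durable before the index changes


def release_block(index: Dict[str, dict], reclaim_queue: List[int],
                  wal_path: str, strong_hash: str) -> None:
    """Drop one reference to a block during file deletion."""
    log_ref_change(wal_path, strong_hash, -1)   # log first
    rec = index[strong_hash]
    rec["ref_count"] -= 1                       # then update the index
    if rec["ref_count"] == 0:
        rec["pending_reclaim"] = True           # mark as pending reclamation
        reclaim_queue.append(rec["addr"])       # physical deletion is deferred
```

In this sketch the physical block is never deleted inline; a separate background task would drain reclaim_queue when the I/O load is low, as the text describes.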

[0012] Step S5: The processor starts a background self-test task according to a preset self-test cycle and traverses all records in the precise index table. For each record, it reads the byte sequence of the corresponding physical block from the storage medium, recalculates the strong hash value, and compares the result with the strong hash value key stored in the precise index table. If the two are inconsistent, it generates an abnormal signal, writes the strong hash value, the physical address stored in the record, and the time of fault discovery into the fault entry log, and marks the record as pending repair. The processor initializes a new Bloom filter, traverses all records in the hot index area whose reference count is greater than zero and whose status is not pending reclamation, extracts the fast hash value of each and inserts it into the newly initialized Bloom filter in sequence, and after reconstruction atomically replaces the original Bloom filter with the reconstructed Bloom filter under a read-write lock. The processor counts the number of fault entries, the total number of blocks to be reclaimed, the estimated false positive rate of the Bloom filter, and the distribution ratio of the number of records in the hot index area, the warm index area, and the cold index area. It generates a system health status report and outputs it to the monitoring interface. If the number of fault entries is greater than zero, a repair signal is generated. If the estimated false positive rate exceeds the preset false positive rate limit, a capacity expansion signal is generated.
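Two parts of the self-test in step S5 lend themselves to a short sketch: the integrity scan and the false positive estimate that drives the capacity expansion signal. The Python below is illustrative only. It assumes SHA-256 as the strong hash, a dictionary mapping strong hash to physical address, and the standard approximation (1 - e^(-kn/m))^k for a Bloom filter with m bits, k hash functions, and n inserted items; the source does not state which estimator it uses, and the function names and the default limit are hypothetical.

```python
import hashlib
import math
from typing import Dict, List, Tuple


def scan_for_faults(exact_index: Dict[str, int],
                    storage: Dict[int, bytes]) -> List[Tuple[str, int]]:
    """Recompute each block's strong hash and list the mismatching records."""
    faults = []
    for strong, addr in exact_index.items():
        if hashlib.sha256(storage[addr]).hexdigest() != strong:
            faults.append((strong, addr))   # candidate "pending repair" entry
    return faults


def estimated_fpr(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard Bloom filter false positive approximation."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def health_signals(fault_count: int, fpr: float,
                   fpr_limit: float = 0.01) -> Dict[str, bool]:
    """Map the report figures to the two signals named in the text."""
    return {"repair": fault_count > 0, "expand": fpr > fpr_limit}
```

The estimate grows with the number of inserted items, so a filter that is rebuilt at the same size over a growing index would eventually cross any fixed limit and raise the expansion signal.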

[0013] Compared with the prior art, the beneficial effects of the present invention are as follows. The invention employs a three-level hash verification mechanism that uses fast hashing and a Bloom filter to quickly eliminate non-duplicate file blocks, limiting strong hash calculations to a small number of suspected duplicates and thus significantly reducing computational overhead while ensuring data accuracy. Content-based adaptive block segmentation aligns block boundaries with the file's semantic structure, improving cross-version deduplication rates. Hierarchical hash indexing keeps hot fingerprints in memory and low-frequency fingerprints on SSDs and HDDs, effectively controlling memory usage. A write-ahead log mechanism ensures the atomicity of reference count changes and, combined with a delayed garbage collection strategy, reduces the interference of deletion operations on read and write operations.

Attached Figure Description

[0014] To facilitate understanding by those skilled in the art, the present invention is further described below with reference to the accompanying drawings. Figure 1 is a flowchart illustrating the steps of a file deduplication storage method based on a hash algorithm proposed in an embodiment of the present invention.

Detailed Implementation

[0015] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0016] This invention provides a file deduplication storage method based on a hash algorithm. During the file writing process to the storage system, through the coordinated efforts of hierarchical hash verification, adaptive content segmentation, and layered hash index management, it effectively controls the risk of hash collisions while significantly reducing the memory footprint of the hash index and improving the deduplication rate. These are described in detail below.

[0017] Please refer to Figure 1, which is a flowchart illustrating a file deduplication storage method based on a hash algorithm disclosed in an embodiment of the present invention.

[0018] As shown in Figure 1, the file deduplication storage method may include the following steps:

S1. Identify the file type of the file to be stored, select the corresponding block parameters according to the file type, perform adaptive content block processing on the file to be stored, and output the file block sequence.

[0019] Specifically, the processor receives the file stream of the file to be stored, reads the identifier bytes in the file header, compares them with a file type feature table pre-stored in memory, and determines the file type of the file to be stored. The file type feature table stores the block parameter set corresponding to each file type. The block parameter set includes at least the minimum block length lower limit min_size, the maximum block length upper limit max_size, and the boundary trigger modulo value mod_val of the rolling hash.

[0020] Based on the identified file type, the processor reads the corresponding block parameter group from the file type feature table and writes the minimum block length lower limit, maximum block length upper limit, and boundary trigger modulus of the sliding window as block control parameters into the register of the block processing unit.

[0021] The block processing unit slides over the file stream byte by byte using a sliding window of a preset byte length, continuously calculates the Rabin rolling fingerprint value R of the window content, and determines whether R modulo the boundary trigger modulo value mod_val equals the preset boundary trigger value and whether the number of bytes processed is greater than the minimum block length lower limit min_size.

[0022] If both conditions hold, that is, the Rabin rolling fingerprint value modulo the boundary trigger modulo value equals the preset boundary trigger value and the number of bytes processed is greater than the minimum block length lower limit, the current position is marked as a block boundary and a file block is formed. If the number of bytes processed consecutively reaches the maximum block length upper limit without both conditions holding, the file stream is forcibly truncated at the current position to form a file block. The above process continues until the file stream is fully processed, and finally outputs a sequence of file blocks together with a list of file block descriptors that records the offset position and byte length of each file block in the original file. This list is passed as the input parameter to the subsequent step S2.
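The boundary rule of step S1 can be sketched as follows. This is a minimal Python illustration, not the patent's implementation: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and WINDOW, BASE, MOD, the function name cdc_chunks, and the default trigger value of 0 are assumptions made for the example. The parameters min_size, max_size, and mod_val correspond to the block parameter set named in the text.

```python
from typing import List, Tuple

WINDOW = 48           # assumed sliding-window length in bytes
BASE = 257            # assumed polynomial base
MOD = (1 << 61) - 1   # assumed prime modulus for the fingerprint


def cdc_chunks(data: bytes, min_size: int, max_size: int,
               mod_val: int, trigger: int = 0) -> List[Tuple[int, int]]:
    """Return (offset, length) descriptors for content-defined chunks."""
    descriptors = []
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    start = 0                          # offset where the current chunk begins
    fp = 0                             # rolling fingerprint of the window
    for i, byte in enumerate(data):
        n = i - start + 1              # bytes processed in the current chunk
        if n > WINDOW:                 # window is full: drop the oldest byte
            fp = (fp - data[i - WINDOW] * top) % MOD
        fp = (fp * BASE + byte) % MOD
        at_boundary = fp % mod_val == trigger and n > min_size
        if at_boundary or n >= max_size:   # content boundary or forced cut
            descriptors.append((start, n))
            start = i + 1
            fp = 0
    if start < len(data):              # trailing bytes form the last chunk
        descriptors.append((start, len(data) - start))
    return descriptors
```

The returned descriptors tile the input without gaps or overlaps, every chunk is at most max_size bytes, and every chunk except possibly the last is longer than min_size.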

[0023] It should be noted that this step uses content-defined variable-length chunking (CDC), which uses the rolling fingerprint of the file content itself as the basis for determining chunk boundaries. This ensures that a change to local content of the file affects only a small number of adjacent chunks, while the content and boundaries of the remaining chunks stay unchanged. As a result, when a file undergoes version iterations or local modifications, most chunks can still be deduplicated across versions, which improves the actual deduplication rate.

[0024] As an optional implementation, the block parameters are bound to the file type, rather than using globally uniform fixed parameters.

[0025] Specifically, for plain text files (such as .txt and .log), the minimum block length of the sliding window can be set to 2KB and the maximum block length can be set to 16KB to capture repeated segments of text content with finer granularity. For video files (such as .mp4 and .mkv), the minimum block length of the sliding window can be set to 64KB and the maximum block length can be set to 1MB, so as to avoid excessively frequent boundary triggering due to the complexity of video encoding. For database backup files (such as .bak and .dump), the minimum block length of the sliding window can be set to 8KB, and the maximum block length can be set to 128KB to match the typical size of a database page. These parameters are all stored in the file type characteristic table, and this embodiment of the invention does not limit the specific values.
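A file type feature table of this kind might be represented as below. This is a hedged Python sketch: the size limits follow the example values in the text, while the mod_val values, the fallback parameters, the signature list, the printable-ASCII heuristic for text, and all names are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Optional

KB = 1024


@dataclass(frozen=True)
class ChunkParams:
    min_size: int
    max_size: int
    mod_val: int


# Size limits follow the example values in the text; mod_val is assumed.
FEATURE_TABLE = {
    "text": ChunkParams(2 * KB, 16 * KB, 8 * KB),
    "video": ChunkParams(64 * KB, 1024 * KB, 256 * KB),
    "db_backup": ChunkParams(8 * KB, 128 * KB, 32 * KB),
}
DEFAULT_PARAMS = ChunkParams(4 * KB, 64 * KB, 16 * KB)  # assumed fallback

# (offset, identifier bytes, file type); an illustrative, incomplete list.
SIGNATURES = [
    (0, b"\x1a\x45\xdf\xa3", "video"),   # Matroska/EBML header
    (4, b"ftyp", "video"),               # ISO base media (MP4) box type
    (0, b"PGDMP", "db_backup"),          # PostgreSQL custom-format dump
]


def detect_type(header: bytes) -> Optional[str]:
    """Match the file header against the known identifier bytes."""
    for offset, magic, file_type in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return file_type
    # Heuristic assumption: a printable ASCII header is treated as text.
    if header and all(b in b"\t\n\r" or 32 <= b < 127 for b in header):
        return "text"
    return None


def params_for(header: bytes) -> ChunkParams:
    """Look up the block parameter set for a file header."""
    return FEATURE_TABLE.get(detect_type(header), DEFAULT_PARAMS)
```

Plain text has no identifier bytes of its own, which is why the sketch falls back to a content heuristic for that type; the source does not describe how such files are recognized.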

[0026] It should be noted that combining block parameters with file type awareness allows block boundaries to align as closely as possible with the logical semantic structure of the file, thereby achieving near-optimal deduplication results across different file types and avoiding the problem of low deduplication rates for some file types when using globally uniform block parameters.

[0027] S2. Perform hierarchical hash verification on each file block in the file block sequence in sequence, generate the verification conclusion signal corresponding to each file block, and write the verification conclusion signal and the corresponding file block descriptor into the verification conclusion queue.

[0028] Specifically, the processor sequentially extracts file blocks and their descriptors (including the offset position and byte length of each file block in the original file) from the file block sequence output in step S1, and performs the following hierarchical hash verification process on the file blocks:

Level 1: Fast hash pre-filtering.

[0029] The processor calls a non-cryptographic fast hash function (xxHash64 or MurmurHash3 can be used in this embodiment of the invention; the specific hash function is not limited) over the complete byte sequence of the file block and outputs a 64-bit fast hash value.

[0030] The processor then uses the 64-bit fast hash value to query the fast hash Bloom filter residing in the hot index area of memory (see step S3).

[0031] If the fast hash Bloom filter returns a "not found" signal, the processor directly generates a verification conclusion signal = NEW (indicating that the block is confirmed to be a new, non-duplicate block), writes (file block, offset position of file block in the original file, byte length of file block, 64-bit fast hash value, verification conclusion signal = NEW) into the verification conclusion queue, and skips the second and third levels, leaving the entry to be handled by the corresponding branch of step S3. If the fast hash Bloom filter returns an "exists" signal, the process proceeds to the second-level strong hash precise verification.
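A first-level pre-filter of this kind could be sketched as below. This is an illustrative Python example under stated assumptions: xxHash64 and MurmurHash3 are not in the Python standard library, so BLAKE2b truncated to 8 bytes is used purely as a stand-in 64-bit hash, and the filter size, the number of hash functions, and the double-hashing scheme are choices made for the example rather than details from the source.

```python
import hashlib
from typing import List


def fast_hash64(data: bytes) -> int:
    """Stand-in 64-bit hash (truncated BLAKE2b, not xxHash64)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, h: int) -> List[int]:
        # Double hashing: derive k bit positions from the two 32-bit halves.
        h1 = h & 0xFFFFFFFF
        h2 = (h >> 32) | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, h: int) -> None:
        for p in self._positions(h):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, h: int) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(h))
```

A False result corresponds to the "not found" signal and is never wrong, which is what allows the NEW conclusion to be issued without a strong hash; a True result may be a false positive, which is why the later levels follow.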

[0032] Level 2: Strong hash precise verification.

[0033] The processor calls a cryptographic-strength hash function (SHA-256 can be used in this embodiment to output a 256-bit strong hash value; this embodiment does not limit the specific hash function) to calculate the strong hash value of the file block. The processor then looks up the strong hash value in the precise index table of the hierarchical hash index.

[0034] If no record corresponding to a strong hash value exists in the exact index table, the processor generates a verification conclusion signal =NEW; if a record corresponding to a strong hash value exists in the exact index table, the process proceeds to the third-level byte-level sampling comparison.

[0035] Level 3: Byte-level sampling comparison as a fallback.

[0036] In the case of a strong hash lookup hit, the processor reads the physical address of the original block (addr_ref) from the exact index table, reads the content of the original block (chunk_ref) from the storage medium, and performs byte-level sampling comparison between the file block and the original block at a preset sampling interval (step_sample).

[0037] In this embodiment of the invention, the preset sampling interval can be set to 64 bytes, so that one byte is compared at every 64th position and the number of sampled positions is approximately the file block length divided by 64. This embodiment of the invention does not limit the specific value of the preset sampling interval.
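The sampled comparison can be expressed compactly with strided slicing. The sketch below is a minimal Python illustration; the function name sampled_equal and the explicit length check are assumptions, while step_sample matches the parameter named in the text.

```python
def sampled_equal(chunk: bytes, chunk_ref: bytes,
                  step_sample: int = 64) -> bool:
    """Compare two blocks at every step_sample-th byte position."""
    if len(chunk) != len(chunk_ref):   # assumed: differing lengths never match
        return False
    # Positions 0, step_sample, 2 * step_sample, ... are compared.
    return chunk[::step_sample] == chunk_ref[::step_sample]
```

A difference that falls between sampled positions is not detected by this check alone, so in this sketch it only makes sense as a fallback behind a strong hash match, as the text positions it.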

[0038] If all the bytes at all sampled locations are identical, the processor generates a verification conclusion signal = DUP (indicating that the block is a duplicate block, confirming the duplicate), and writes the physical address of the original block as an additional parameter into the verification conclusion queue. If any sampled bytes are inconsistent, it is determined to be a hash collision. The processor generates a verification conclusion signal = COLLISION, records the collision event log, and processes the file block according to the branch with verification conclusion signal = NEW.

[0039] The aforementioned verification conclusions (file block, 64-bit fast hash value, verification conclusion signal, offset position of the file block in the original file, byte length of the file block, and strong hash value and physical address of the original block when verification conclusion signal = DUP) are completely written into the verification conclusion queue as input parameters for step S3. Among these, entries with verification conclusion signal = NEW and directly determined by the first level do not carry strong hash values. Step S3, when performing new block writing, calculates strong hash values for these entries and writes them into the precise index table.
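Putting the three levels together, the decision flow of step S2 might be sketched as follows. This Python example is illustrative only: a plain set stands in for the Bloom filter (so it has no false positives), dictionaries stand in for the precise index table and the storage medium, truncated BLAKE2b stands in for the fast hash, and the names Verdict and verify_block are hypothetical.

```python
import hashlib
from enum import Enum
from typing import Dict, Optional, Set, Tuple


class Verdict(Enum):
    NEW = "NEW"
    DUP = "DUP"
    COLLISION = "COLLISION"


def verify_block(chunk: bytes, bloom: Set[int], exact_index: Dict[str, int],
                 storage: Dict[int, bytes], step_sample: int = 64
                 ) -> Tuple[Verdict, int, Optional[str], Optional[int]]:
    """Return (verdict, fast hash, strong hash or None, addr_ref or None)."""
    # Level 1: fast hash pre-filter (a set stands in for the Bloom filter).
    fast = int.from_bytes(hashlib.blake2b(chunk, digest_size=8).digest(), "big")
    if fast not in bloom:
        return Verdict.NEW, fast, None, None   # strong hash deferred to S3
    # Level 2: strong hash lookup in the precise index table.
    strong = hashlib.sha256(chunk).hexdigest()
    addr_ref = exact_index.get(strong)
    if addr_ref is None:
        return Verdict.NEW, fast, strong, None
    # Level 3: sampled byte comparison against the stored original block.
    chunk_ref = storage[addr_ref]
    same = (len(chunk) == len(chunk_ref)
            and chunk[::step_sample] == chunk_ref[::step_sample])
    if same:
        return Verdict.DUP, fast, strong, addr_ref
    return Verdict.COLLISION, fast, strong, None
```

The returned tuple mirrors the queue entry described above: a NEW verdict from level 1 carries no strong hash, and only a DUP verdict carries the physical address of the original block.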

[0040] It should be noted that existing technologies typically rely on a single hash function for deduplication. Weak hashing carries the risk of collisions, while strong hashing, due to its high computational cost, creates a significant performance bottleneck during large-scale file writing. This invention's three-level verification mechanism uses fast hashing with extremely low computational cost as the first filtering barrier. A Bloom filter eliminates the vast majority of definitively non-duplicate file blocks in O(1) time, ensuring that strong hashing calculations are only triggered in a few "suspected duplicate" cases, thus significantly reducing the overall frequency of strong hashing calculations. The introduction of byte-level sampling comparison in the third level addresses the extremely low probability of strong hash collisions, serving as a final safety net to ensure data accuracy throughout the deduplication process. The coordinated operation of the three-level verification mechanism achieves an optimal balance between security and computational performance.

[0041] As an optional implementation, the processor can further dynamically adjust the verification level based on the file's security level identifier field (which can be written by the caller in the metadata of the file write request).

[0042] For files with a high security level (HIGH), when the first-level Bloom filter returns a "not found" signal, the processor does not directly generate a verification conclusion signal = NEW. Instead, it first calls the cryptographic-strength hash function to calculate the strong hash value, and only then generates the verification conclusion signal = NEW and writes it into the queue. In addition, the second and third levels of verification are forced on regardless of the query result of the first-level Bloom filter. For files with a low security level (LOW), the processor directly generates a verification conclusion signal = NEW when the first level returns "not found," and performs only the second-level strong hash verification when it returns "exists," skipping the third-level sampling comparison to reduce processing latency. This embodiment of the invention does not limit the classification method of security levels or their mapping relationship with verification levels.
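The mapping from security level to verification levels could be captured in a small policy function. The Python sketch below is one reading of the paragraph above: the level name "NORMAL" for files without a HIGH or LOW marking and the function name levels_to_run are assumptions, and level 3 in the returned tuple means only that the sampled comparison is permitted if level 2 finds a record.

```python
from typing import Tuple


def levels_to_run(security_level: str, bloom_hit: bool) -> Tuple[int, ...]:
    """Return the verification levels enabled for one file block."""
    if security_level == "HIGH":
        # Levels 2 and 3 are forced regardless of the Bloom filter result.
        return (1, 2, 3)
    if not bloom_hit:
        # "Not found": conclude NEW directly after level 1.
        return (1,)
    if security_level == "LOW":
        # Skip the sampled comparison to reduce latency.
        return (1, 2)
    return (1, 2, 3)   # assumed default: the full flow of step S2
```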

[0043] S3. Read the verification conclusions of each file block sequentially from the verification conclusion queue, and perform the corresponding index update operation or deduplication operation according to the type of verification conclusion signal. At the same time, complete the dynamic migration management of the hierarchical hash index and output the final storage mapping record of each file block.

[0044] Specifically, the processor sequentially retrieves entries (file block, 64-bit fast hash value, strong hash value, verification conclusion signal, offset position of the file block in the original file, byte length of the file block, and physical address of the original block when verification conclusion signal = DUP) from the verification conclusion queue output in step S2, and executes the following branch processing based on the value of the verification conclusion signal:

Branch 1: Verification conclusion signal = NEW (new block write). For an entry with verification conclusion signal = NEW that was determined directly by the first level in step S2 (the strong hash value field in the queue is empty), the processor first calls the cryptographic-strength hash function to calculate the strong hash value of the file block before performing the write operation, and fills in the strong hash value field of the entry. For an entry with verification conclusion signal = NEW that came from the second-level strong hash precise verification, the strong hash value has already been calculated in step S2, and the processor reads and uses it directly.

[0045] The processor writes the byte sequence of the file block to the storage medium, and the storage controller returns the physical address of the file block, addr_new. The processor then performs the following index update operations:

Update operation 1: Insert the 64-bit fast hash value into the Bloom filter.

Update operation 2: Create a new record in the precise index table with the strong hash value as the key and (physical address of the file block, byte length of the file block, reference count = 1, access frequency count = 1, last access timestamp = block write time) as the value.

It's important to note that `ref_count=1` means the reference count is initialized to 1. When a block is first written, only that file references it, so the reference count starts from 1. Each subsequent file that hits this block increments the reference count (`ref_count++`); each deleted file that references this block decrements the reference count (`ref_count--`); when the count reaches zero, delayed garbage collection is triggered.

[0046] It should be noted that the access frequency count (access_freq) records the cumulative number of times the physical block has been hit as a deduplication reference since it was written, and is initialized to 1. It is a cumulative count that only increases: it is incremented by one each time a newly written file hits this block, and it is not decremented when a referencing file is later deleted.

[0047] It's important to note that the last access timestamp `last_access_time` records the Unix timestamp (floating-point seconds) when the physical block was last hit for deduplication, initialized to the block write time `T_now`. This field works in conjunction with the access frequency count to drive the hierarchical migration logic. A block with a high access frequency count but a very old last access timestamp indicates that it was once a hotspot but may no longer be accessed (e.g., an image of an older version of software) and should be gradually moved down the hierarchy.

[0048] Update operation 3: Create a record for the current file in the file block mapping table (offset position of the file block in the original file, physical address of the file block, byte length of the file block, strong hash value of the file block), indicating that the content of the file block at the offset position in the original file is stored at the physical address of the file block. The processor writes (offset position of the file block in the original file, physical address of the file block, byte length of the file block, strong hash value of the file block) as the final storage mapping record of the current block to the output queue.

[0049] Branch 2: verification conclusion signal = DUP (duplicate block deduplication): The processor does not perform any data write operation. Instead, it directly looks up the record corresponding to the strong hash value in the exact index table, performs an atomic increment operation (ref_count++) on its reference count field, increments the access frequency count, and updates the most recent access timestamp to the time at which the duplicate block is processed.

[0050] The processor creates a record for the current file in the file block mapping table (offset of the file block in the original file, physical address of the original block, byte length of the file block, and strong hash value of the file block), so that the current file's block directly references the already stored original block. The processor writes (offset of the file block in the original file, physical address of the original block, byte length of the file block, and strong hash value of the file block) as the final storage mapping record for the current block to the output queue.

[0051] Branch 3: verification conclusion signal = COLLISION: The processor handles the entry in the same way as the NEW branch, and additionally appends a record to the collision event log, containing (64-bit fast hash value, strong hash value, physical address of the file block, and block write time), for subsequent audit analysis.
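As a minimal, non-limiting sketch, the three branches above could be dispatched as follows. The names `DedupStore`, `IndexRecord`, and `process` are hypothetical, a `bytearray` stands in for the storage medium, a `set` stands in for the Bloom filter, and SHA-256 is assumed as the strong hash. The text does not state how the exact index table should key a colliding block that shares a strong hash with an existing record, so this sketch leaves the existing record untouched and only logs the event; that choice is an assumption.

```python
import hashlib
import time
from dataclasses import dataclass


@dataclass
class IndexRecord:
    addr: int                # physical address of the file block
    length: int              # byte length of the file block
    ref_count: int           # number of file references to this block
    access_freq: int         # cumulative hit count, never decremented
    last_access_time: float  # Unix timestamp of the most recent hit


class DedupStore:
    """Toy in-memory stand-in for the storage medium and index structures."""

    def __init__(self):
        self.medium = bytearray()   # hypothetical append-only storage medium
        self.bloom_keys = set()     # stand-in for the Bloom filter
        self.exact_index = {}       # strong hash -> IndexRecord
        self.collision_log = []     # (fast_hash, strong_hash, addr, time)
        self.block_map = []         # (offset, addr, length, strong_hash)

    def process(self, block, fast_hash, strong_hash, signal, offset, now=None):
        now = time.time() if now is None else now
        if signal == "DUP":
            # Branch 2: no data write, only reference bookkeeping.
            rec = self.exact_index[strong_hash]
            rec.ref_count += 1
            rec.access_freq += 1
            rec.last_access_time = now
            self.block_map.append((offset, rec.addr, rec.length, strong_hash))
            return rec.addr
        # Branches 1 and 3 both write the block to the medium.
        if strong_hash is None:
            # NEW decided at the first level: the strong hash is still empty.
            strong_hash = hashlib.sha256(block).hexdigest()
        addr_new = len(self.medium)
        self.medium += block
        self.bloom_keys.add(fast_hash)                      # update operation 1
        if signal == "COLLISION":
            # Assumption: keep the existing index record, log for later audit.
            self.collision_log.append((fast_hash, strong_hash, addr_new, now))
        else:
            self.exact_index[strong_hash] = IndexRecord(    # update operation 2
                addr_new, len(block), 1, 1, now)
        self.block_map.append(                              # update operation 3
            (offset, addr_new, len(block), strong_hash))
        return addr_new
```

In this sketch the returned value is the physical address recorded in the storage mapping record for the block.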

[0052] After the above three branches are processed, step S3 further performs dynamic migration management on the hierarchical hash index.

[0053] In this embodiment, the hash index is divided into three storage areas: a hot index area (stored in DRAM, holding block fingerprint records that are accessed frequently), a warm index area (stored on NVMe SSD, holding block fingerprint records that are accessed at medium frequency), and a cold index area (stored on HDD, holding block fingerprint records that are accessed infrequently).

[0054] After processing each verification conclusion signal, the processor reads the access frequency count and most recent access timestamp fields of the exact index table record involved in this operation and triggers inter-level migration of index entries according to the following migration rules: If the time elapsed since the most recent access to a record in the hot index area exceeds the preset cooling threshold (in this embodiment of the invention, the preset cooling threshold can be set to 24 hours), the record is moved from the hot index area to the warm index area, and the corresponding DRAM is released. If the time elapsed since the most recent access to a record in the warm index area exceeds the preset archiving threshold T_archive (in this embodiment of the invention, the preset archiving threshold can be set to 7 days), the record is migrated from the warm index area to the cold index area. Conversely, if the access frequency count of a record in the warm or cold index area exceeds the preset hotspot threshold, the record is promoted to the hot index area.
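The migration rules above can be illustrated with the following minimal sketch. The names `Rec` and `next_tier` are hypothetical, the hotspot threshold value is an assumption (the text gives none), and the order in which the rules are tested (demotion before promotion) is likewise an assumption, since the text does not state a precedence.

```python
from dataclasses import dataclass

COOLING_THRESHOLD_S = 24 * 3600      # 24 hours, as in this embodiment
ARCHIVE_THRESHOLD_S = 7 * 24 * 3600  # 7 days (T_archive), as in this embodiment
HOTSPOT_THRESHOLD = 100              # assumed value, not given in the text


@dataclass
class Rec:
    tier: str                # "hot", "warm", or "cold"
    access_freq: int         # cumulative hit count
    last_access_time: float  # Unix timestamp of the most recent hit


def next_tier(rec, now):
    """Return the index area the record should live in after this check."""
    idle = now - rec.last_access_time
    if rec.tier == "hot" and idle > COOLING_THRESHOLD_S:
        return "warm"   # cooled down: release DRAM
    if rec.tier == "warm" and idle > ARCHIVE_THRESHOLD_S:
        return "cold"   # archived to HDD
    if rec.tier in ("warm", "cold") and rec.access_freq > HOTSPOT_THRESHOLD:
        return "hot"    # promoted back into DRAM
    return rec.tier
```

For example, `next_tier(Rec("hot", 5, 0.0), now=25 * 3600)` yields `"warm"`, because the record has been idle for more than 24 hours.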

[0055] It should be noted that this step adopts a three-level hierarchical index structure, combined with a dynamic migration strategy based on access frequency and time decay. This ensures that only the fingerprints of truly hot, frequently accessed blocks need to reside in DRAM, while the fingerprints of medium-frequency and low-frequency blocks are stored on the SSD and HDD respectively. This achieves a reasonable balance between query latency and memory usage, allowing the system to keep DRAM usage within a controllable range even when the storage scale increases significantly.

[0056] As an optional implementation, the processor can also maintain a lightweight Bloom filter preflight layer in the hot index area, specifically for the first-level fast hash lookup in step S2. The Bloom filter preflight layer only stores the 64-bit fast hash values of all records in the current hot index area, and its memory footprint is much smaller than that of the exact index table, which can further accelerate the first-level "not found" judgment and avoid invalid access to the exact index table.
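A Bloom filter preflight layer of this kind could be sketched as follows. The class name `BloomFilter`, the bit array size, the number of hash functions, and the use of SHA-256 with double hashing to derive bit positions are all assumptions made for illustration; the text does not specify them.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter keyed by 64-bit fast hash values."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)
        self.count = 0  # number of inserted elements

    def _positions(self, fast_hash):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(fast_hash.to_bytes(8, "little")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, fast_hash):
        for p in self._positions(fast_hash):
            self.bits[p >> 3] |= 1 << (p & 7)
        self.count += 1

    def might_contain(self, fast_hash):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(fast_hash))
```

A `False` result allows the first level to conclude "not found" without touching the exact index table; a `True` result still requires the strong hash lookup.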

[0057] S4. After all file blocks have been processed, read the final storage mapping records of each block output in step S3, construct a complete file index structure, and persist the file index structure to the metadata storage area to complete the file deduplication storage process; for file deletion requests, perform delayed reclamation of storage space according to the reference counting safe reclamation process.

[0058] Specifically, after the output queue of step S3 has received the final storage mapping records of all blocks of the file to be stored, the processor sorts all records in the queue (offset position of the file block in the original file, physical address of the file block, byte length of the file block, strong hash value of the file block) in ascending order according to the offset position of the file block in the original file, and constructs a file block mapping table. This mapping table completely describes the correspondence between each logical block of the file and the physical storage address.

[0059] Furthermore, the processor constructs a file metadata record, which includes: a unique file identifier (generated by hashing the file path and creation timestamp), the total number of bytes in the file, the total number of file blocks, the storage address of the file block mapping table in the metadata storage area, and a file-level strong hash value (obtained by concatenating the strong hash values of each block in order and then recalculating SHA-256, used for file integrity verification).
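The construction of the file block mapping table and the file metadata record can be sketched as below. The function names are hypothetical, the identifier is derived with SHA-256 over an assumed `path|timestamp` string, and the block hashes are assumed to be concatenated as raw bytes before the final SHA-256; the text fixes none of these details. For brevity the sketch embeds the mapping table in the record rather than storing its address.

```python
import hashlib


def file_level_hash(block_hashes_hex):
    """SHA-256 over the block strong hashes concatenated in block order."""
    h = hashlib.sha256()
    for hx in block_hashes_hex:
        h.update(bytes.fromhex(hx))
    return h.hexdigest()


def build_file_metadata(path, created_ts, mapping_records):
    """mapping_records: iterable of (offset, addr, length, strong_hash_hex)."""
    # Sort by offset so the table describes the file front to back.
    block_map = sorted(mapping_records, key=lambda r: r[0])
    return {
        "file_id": hashlib.sha256(f"{path}|{created_ts}".encode()).hexdigest(),
        "total_bytes": sum(r[2] for r in block_map),
        "block_count": len(block_map),
        "block_map": block_map,
        "file_hash": file_level_hash(r[3] for r in block_map),
    }
```

Because the hashes are concatenated in offset order, reordering two blocks of a file changes the file-level hash even when the set of blocks is the same.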

[0060] The aforementioned file metadata records are written to the file metadata index table in the metadata storage area, and an index is created using the file's unique identifier as the primary key.

[0061] At this point, the deduplication and storage process for this file is complete.

[0062] For a file deletion request, after receiving the deletion instruction containing the file's unique identifier, the processor reads the file metadata record from the metadata storage area, and then reads all records in the file block mapping table (the offset position of the file block in the original file, the physical address of the file block, the byte length of the file block, and the strong hash value of the file block). For each record, the processor directly retrieves the strong hash value of the file block, looks up the corresponding record in the exact index table, and performs an atomic decrement operation (ref_count--) on its reference count field.

[0063] The processor checks the reference count after decrementing by one: if the reference count is greater than 0, the physical block is still referenced by other files, and the processor does not perform any physical deletion operation; If the reference count is 0, the processor appends the physical address of the file block to the delayed reclamation queue and marks the corresponding strong hash value record in the exact index table as "pending reclamation" instead of immediately deleting it from the storage medium.

[0064] When the system I/O load is lower than a preset load threshold (in this embodiment, the preset load threshold can be set to 30%), the background garbage collection process reads the physical addresses in the delayed reclamation queue in batches, performs the actual deletion of physical blocks and the reclamation of storage space, deletes the corresponding "pending reclamation" records from the exact index table, and updates the Bloom filter preflight layer.
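The deletion path and the delayed reclamation described above might look like the following sketch. All names are hypothetical; index records are plain dictionaries; the queue stores the strong hash alongside the physical address purely as a convenience for the lookup; and the guard that skips a block whose count is no longer zero is an added assumption.

```python
from collections import deque

LOAD_THRESHOLD = 0.30  # 30% I/O load, as in this embodiment


def delete_file(block_map, exact_index, reclaim_queue):
    """Decrement the reference count of every block the file references."""
    for _offset, addr, _length, strong_hash in block_map:
        rec = exact_index[strong_hash]
        rec["ref_count"] -= 1
        if rec["ref_count"] == 0:
            # No physical deletion yet: mark and enqueue for later.
            rec["state"] = "pending_reclamation"
            reclaim_queue.append((addr, strong_hash))


def run_gc(exact_index, reclaim_queue, io_load, free_block, batch_size=64):
    """Reclaim up to batch_size blocks, but only while the system is idle."""
    if io_load >= LOAD_THRESHOLD:
        return 0
    reclaimed = 0
    while reclaim_queue and reclaimed < batch_size:
        addr, strong_hash = reclaim_queue.popleft()
        rec = exact_index.get(strong_hash)
        # Assumed guard: only reclaim blocks that are still unreferenced.
        if rec is None or rec["ref_count"] != 0:
            continue
        free_block(addr)              # hypothetical storage-layer callback
        del exact_index[strong_hash]
        reclaimed += 1
    return reclaimed
```

Here `free_block` stands for whatever call the storage layer exposes to release a physical block; it is supplied by the caller.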

[0065] As an optional approach in this embodiment, the Bloom filter preflight layer can be updated through periodic reconstruction, which works around the fact that a Bloom filter does not support deletion operations.

[0066] Each reference count change operation is guaranteed to be atomic through a write-ahead log (WAL) mechanism. Before executing the reference count modification, the processor first appends the record of the operation (strong hash value, reference count change, operation time) to the WAL file and forces it to be written to disk, and then executes the reference count update of the exact index table in memory.

[0067] It should be noted that the reference count change is a signed delta in the WAL entry that describes whether the operation increases or decreases the reference count of a physical block: +1 for an increment and -1 for a decrement. It is the core data carrier through which the WAL mechanism ensures the atomicity of reference count changes and recoverability after a crash. The reference count change records the operation itself rather than the resulting state, which allows the WAL's replay logic to maintain correctness under both concurrent and exceptional scenarios.

[0068] When the system crashes and restarts, the recovery process reads the WAL file and re-executes any incomplete reference count changes to ensure the consistency of reference counts. This avoids both orphaned blocks (physical blocks whose reference counts have reached zero but which have not been reclaimed, permanently occupying storage space) and accidental deletion (physical blocks whose reference counts have not reached zero but which have been reclaimed by mistake) caused by system anomalies.
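A minimal sketch of the log-then-apply order and of crash recovery is given below. The class and function names, the JSON-lines file format, the sequence numbers, and the notion of a checkpoint (a persisted copy of the counts together with the last sequence number it reflects) are all assumptions introduced so that replay can tell which entries are still incomplete; the text does not describe how this is determined. The sketch does not handle a torn final log entry.

```python
import json
import os


class RefCountWAL:
    """Append-only log of signed reference count deltas."""

    def __init__(self, path):
        self.path = path
        self.last_seq = 0
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.last_seq = sum(1 for _ in f)

    def append(self, strong_hash, delta, op_time):
        self.last_seq += 1
        entry = {"seq": self.last_seq, "hash": strong_hash,
                 "delta": delta, "time": op_time}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the entry to disk before applying
        return self.last_seq


def apply_change(wal, ref_counts, strong_hash, delta, op_time):
    """Log first, then update the in-memory reference count."""
    seq = wal.append(strong_hash, delta, op_time)
    ref_counts[strong_hash] = ref_counts.get(strong_hash, 0) + delta
    return seq


def recover(path, checkpoint_counts, checkpoint_seq):
    """Replay every logged delta newer than the checkpoint."""
    counts = dict(checkpoint_counts)
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if entry["seq"] > checkpoint_seq:
                key = entry["hash"]
                counts[key] = counts.get(key, 0) + entry["delta"]
    return counts
```

Because each entry records a delta rather than a final value, replaying an entry twice would corrupt the count, which is why this sketch filters on the sequence number.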

[0069] It's important to note that this step employs a delayed reclamation design rather than immediate deletion. This avoids high-frequency random write operations to the storage medium during file deletion (immediate deletion generates numerous discrete small write requests in batch deletion scenarios, which is particularly detrimental to HDDs). Furthermore, by concentrating reclamation operations during off-peak hours, it effectively reduces interference with normal read / write operations. The WAL (Write-Ahead Logging) mechanism ensures the transactional semantics of reference count changes from a persistence perspective, serving as the fundamental guarantee for the reliability of the entire secure reclamation process.

[0070] S5. Periodically perform consistency self-checks and Bloom filter reconstruction of the hash index, output system health status reports, and trigger corresponding repair signals based on the report conclusions.

[0071] Specifically, the processor starts a background self-check task according to a preset self-check cycle (in this embodiment, the preset self-check cycle can be set to once every 24 hours). This task performs the following operations in sequence: Exact index table and physical block consistency check: The processor iterates through all records in the exact index table. For each record, it reads the physical address stored in the record, reads the byte sequence of the corresponding physical block from the storage medium, recalculates the strong hash value, and compares the calculation result with the strong hash value key stored in the exact index table.

[0072] If the two match, the record is considered to be in a healthy state. If the two are inconsistent, the record is determined to be in an abnormal state. The processor generates an abnormal signal, writes the record (strong hash value, physical address stored in the record, time of fault discovery) into the fault entry log, and marks the record as "to be repaired".
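The consistency check can be sketched as follows, assuming SHA-256 as the strong hash and dictionary-based index records; `self_check` and the `read_block` callback are hypothetical names.

```python
import hashlib


def self_check(exact_index, read_block, now):
    """Re-hash every stored block and report records whose key no longer matches.

    exact_index: strong hash (hex) -> {"addr", "length", "state"}
    read_block:  hypothetical callback (addr, length) -> bytes
    """
    fault_log = []
    for strong_hash, rec in exact_index.items():
        data = read_block(rec["addr"], rec["length"])
        if hashlib.sha256(data).hexdigest() != strong_hash:
            rec["state"] = "to_be_repaired"
            fault_log.append((strong_hash, rec["addr"], now))
    return fault_log
```

The length of the returned fault log corresponds to the number of fault entries reported for the cycle.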

[0073] Bloom filter preflight layer reconstruction: The processor initializes a new empty Bloom filter, traverses all currently valid records (reference count greater than 0 and status not "pending reclamation") in the hot index area, extracts their 64-bit fast hash values, and inserts them into the new Bloom filter in sequence.

[0074] After reconstruction, the processor atomically replaces the original Bloom filter preflight layer with the reconstructed Bloom filter and releases the memory occupied by the old one. This replacement operation is protected by a read-write lock to prevent concurrent queries in step S2 from observing an intermediate state of the Bloom filter during reconstruction.
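The rebuild-then-swap pattern can be sketched as below. The class name `PreflightLayer` is hypothetical. Python's standard library has no read-write lock, so a plain `threading.Lock` is substituted here, and any object offering `add` and `in` (a `set` in the usage example) stands in for the Bloom filter.

```python
import threading


class PreflightLayer:
    """Holds the live filter and swaps in a rebuilt one under a lock."""

    def __init__(self, make_filter):
        self._make_filter = make_filter  # factory for an empty filter
        self._filter = make_filter()
        self._lock = threading.Lock()    # stand-in for a read-write lock

    def add(self, fast_hash):
        with self._lock:
            self._filter.add(fast_hash)

    def might_contain(self, fast_hash):
        with self._lock:
            return fast_hash in self._filter

    def rebuild(self, hot_records):
        # Build the new filter outside the lock so queries are not blocked.
        fresh = self._make_filter()
        for rec in hot_records:
            if rec["ref_count"] > 0 and rec["state"] != "pending_reclamation":
                fresh.add(rec["fast_hash"])
        # The swap itself is a single reference assignment under the lock;
        # the old filter becomes garbage once no reader holds it.
        with self._lock:
            self._filter = fresh
```

Queries therefore see either the old filter or the fully built new one, never a partially populated filter.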

[0075] System health status report generation: The processor compiles the following metrics for this self-check cycle: number of fault entries, total number of blocks pending reclamation, estimated Bloom filter false positive rate (calculated from the number of elements already inserted in the current Bloom filter preflight layer and the bit array length), and the distribution ratio of records across the hot, warm, and cold index areas.

[0076] The processor writes the above metrics into the system health status report and outputs it to the monitoring interface. The processor then triggers corresponding repair signals based on the content of the system health status report. If the number of fault entries is greater than 0, a repair signal is generated, triggering the data integrity repair process (such as restoring the corresponding physical block from the backup copy and updating the exact index table). If the estimated false positive rate of the Bloom filter exceeds the preset false positive rate limit (in this embodiment of the invention, the preset false positive rate limit can be set to 1%), an expansion signal is generated to trigger the bit array expansion and reconstruction process of the Bloom filter.
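The signal logic above can be illustrated with the following sketch. The function names and signal strings are hypothetical. The estimate uses the standard Bloom filter approximation (1 - e^(-kn/m))^k for n inserted elements, m bits, and k hash functions; the text mentions only n and m, so treating k as a known parameter is an assumption.

```python
import math

FALSE_POSITIVE_LIMIT = 0.01  # 1%, as in this embodiment


def estimated_fpr(n_inserted, num_bits, num_hashes):
    """Standard approximation of the Bloom filter false positive rate."""
    return (1.0 - math.exp(-num_hashes * n_inserted / num_bits)) ** num_hashes


def repair_signals(fault_count, n_inserted, num_bits, num_hashes):
    """Map health report metrics to the signals they trigger."""
    signals = []
    if fault_count > 0:
        signals.append("REPAIR")
    if estimated_fpr(n_inserted, num_bits, num_hashes) > FALSE_POSITIVE_LIMIT:
        signals.append("EXPAND")
    return signals
```

With 1000 elements in a 1024-bit filter the estimate is far above 1%, so an expansion signal would be raised; with 100 elements in 10000 bits it is far below.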

[0077] It should be noted that during the long-term operation of hash indexes, potential risks affecting the reliability of the deduplication system include silent data corruption in the storage medium, state drift between index records and physical blocks, and increased false positive rates in the Bloom filter due to continuous insertion. Step S5 brings these hidden risks into an observable and manageable scope through periodic proactive self-checks and the repair signal mechanism, ensuring the data integrity and index health of the entire deduplication storage system during long-term operation.

[0078] It should be noted that the steps in the method of the embodiments of the present invention can be adjusted, combined, or deleted according to actual needs.

[0079] It should be understood that the terms “comprising” and “including” used in this disclosure and claims indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.

[0080] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. As used in this disclosure and claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” as used in this disclosure and claims means any combination and all possible combinations of one or more of the associated listed items, and includes such combinations. The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to any specific implementation. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A file deduplication storage method based on a hash algorithm, characterized in that it comprises the following steps:
performing file type identification on the file stream of the file to be stored, reading the corresponding block parameter group from the file type feature table based on the file type, performing content-based variable-length block segmentation on the file stream, and outputting the file block sequence;
performing hierarchical hash verification sequentially on each file block in the file block sequence to generate a verification conclusion signal, and writing the verification conclusion signal into the verification conclusion queue;
based on the signal type of each entry in the verification conclusion queue, performing physical writing and updating the hash index for new block signal entries and collision signal entries, performing deduplication and updating the reference count for duplicate block signal entries, and outputting the storage mapping record;
based on the storage mapping record, constructing a file block mapping table and a file metadata record and persistently writing them to the metadata storage area; for a file deletion request, decrementing the reference count of each file block by one by traversing the file block mapping table, and, when the reference count reaches zero, writing the corresponding physical address to the delayed reclamation queue;
periodically performing consistency self-checks and Bloom filter reconstruction of the hash index, outputting system health status reports, and triggering repair or expansion signals based on the report content.

2. The file deduplication storage method based on a hash algorithm according to claim 1, characterized in that the block parameter group includes the minimum block length lower limit of the sliding window, the maximum block length upper limit, and the boundary trigger modulus of the rolling hash; the content-based variable-length block segmentation specifically involves: the block processing unit slides over the file stream byte by byte with a sliding window of a preset byte length, continuously calculates the Rabin rolling fingerprint value, and determines whether a boundary condition is met, the boundary condition being that the Rabin rolling fingerprint value modulo the boundary trigger modulus is equal to the preset boundary trigger value and the number of bytes processed is greater than the minimum block length lower limit; if the boundary condition is met, the current position is marked as a block boundary; if the number of bytes processed consecutively has reached the maximum block length upper limit without the boundary condition being met, forced truncation is performed.

3. The file deduplication storage method based on a hash algorithm according to claim 2, characterized in that the file type feature table stores different block parameter groups for different file types; for text files, the minimum block length is 2KB and the maximum block length is 16KB; for video files, the minimum block length is 64KB and the maximum block length is 1MB; for database backup files, the minimum block length is 8KB and the maximum block length is 128KB.

4. The file deduplication storage method based on a hash algorithm according to claim 1, characterized in that the hierarchical hash verification includes three levels: the first level involves calculating a fast hash value for the file block using a fast hash function and querying a Bloom filter; if the Bloom filter returns a non-existent signal, a new block signal is generated; if it returns a present signal, the process proceeds to the second level; the second level involves calculating a strong hash value using a strong hash function and querying the exact index table; if the exact index table does not contain a record corresponding to the strong hash value, a new block signal is generated; if it does, the process proceeds to the third level; the third level involves reading the physical address of the stored original block from the exact index table, and performing byte-level sampling comparison between the file block and the original block at a preset sampling interval; if the sampled bytes are all consistent, a duplicate block signal is generated; otherwise, a collision signal is generated.

5. The file deduplication storage method based on a hash algorithm according to claim 4, characterized in that, The processor dynamically adjusts the verification level based on the file's security level identifier field. For files with a high security level, the processor further calculates a strong hash value and generates a new block signal when the Bloom filter returns a non-existent signal, and forces the second and third levels of verification to be enabled. For files with a low security level, the processor directly generates a new block signal when the Bloom filter returns a non-existent signal, and only performs the second level of verification and skips the third level when it returns a present signal.

6. The file deduplication storage method based on hash algorithm according to claim 4, characterized in that, The hash index is divided into a hot index area, a warm index area, and a cold index area; the hot index area is stored in DRAM, the warm index area is stored in NVMe SSD, and the cold index area is stored in HDD; each record in the precise index table uses a strong hash value as the key and the physical address of the file block, the byte length of the file block, the reference count, the access frequency count, and the access timestamp as the value. After each verification conclusion signal is processed, inter-layer migration is performed on the record based on the access timestamp and the access frequency count.

7. The file deduplication storage method based on a hash algorithm according to claim 6, characterized in that, The rule for inter-layer migration is as follows: if the access timestamp of a record in the hot index area is more than a preset cooling threshold from the current time, then the record will be migrated to the warm index area. If the access timestamp of a record in the warm index area exceeds a preset archiving threshold from the current time, the record will be migrated to the cold index area; if the access frequency count of a record in the warm or cold index area exceeds a preset hotspot threshold, the record will be promoted to the hot index area.

8. The file deduplication storage method based on hash algorithm according to claim 4, characterized in that, Each record in the file block mapping table includes the offset position of the file block, the physical address of the file block, the byte length of the file block, and the strong hash value of the file block; the file metadata record includes the file's unique identifier, the total number of bytes in the file, the total number of file blocks, the storage address of the file block mapping table, and the file-level strong hash value; The file-level strong hash value is generated by concatenating the strong hash values of each file block in order and then performing a hash operation again. For a file deletion request, the processor directly retrieves the strong hash value from each record in the file block mapping table, looks up the corresponding record in the exact index table, and decrements the reference count by one. If the reference count is decremented by one and equals zero, the physical address of the corresponding file block is written to the delayed reclamation queue and the record in the exact index table is marked as pending reclamation.

9. The file deduplication storage method based on a hash algorithm according to claim 1, characterized in that, before each reference count change is executed, the processor appends a log entry containing a strong hash value, the reference count change, and the operation time to the write-ahead log file and forces it to disk, and then performs the reference count update for the exact index table; the reference count change is +1 when the reference count increases and -1 when the reference count decreases; when the system crashes and restarts, the recovery process reads the write-ahead log file and re-executes any incomplete reference count change operations; physical blocks with reference counts of zero are physically deleted in batches by the background garbage collection process when the system I/O load is below a preset load threshold.

10. The file deduplication storage method based on a hash algorithm according to claim 1, characterized in that the consistency self-check specifically involves: the processor traversing all records in the exact index table, reading the byte sequence of the corresponding physical block from the storage medium for each record and recalculating the strong hash value, comparing the calculation result with the strong hash value key stored in the exact index table, and generating an abnormal signal and marking the record as pending repair if they are inconsistent; the Bloom filter reconstruction specifically involves: the processor initializing a new Bloom filter, traversing all records in the hot index area with a reference count greater than zero and a status not pending reclamation, extracting the fast hash value of each and inserting it sequentially, and, after reconstruction, atomically replacing the original Bloom filter with the reconstructed Bloom filter under the protection of a read-write lock; if the number of fault entries is greater than zero, a repair signal is generated; if the estimated false positive rate of the Bloom filter exceeds the preset false positive rate limit, an expansion signal is generated.