A data processing method, device, apparatus, and computer-readable storage medium
By adaptively adjusting the deduplication granularity range and optimizing the data writing method, the invention solves the problem of low data deduplication efficiency, achieving efficient utilization of storage space and stable data processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INSPUR SUZHOU INTELLIGENT TECH CO LTD
- Filing Date
- 2024-11-27
- Publication Date
- 2026-06-23
AI Technical Summary
In existing technologies, data deduplication is inefficient, resulting in large storage space consumption, and the fixed deduplication granularity setting cannot achieve the optimal state.
The method obtains a deduplication granularity range, calculates the deduplication ratio for granularities within that range, and adjusts the range according to the ratio to obtain the optimal deduplication granularity range. Duplicate data in the storage space is then deleted according to the optimal deduplication granularity, and data writing is optimized by combining data fingerprints with a metadata management system.
It significantly improves the deduplication ratio, saves storage space, optimizes deduplication performance, improves data storage and transmission efficiency, and ensures the stability and accuracy of data processing.
Figure CN119597220B_ABST
Description
Technical Field
[0001] This invention relates to the field of data processing, and in particular to a data processing method, apparatus, device, and computer-readable storage medium.
Background Technology
[0002] In practical applications, deduplication technology typically employs a fixed granularity setting strategy. This single, fixed granularity setting method often fails to achieve optimal deduplication efficiency, resulting in large disk space usage due to the large amount of data being written to disk.
[0003] Therefore, improving the efficiency of data deduplication in storage space is an urgent problem to be solved.
Summary of the Invention
[0004] In view of this, the purpose of the present invention is to provide a data processing method, apparatus, device and computer-readable storage medium, which solves the problem of low data deduplication efficiency in the prior art.
[0005] To solve the above-mentioned technical problems, the present invention provides a data processing method, comprising:
[0006] Get the deduplication granularity range;
[0007] Calculate the deduplication ratio corresponding to the deduplication granularity within the deduplication granularity range, and adjust the deduplication granularity range according to the deduplication ratio to obtain the optimal deduplication granularity range;
[0008] The optimal deduplication granularity is determined from the optimal deduplication granularity range, and duplicate data in the storage space is deleted according to the optimal deduplication granularity.
[0009] In one aspect, obtaining the deduplication granularity range includes:
[0010] The minimum and maximum deduplication granularity are determined based on the current write granularity; the minimum deduplication granularity is smaller than the current write granularity.
[0011] The deduplication granularity range is determined based on the minimum deduplication granularity and the maximum deduplication granularity.
[0012] In one aspect, the method further includes:
[0013] Obtain the data fingerprint of the data written to the storage space, and cache the data fingerprint in the storage space; the data fingerprint is set according to the data segment structure;
[0014] The remaining data after deletion in the storage space is written to the hard disk according to the data fingerprint and metadata management system.
[0015] In one aspect, writing the data remaining in the storage space after deletion to the hard disk based on the data fingerprint and the metadata management system includes:
[0016] When the hard drive is empty, the remaining data is written to the hard drive according to the data fingerprint;
[0017] When the hard disk is not empty, the remaining data is compared with the data in the hard disk based on the data fingerprint to determine the same data and different data in the remaining data;
[0018] The different data are written to the hard disk.
[0019] In one aspect, writing the data remaining in the storage space after deletion to the hard disk based on the data fingerprint and the metadata management system includes:
[0020] Obtain the number of writes, and divide the remaining data into hot data and cold data based on the number of writes;
[0021] The hot data is written to the center of the disk according to the data fingerprint, and the cold data is written to the edge of the disk according to the data fingerprint.
[0022] In one aspect, calculating the deduplication ratio corresponding to the deduplication granularity within the deduplication granularity range, and adjusting the deduplication granularity range according to the deduplication ratio to obtain the optimal deduplication granularity range, includes:
[0023] Calculate the deduplication ratios corresponding to the minimum, maximum, and intermediate deduplication granularities within the deduplication granularity range, respectively.
[0024] Update the deduplication granularity range according to the deduplication ratio;
[0025] Repeat the steps of calculating the deduplication ratio at the minimum, maximum, and intermediate deduplication granularity in the deduplication granularity interval and updating the deduplication granularity interval according to the deduplication ratio, until the difference between the maximum and minimum deduplication granularity of the updated deduplication granularity interval meets a preset threshold, then stop updating to obtain the optimal deduplication granularity interval.
[0026] The step of updating the deduplication granularity range according to the deduplication ratio includes:
[0027] The deduplication granularity corresponding to the calculated maximum deduplication ratio is taken as the maximum deduplication granularity of the deduplication granularity interval, and the deduplication granularity corresponding to the calculated second largest deduplication ratio is taken as the minimum deduplication granularity of the deduplication granularity interval.
[0028] In one aspect, determining the optimal deduplication granularity from the optimal deduplication granularity range includes:
[0029] The median value of the optimal deduplication granularity range is taken as the optimal deduplication granularity.
[0030] The present invention also provides a data processing apparatus, comprising:
[0031] The deduplication ratio interval acquisition module is used to obtain the deduplication granularity interval;
[0032] The optimal deduplication granularity interval determination module is used to calculate the deduplication ratio corresponding to the deduplication granularity within the deduplication granularity interval, and adjust the deduplication granularity interval according to the deduplication ratio to obtain the optimal deduplication granularity interval.
[0033] The deduplication module is used to determine the optimal deduplication granularity from the optimal deduplication granularity range, and delete duplicate data in the storage space according to the optimal deduplication granularity.
[0034] The present invention also provides a data processing device, comprising:
[0035] Memory, used to store computer programs;
[0036] A processor for executing the computer program to implement the steps of the data processing method described above.
[0037] The present invention also provides a computer-readable storage medium storing computer-executable instructions, which, when loaded and executed by a processor, implement the steps of the data processing method described above.
[0038] The present invention also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the data processing method described above.
[0039] As can be seen from the above technical solution, this invention obtains the deduplication granularity range; calculates the deduplication ratio corresponding to the deduplication granularity within the range, and adjusts the range based on the ratio to obtain the optimal range; determines the optimal granularity from the optimal range, and deletes duplicate data in the storage space based on the optimal granularity. The beneficial effects of this invention are: by adaptively adjusting the range based on the deduplication ratio to obtain the optimal range, and then further determining the optimal granularity from this range, data deduplication is performed using the optimal granularity. This flexible and accurate method of determining the deduplication granularity not only ensures the normal operation of deduplication but also significantly improves the deduplication ratio, saves storage space, and further optimizes deduplication performance.
[0040] In addition, the present invention also provides a data processing apparatus, device, and computer-readable storage medium, which also have the above-mentioned beneficial effects.
Attached Figure Description
[0041] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0042] Figure 1 is a flowchart of a data processing method provided in an embodiment of the present invention;
[0043] Figure 2 is a flowchart of an iterative optimization process provided in an embodiment of the present invention;
[0044] Figure 3 is a flowchart of a data processing method provided in an embodiment of the present invention;
[0045] Figure 4 is a flowchart of a process of writing data to disk provided in an embodiment of the present invention;
[0046] Figure 5 is a schematic structural diagram of a data processing apparatus provided in an embodiment of the present invention;
[0047] Figure 6 is a schematic structural diagram of a data processing device provided in an embodiment of the present invention.
Detailed Implementation
[0048] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0049] First, some of the terms used in this application are explained:
[0050] Deduplication granularity: refers to the smallest unit size used to identify and compare data blocks during the deduplication process.
[0051] Data fingerprint: A fixed-length, unique numerical value or string generated by processing raw data using a specific algorithm, used to ensure the integrity and uniqueness of information.
[0052] Metadata: Data used to describe data, providing various attributes, history, source, version and other information about the data.
[0053] In the storage field, deduplication technology has an extremely wide range of applications, and its importance is self-evident. This technology reduces the space required for data storage and lowers bandwidth consumption by identifying and eliminating redundant parts of the data, thereby effectively improving storage and data transmission efficiency. Data backup is one of the most common application scenarios for deduplication technology. During backup, data mainly undergoes sequential read and write processes. Because data backup rarely involves random data writing, deduplication technology can achieve its maximum efficiency in this scenario. Through data deduplication, redundant data can be quickly identified and removed, greatly simplifying the complexity of data storage and facilitating operations such as repeated backups. Furthermore, with the surge in data volume and the continuous expansion of storage demands, the demand for data transmission bandwidth also increases. In this context, efficient deduplication technology becomes particularly important. Therefore, efficient and accurate deduplication technology can not only significantly save storage space, enabling the system to back up more data and increase data storage capacity, but also effectively reduce bandwidth consumption and improve data transmission efficiency.
[0054] In practical applications, deduplication technology typically employs a fixed deduplication granularity. For example, most storage device manufacturers set a fixed deduplication granularity for users when initially creating stored data. However, due to constraints from multiple factors such as manufacturing processes, the degree of software algorithm optimization, and the actual data deduplication rate, this single, fixed granularity setting often fails to achieve optimal deduplication efficiency. Furthermore, this fixed deduplication granularity setting may lead to data output offsets and instability, thereby affecting the accuracy of test results.
[0055] To address the aforementioned problems, this invention provides a data processing method that can efficiently and accurately determine the optimal deduplication granularity. It should be noted that this method is applicable to devices that use deduplication functionality and involve storage arrays, such as storage units and servers. Referring to Figure 1, which is a flowchart of a data processing method provided in an embodiment of the present invention, the method may include:
[0056] S101: Get the deduplication granularity range.
[0057] The execution subject in this embodiment is a terminal. This embodiment does not limit the type of terminal, as long as it can perform the data processing method operations. It should be noted that the deduplication granularity range obtained in step S101 is the initial deduplication granularity range. This embodiment does not limit the specific method for determining the initial deduplication granularity range. For example, the initial deduplication granularity range can be determined based on front-end business data; or it can be determined based on the written data situation.
[0058] Furthermore, to ensure the accuracy of the initial deduplication granularity interval determination and improve the efficiency of subsequent optimal deduplication granularity interval determination, the above-mentioned method for obtaining the deduplication granularity interval may specifically include the following steps:
[0059] Step 11: Determine the minimum and maximum deduplication granularity based on the current write granularity; the minimum deduplication granularity is smaller than the current write granularity.
[0060] Step 12: Determine the deduplication granularity range based on the minimum and maximum deduplication granularity.
[0061] Specifically, in this embodiment, when writing data to the storage space, the write granularity at the time of data writing can be obtained, and the deduplication granularity range can be determined based on the current write granularity. For example, when the data write granularity is 8K, the minimum deduplication granularity of the initial deduplication granularity range can be set to 4K, and the maximum deduplication granularity can be set to 128K. It should be noted that, in order to improve the accuracy of deduplication, the deduplication granularity range should include both deduplication granularities larger than the write granularity and deduplication granularities smaller than the write granularity. Therefore, the minimum deduplication granularity of the deduplication granularity range should be smaller than the write granularity. When the write granularity changes, the deduplication granularity range determined based on the write granularity will also change accordingly.
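Steps 11 and 12 can be illustrated with a minimal sketch. The function name `initial_interval` and the scaling factors `shrink` and `grow` are hypothetical: they are chosen only so that the 8K example above yields 4K and 128K, and the embodiment does not prescribe how the bounds are derived from the write granularity.

```python
from typing import Tuple

KIB = 1024  # bytes per "K" in the example above


def initial_interval(write_granularity: int,
                     shrink: int = 2,
                     grow: int = 16) -> Tuple[int, int]:
    """Return (min_granularity, max_granularity) in bytes.

    `shrink` and `grow` are illustrative assumptions, not values from the text.
    The only constraint taken from the text is min < write_granularity.
    """
    if shrink < 2 or grow < 1:
        raise ValueError("shrink must be >= 2 and grow must be >= 1")
    if write_granularity < shrink:
        raise ValueError("write granularity too small to shrink")
    minimum = write_granularity // shrink   # strictly below the write granularity
    maximum = write_granularity * grow      # at or above the write granularity
    return minimum, maximum


print(initial_interval(8 * KIB))  # (4096, 131072), i.e. 4K and 128K
```

When the write granularity changes, calling the function again yields a correspondingly shifted interval.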
[0062] S102: Calculate the deduplication ratio corresponding to the deduplication granularity within the deduplication granularity range, and adjust the deduplication granularity range according to the deduplication ratio to obtain the optimal deduplication granularity range.
[0063] In this embodiment, the deduplication ratio corresponding to each deduplication granularity is calculated based on the deduplication granularity interval obtained in step S101, and the deduplication granularity interval is adjusted according to the deduplication ratio to determine the optimal deduplication granularity interval. It can be understood that the deduplication ratio in this embodiment is the ratio of the amount of duplicate data deleted to the total amount of original data in the deduplication technology. This embodiment can calculate the deduplication ratio corresponding to each deduplication granularity within the deduplication granularity interval; it can also calculate the deduplication ratio corresponding to several deduplication granularities within the deduplication granularity interval.
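The deduplication ratio defined above (duplicate data deleted divided by total original data) can be sketched as follows. This is only an illustration under stated assumptions: the data is split into fixed-size blocks of the candidate granularity, and two blocks are treated as duplicates when their SHA-256 digests match. Neither the chunking scheme nor the hash is specified by the embodiment, and `dedup_ratio` is a hypothetical name.

```python
import hashlib


def dedup_ratio(data: bytes, granularity: int) -> float:
    """Fraction of bytes removed when identical blocks are stored only once."""
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    if not data:
        return 0.0
    seen = set()
    duplicate_bytes = 0
    for offset in range(0, len(data), granularity):
        block = data[offset:offset + granularity]
        digest = hashlib.sha256(block).digest()  # assumed block identity check
        if digest in seen:
            duplicate_bytes += len(block)        # this block would be deleted
        else:
            seen.add(digest)
    return duplicate_bytes / len(data)


sample = b"ABCD" * 4 + b"WXYZ" * 4   # 32 bytes with 4-byte repetition
print(dedup_ratio(sample, 4))    # 0.75
print(dedup_ratio(sample, 8))    # 0.5
print(dedup_ratio(sample, 16))   # 0.0
```

The sample shows why the ratio depends on granularity: the same 32 bytes deduplicate well at 4-byte blocks and not at all at 16-byte blocks.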
[0064] Furthermore, to reduce computational resource consumption and improve efficiency, calculating the deduplication ratio corresponding to the deduplication granularity within the deduplication granularity range, and adjusting the range according to the deduplication ratio to obtain the optimal deduplication granularity range, may specifically include:
[0065] Step 21: Calculate the deduplication ratios corresponding to the smallest, largest, and intermediate deduplication granularities within the deduplication granularity range, respectively;
[0066] Step 22: Update the deduplication granularity range based on the deduplication ratio;
[0067] Repeat steps 21 and 22 above until the difference between the maximum and minimum deduplication granularity of the updated deduplication granularity interval meets the preset threshold, then stop updating to obtain the optimal deduplication granularity interval.
[0068] Specifically, step 22 above updates the deduplication granularity range based on the deduplication ratio, and may include the following steps:
[0069] The deduplication granularity corresponding to the calculated maximum deduplication ratio is taken as the maximum deduplication granularity of the deduplication granularity interval, and the deduplication granularity corresponding to the calculated second largest deduplication ratio is taken as the minimum deduplication granularity of the deduplication granularity interval.
[0070] To better understand the process of determining the optimal deduplication granularity interval, consider the following example. Given a deduplication granularity interval [a, b], calculate the deduplication ratios corresponding to a, b, and the midpoint m = (a + b) / 2, and compare them. If the deduplication ratio of m > the deduplication ratio of a > the deduplication ratio of b, update the interval to [a, m]. Then calculate the deduplication ratios corresponding to a, m, and m' = (a + m) / 2 and compare them in the same way. Repeat this process until the interval has narrowed to two close granularities c and d (d > c) whose difference meets the preset threshold. For example, with a threshold of 1K, iteration stops once d - c is less than 1K, giving the optimal deduplication granularity interval [c, d]. Compared with calculating the deduplication ratio of every granularity in the interval, this iterative update reduces computation, saves computational resources, and improves efficiency.
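The iteration in this example can be sketched as below. The sketch rests on several assumptions that are not in the text: granularities are plain numbers in K, `ratio_of` is a caller-supplied function returning the deduplication ratio for a granularity, the new interval is spanned by the two best-scoring candidates in ascending order, and a guard stops the loop if the two best candidates are the current endpoints (in which case the interval cannot shrink). The toy ratio function is synthetic.

```python
from typing import Callable, Tuple


def narrow_interval(a: float, b: float,
                    ratio_of: Callable[[float], float],
                    threshold: float = 1.0,
                    max_rounds: int = 64) -> Tuple[float, float]:
    """Shrink [a, b] around the granularities with the highest ratios."""
    for _ in range(max_rounds):
        if b - a < threshold:
            break                      # interval is narrow enough
        m = (a + b) / 2                # intermediate granularity
        ranked = sorted((a, m, b), key=ratio_of, reverse=True)
        best, second = ranked[0], ranked[1]
        new_a, new_b = min(best, second), max(best, second)
        if (new_a, new_b) == (a, b):
            break                      # assumed guard: no progress possible
        a, b = new_a, new_b
    return a, b


# Synthetic ratio curve peaking at 24K, for illustration only.
def toy_ratio(g: float) -> float:
    return 1.0 / (1.0 + abs(g - 24.0))


c, d = narrow_interval(4, 128, toy_ratio, threshold=1.0)
print(c, d)  # a sub-1K interval that contains 24
```

A three-point comparison of this kind is a heuristic: it finds the peak reliably only when the ratio varies smoothly with a single maximum over the interval.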
[0071] Furthermore, to improve computational efficiency and reduce computational resources, after step 21: calculating the deduplication ratios corresponding to the smallest, largest, and intermediate deduplication granularities within the deduplication granularity range, the following may also be included:
[0072] Determine whether the maximum deduplication ratio among the deduplication ratios corresponding to the minimum deduplication granularity, maximum deduplication granularity, and intermediate deduplication granularity in the deduplication granularity range reaches the preset deduplication ratio threshold;
[0073] If so, then the deduplication granularity corresponding to the maximum deduplication ratio shall be taken as the optimal deduplication granularity;
[0074] If not, continue with the step of updating the deduplication granularity range based on the deduplication ratio.
[0075] Referring to Figure 2, which is a flowchart of an iterative optimization process provided in an embodiment of the present invention: in this embodiment, a preset deduplication ratio threshold is introduced into the calculation of the deduplication ratio. The maximum deduplication ratio obtained for the current interval is compared with the preset threshold. If it reaches the threshold, the loop terminates early and the deduplication granularity corresponding to that maximum ratio is taken as the optimal deduplication granularity. If it is below the threshold, the interval is updated, the deduplication ratios of the new interval are calculated, and the maximum ratio in the new interval is again compared with the threshold, and so on. If no deduplication granularity yields a ratio that reaches the preset threshold during the iterative optimization, the optimal deduplication granularity is selected from the optimal deduplication granularity interval. This approach shortens the iterative optimization process, improves its efficiency, saves computational resources, and keeps the system operating efficiently and stably.
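The early-termination variant can be sketched as follows, again as an illustration rather than the embodiment's implementation. Assumptions: `ratio_of` is a caller-supplied ratio function, `ratio_target` stands for the preset deduplication ratio threshold, the interval update uses the two best candidates as in the example of paragraph [0070], and the fallback when the target is never reached is the midpoint of the final interval. All names and the toy ratio curve are hypothetical.

```python
from typing import Callable


def search_granularity(a: float, b: float,
                       ratio_of: Callable[[float], float],
                       ratio_target: float,
                       width_threshold: float = 1.0,
                       max_rounds: int = 64) -> float:
    """Return a granularity, exiting early once a ratio reaches the target."""
    for _ in range(max_rounds):
        m = (a + b) / 2
        ratios = {g: ratio_of(g) for g in (a, m, b)}
        best = max(ratios, key=ratios.get)
        if ratios[best] >= ratio_target:
            return best                    # early exit: target ratio reached
        if b - a < width_threshold:
            break                          # interval already narrow enough
        ranked = sorted(ratios, key=ratios.get, reverse=True)
        new_a, new_b = sorted(ranked[:2])  # two best candidates, ascending
        if (new_a, new_b) == (a, b):
            break                          # assumed guard: no progress
        a, b = new_a, new_b
    return (a + b) / 2                     # fallback: midpoint of final interval


# Synthetic ratio curve peaking at 24K, for illustration only.
def toy_ratio(g: float) -> float:
    return 1.0 / (1.0 + abs(g - 24.0))


print(search_granularity(4, 128, toy_ratio, ratio_target=0.6))   # stops early
print(search_granularity(4, 128, toy_ratio, ratio_target=0.99))  # falls back
```

With a lenient target the loop returns after a few rounds; with a target that is never reached it runs until the interval is narrow and then falls back to the interval midpoint.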
[0076] S103: Determine the optimal deduplication granularity from the optimal deduplication granularity range, and delete duplicate data in the storage space according to the optimal deduplication granularity.
[0077] This embodiment does not limit the specific method for determining the optimal deduplication granularity based on the optimal deduplication granularity range. For example, the deduplication ratio corresponding to each deduplication granularity in the optimal deduplication granularity range can be calculated, and the deduplication granularity corresponding to the maximum deduplication ratio can be taken as the optimal deduplication granularity; or, a certain deduplication granularity in the optimal deduplication granularity range can be randomly selected as the optimal deduplication granularity.
[0078] Furthermore, to improve the efficiency and accuracy of determining the optimal deduplication granularity, the above-mentioned determination of the optimal deduplication granularity from the optimal deduplication granularity range may specifically include:
[0079] The median value of the optimal deduplication granularity range is taken as the optimal deduplication granularity.
[0080] It can be understood that the optimal deduplication granularity range is narrow, so the granularities within it differ only slightly and their deduplication ratios are close to one another. Therefore, the median value of this range can be used directly as the optimal deduplication granularity, which reduces computation while maintaining accuracy.
[0081] It should be noted that data deduplication technology can be performed either immediately upon data writing to storage or via a scheduled task after data has been written. Furthermore, the scheduled method is more conducive to the accuracy of determining the granularity of deduplication. Therefore, this embodiment can perform data deduplication on a scheduled basis to more accurately identify and remove duplicate data. This is particularly effective for long data. Furthermore, the data can be segmented after writing, and then deduplication processing can be performed on the segmented data blocks with finer granularity for even better results.
[0082] Furthermore, the above method may also include:
[0083] Generate data analysis reports at various deduplication granularities. The data analysis reports should include at least the statistics on write granularity, deduplication ratio, storage capacity saved, and amount of deduplicated data within a preset time period.
[0084] Specifically, detailed data analysis reports can be generated at each deduplication granularity. These reports will include statistics on data write granularity, deduplication ratio statistics, and information such as storage capacity saved and amount of deduplicated data after deduplication, in order to conduct comprehensive analysis and comparison.
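One possible shape for such a report record is sketched below. The class `DedupReport`, its field names, and the helper `best_report` are hypothetical; the text only lists which statistics a report should contain, not how they are stored or compared.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DedupReport:
    """Statistics for one deduplication granularity over one reporting period."""
    dedup_granularity: int       # bytes
    write_granularity: int       # bytes
    dedup_ratio: float           # duplicate data deleted / total original data
    capacity_saved_bytes: int    # storage capacity saved
    deduplicated_bytes: int      # amount of deduplicated data
    period_hours: int = 24       # assumed reporting period


def best_report(reports: List[DedupReport]) -> DedupReport:
    """Pick the report with the highest deduplication ratio for comparison."""
    if not reports:
        raise ValueError("no reports to compare")
    return max(reports, key=lambda r: r.dedup_ratio)


reports = [
    DedupReport(4096, 8192, 0.42, 420_000, 420_000),
    DedupReport(16384, 8192, 0.31, 310_000, 310_000),
]
print(best_report(reports).dedup_granularity)  # 4096
```

The figures in the usage example are made up solely to exercise the comparison.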
[0085] Furthermore, after step S103, the above method may further include the following steps:
[0086] Step 31: Obtain the data fingerprint of the data written to the storage space and cache the data fingerprint in the storage space; the data fingerprint is set according to the data segment structure;
[0087] Step 32: Based on the data fingerprint and the metadata management system, write the data remaining in the storage space after deletion to the hard drive.
[0088] Referring to Figure 3, which is a flowchart of a data processing method provided in an embodiment of the present invention: in this embodiment, when data is written to the storage space, the corresponding data fingerprint is also cached in the storage space for later use when the data is written to disk. Because duplicate data has already been removed by the deduplication described above, disk space is saved when the data is written to disk. It should be noted that step 31 is the data fingerprint caching and prefetching process. In this embodiment, the data fingerprint is set according to the data segment structure: because the data segment structures of the data written to the storage space differ from one another, they can serve as data fingerprints. The metadata management system improves data reliability and maintainability through normalized, standardized, and unified management of data. It can also manage data fingerprints in a normalized way to ensure data consistency and accuracy and to facilitate data tracking and auditing, so that the integrity and authenticity of the data can be determined. In addition, it can provide security measures such as data encryption and access control to protect data security and privacy. This embodiment can therefore use data fingerprint technology and the metadata management system to ensure the integrity, authenticity, and traceability of data, further improving the efficiency and reliability of data processing. By introducing data fingerprint caching and prefetching, the metadata management system exploits data locality, which reduces data interference, enables ordered writes to disk, facilitates subsequent data retrieval, and increases the speed of writing to disk, significantly increasing the throughput of deduplicated data.
[0089] Furthermore, step 32 above, based on the data fingerprint and metadata management system, writes the remaining data after deletion from the storage space to the hard disk, which may specifically include the following steps:
[0090] When the hard drive is empty, the remaining data is written to the hard drive based on the data fingerprint.
[0091] When the hard drive is not empty, the remaining data is compared with the data on the hard drive based on the data fingerprint to determine the same data and different data in the remaining data.
[0092] Write different data to the hard drive.
[0093] Specifically, an empty hard drive has available capacity but holds no data. In this case, all of the data remaining after deduplication needs to be written to the disk. When the hard drive already holds data, a similarity comparison can be performed based on the data fingerprints prefetched from the cache. Referring to Figure 4, which is a flowchart of a data write-to-disk process according to an embodiment of the present invention: the remaining data in the storage space is compared with the data on the hard disk for similarity. If identical data is found, the complete data is quickly located using its originally marked position, and the write count is recorded. If no identical data is found, the data is written to the hard disk, and the write count is recorded.
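The write-to-disk decision described above can be sketched with an in-memory stand-in for the hard disk. Everything here is an assumption made for illustration: the `Disk` class, the `flush` function, and in particular the use of a SHA-256 digest as the fingerprint (the embodiment derives fingerprints from the data segment structure, which is not modeled here).

```python
import hashlib
from typing import Dict, Iterable


class Disk:
    """Toy in-memory stand-in for the hard disk (hypothetical)."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}      # fingerprint -> stored data
        self.write_counts: Dict[str, int] = {}  # fingerprint -> write requests


def fingerprint(block: bytes) -> str:
    # Assumed fingerprint: a content hash, not the segment-structure scheme.
    return hashlib.sha256(block).hexdigest()


def flush(remaining: Iterable[bytes], disk: Disk) -> int:
    """Write only blocks not already on disk; return how many were written."""
    written = 0
    for block in remaining:
        fp = fingerprint(block)
        if fp not in disk.blocks:   # always true for an empty disk
            disk.blocks[fp] = block
            written += 1
        # The write count is recorded whether or not the block was new.
        disk.write_counts[fp] = disk.write_counts.get(fp, 0) + 1
    return written


disk = Disk()
print(flush([b"alpha", b"beta"], disk))   # 2: disk was empty, all data written
print(flush([b"beta", b"gamma"], disk))   # 1: only b"gamma" is new
```

The empty-disk case needs no special branch in this sketch, because no fingerprint matches and every block is written.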
[0094] Furthermore, step 32 above, based on the data fingerprint and metadata management system, writes the remaining data after deletion from the storage space to the hard disk, which may specifically include the following steps:
[0095] Step 321: Obtain the number of writes and divide the remaining data into hot data and cold data based on the number of writes;
[0096] Step 322: Write hot data to the center of the disk based on the data fingerprint, and write cold data to the edge of the disk based on the data fingerprint.
[0097] Specifically, hot data refers to data that is accessed frequently and is critical to business and applications; cold data refers to data that is accessed less frequently and is less important to business and applications. This embodiment can classify data into hot and cold data based on data write activity. For example, using a 24-hour cycle, hot and cold data are distinguished based on the number of times the data is written. Frequently read and written hot data is stored at the center of the disk, while less frequently read and written cold data is stored at the edge. This can significantly improve the write speed of data deduplication. When data needs to be read from the hard drive, data with similar fingerprints can be quickly found in the more frequently read locations. This mechanism not only improves deduplication efficiency but also speeds up data flushing, making full use of the storage location on the hard drive, the similarity comparison, and the recorded write counts to improve data management efficiency.
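Steps 321 and 322 can be sketched as a simple classification over write counts. The cutoff `hot_threshold`, the function names, and the `"center"` / `"edge"` labels are hypothetical: the text states only that write counts within a cycle (for example 24 hours) separate hot from cold data, not the specific cutoff or how placement is encoded.

```python
from typing import Dict, List, Tuple


def split_hot_cold(write_counts: Dict[str, int],
                   hot_threshold: int) -> Tuple[List[str], List[str]]:
    """Split fingerprints into (hot, cold) by write count within one cycle."""
    hot = [fp for fp, n in write_counts.items() if n >= hot_threshold]
    cold = [fp for fp, n in write_counts.items() if n < hot_threshold]
    return hot, cold


def plan_placement(write_counts: Dict[str, int],
                   hot_threshold: int) -> Dict[str, str]:
    """Map each fingerprint to a disk region: hot -> center, cold -> edge."""
    hot, cold = split_hot_cold(write_counts, hot_threshold)
    placement = {fp: "center" for fp in hot}
    placement.update({fp: "edge" for fp in cold})
    return placement


counts_24h = {"fp-a": 17, "fp-b": 2, "fp-c": 9}  # made-up write counts
print(plan_placement(counts_24h, hot_threshold=8))
# {'fp-a': 'center', 'fp-c': 'center', 'fp-b': 'edge'}
```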
[0098] The data processing method provided in this invention involves: obtaining a deduplication granularity range; calculating the deduplication ratio corresponding to each deduplication granularity within the range; adjusting the range based on the ratio to obtain an optimal range; determining the optimal granularity from this range; and deleting duplicate data from the storage space based on the optimal granularity. This invention obtains the optimal range by adaptively adjusting the deduplication granularity range based on the deduplication ratio, then further determines the optimal granularity from this range, and finally uses the optimal granularity for data deduplication. This flexible and accurate method for determining the deduplication granularity not only ensures the normal operation of data deduplication but also significantly improves the deduplication ratio, saves storage space, further optimizes deduplication performance, and enables rapid data write-to-disk. Furthermore, this embodiment uses write counts to classify data into hot and cold categories, and the statistics of hot and cold data facilitate rapid data write-to-disk flushing. Additionally, caching the local features (data fingerprints) of prefetched data reduces data interference, enabling fast and orderly data write-to-disk and improving the throughput of deduplicated data. Moreover, generating periodic reports allows for in-depth analysis of business and application data, facilitating the analysis, organization, and optimization of deduplication granularity. This method enables larger-scale and more efficient data storage, further reducing costs and increasing efficiency.
[0099] The data processing apparatus provided in the embodiments of the present invention will be described below. The data processing apparatus described below and the data processing method described above can be referred to in correspondence with each other.
[0100] Referring to Figure 5, which is a schematic diagram of a data processing apparatus provided in an embodiment of the present invention, the apparatus may include:
[0101] The deduplication ratio interval acquisition module 100 is used to acquire the deduplication granularity interval;
[0102] The optimal deduplication granularity interval determination module 200 is used to calculate the deduplication ratio corresponding to the deduplication granularity within the deduplication granularity interval, and adjust the deduplication granularity interval according to the deduplication ratio to obtain the optimal deduplication granularity interval.
[0103] The deduplication module 300 is used to determine the optimal deduplication granularity from the optimal deduplication granularity range, and delete duplicate data in the storage space according to the optimal deduplication granularity.
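As a rough illustration of what deleting duplicates at a given granularity and measuring the resulting deduplication ratio could look like, the sketch below splits a byte string into fixed-size chunks. Several things here are assumptions rather than details from the specification: the function name `deduplicate`, the use of SHA-256 as the fingerprint, fixed-size chunking, and the definition of the ratio as original bytes divided by retained bytes.

```python
import hashlib
from typing import Dict, Tuple


def deduplicate(data: bytes, granularity: int) -> Tuple[Dict[str, bytes], float]:
    """Chunk data at a fixed granularity and keep one copy of each chunk.

    Returns (unique_chunks_by_fingerprint, deduplication_ratio), where the
    ratio is assumed to be original size / retained size.
    """
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    unique: Dict[str, bytes] = {}
    for offset in range(0, len(data), granularity):
        chunk = data[offset:offset + granularity]
        # SHA-256 stands in for the data fingerprint (an assumption).
        unique.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    kept = sum(len(chunk) for chunk in unique.values())
    ratio = len(data) / kept if kept else 1.0
    return unique, ratio


sample = b"ABCD" * 8
chunks_4, ratio_4 = deduplicate(sample, 4)     # chunk size matches the pattern
chunks_32, ratio_32 = deduplicate(sample, 32)  # one large chunk, no duplicates
```

With this toy input, a granularity of 4 bytes collapses the data to a single chunk while a granularity of 32 bytes finds no duplicates, which shows why the ratio depends on the chosen granularity.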
[0104] Based on the above embodiments, the deduplication ratio interval acquisition module 100 may include:
[0105] A granularity determination unit is used to determine the minimum deduplication granularity and the maximum deduplication granularity based on the current write granularity; the minimum deduplication granularity is smaller than the current write granularity;
[0106] The interval determination unit is used to determine the deduplication granularity interval based on the minimum deduplication granularity and the maximum deduplication granularity.
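A minimal sketch of deriving the initial interval from the current write granularity follows. The specification only requires that the minimum deduplication granularity be smaller than the current write granularity; the function name `initial_interval` and the `shrink` and `grow` factors are hypothetical choices for illustration.

```python
from typing import Tuple


def initial_interval(
    write_granularity: int, shrink: int = 4, grow: int = 4
) -> Tuple[int, int]:
    """Return (min_granularity, max_granularity) around the write granularity.

    shrink and grow are hypothetical factors; the source does not say how
    far the interval should extend on either side.
    """
    if write_granularity < 2 or shrink < 2 or grow < 1:
        raise ValueError("expected write_granularity >= 2, shrink >= 2, grow >= 1")
    minimum = max(1, write_granularity // shrink)
    maximum = write_granularity * grow
    # The one constraint stated in the text: the minimum must be smaller
    # than the current write granularity.
    assert minimum < write_granularity
    return minimum, maximum


interval = initial_interval(8192)
```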
[0107] Based on any of the above embodiments, the optimal deduplication granularity interval determination module 200 may include:
[0108] The deduplication ratio calculation unit is used to calculate the deduplication ratio corresponding to the minimum deduplication granularity, the maximum deduplication granularity, and the intermediate deduplication granularity in the deduplication granularity range, respectively.
[0109] An interval update unit is used to update the deduplication granularity interval according to the deduplication ratio;
[0110] An iterative unit is used to repeatedly execute the steps of calculating the deduplication ratio at the minimum deduplication granularity, maximum deduplication granularity, and intermediate deduplication granularity in the deduplication granularity interval, and updating the deduplication granularity interval according to the deduplication ratio, until the difference between the maximum deduplication granularity and the minimum deduplication granularity of the updated deduplication granularity interval meets a preset threshold, and then stop updating to obtain the optimal deduplication granularity interval.
[0111] The interval update unit may include:
[0112] The interval update sub-unit is used to take the deduplication granularity corresponding to the calculated maximum deduplication ratio as the maximum deduplication granularity of the deduplication granularity interval, and take the deduplication granularity corresponding to the calculated second largest deduplication ratio as the minimum deduplication granularity of the deduplication granularity interval.
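The iterative narrowing performed by these units can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the name `narrow_interval`, the integer midpoint, reading "meets a preset threshold" as "difference is at most the threshold", and the `max_rounds` guard are all assumptions. The source assigns the best granularity to the interval maximum and the second best to the minimum; the sketch orders the two by size so that the interval stays well-formed, which is also an assumption.

```python
from typing import Callable, Tuple


def narrow_interval(
    ratio_of: Callable[[int], float],
    lo: int,
    hi: int,
    threshold: int,
    max_rounds: int = 64,
) -> Tuple[int, int]:
    """Shrink [lo, hi] toward the granularities with the best dedup ratio."""
    if not 0 < lo < hi:
        raise ValueError("expected 0 < lo < hi")
    for _ in range(max_rounds):
        if hi - lo <= threshold:       # assumed meaning of "meets threshold"
            break
        mid = (lo + hi) // 2           # intermediate granularity
        if mid == lo:                  # interval too small to split further
            break
        # Rank the three candidates by their deduplication ratio.
        ranked = sorted((lo, mid, hi), key=ratio_of, reverse=True)
        best, second = ranked[0], ranked[1]
        new_lo, new_hi = min(best, second), max(best, second)
        if (new_lo, new_hi) == (lo, hi):  # no progress, stop instead of looping
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def toy_ratio(granularity: int) -> float:
    """Hypothetical ratio curve that peaks at a granularity of 8."""
    return 1.0 / (1 + abs(granularity - 8))


optimal_interval = narrow_interval(toy_ratio, 2, 64, threshold=4)
```

With the toy curve, the interval shrinks from (2, 64) through (2, 33), (2, 17), and (2, 9) to (5, 9). In practice `ratio_of` would measure the deduplication ratio on real data at each candidate granularity, which is the expensive step, so caching results per granularity may be worthwhile.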
[0113] Based on the above embodiments, the deduplication module 300 may include:
[0114] The optimal deduplication granularity determination unit is used to take the median value of the optimal deduplication granularity range as the optimal deduplication granularity.
[0115] Based on the above embodiments, the data processing apparatus may further include:
[0116] A data fingerprint processing module is used to acquire the data fingerprint of the data written to the storage space and cache the data fingerprint in the storage space; the data fingerprint is set according to the data segment structure;
[0117] The data write-to-disk module is used to write the remaining data after deletion in the storage space to the hard disk based on the data fingerprint and metadata management system.
[0118] Based on the above embodiments, the data write-to-disk module may include:
[0119] The first writing unit is used to write all the remaining data to the hard disk according to the data fingerprint when the hard disk is empty;
[0120] The second writing unit is used to compare the remaining data with the data in the hard disk based on the data fingerprint when the hard disk is not empty, to determine the same data and different data in the remaining data, and to write the different data into the hard disk.
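The two writing cases can be illustrated with a small sketch. The dictionary `disk` stands in for the hard disk together with its fingerprint index, and SHA-256 stands in for the data fingerprint; both, along with the names `fingerprint` and `flush_to_disk`, are assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable


def fingerprint(chunk: bytes) -> str:
    """Hypothetical fingerprint: SHA-256 of the chunk contents."""
    return hashlib.sha256(chunk).hexdigest()


def flush_to_disk(remaining: Iterable[bytes], disk: Dict[str, bytes]) -> int:
    """Write remaining chunks to `disk` and return how many were written."""
    chunks = list(remaining)
    written = 0
    if not disk:
        # Empty disk: every remaining chunk is written, keyed by fingerprint.
        for chunk in chunks:
            disk[fingerprint(chunk)] = chunk
            written += 1
        return written
    # Non-empty disk: compare fingerprints and write only the differing data.
    for chunk in chunks:
        key = fingerprint(chunk)
        if key not in disk:
            disk[key] = chunk
            written += 1
    return written


disk: Dict[str, bytes] = {}
first = flush_to_disk([b"alpha", b"beta"], disk)   # disk was empty
second = flush_to_disk([b"beta", b"gamma"], disk)  # b"beta" is already stored
```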
[0121] Based on the above embodiments, the data write-to-disk module may include:
[0122] A data partitioning unit is used to obtain the number of writes and divide the remaining data into hot data and cold data based on the number of writes.
[0123] The third writing unit is used to write the hot data to the center of the disk according to the data fingerprint, and to write the cold data to the edge of the disk according to the data fingerprint.
[0124] It should be noted that the order of the modules and units in the above data processing device can be changed without affecting the logic.
[0125] The data processing apparatus provided in this embodiment of the invention employs a deduplication ratio interval acquisition module 100 to acquire a deduplication granularity interval; an optimal deduplication granularity interval determination module 200 to calculate the deduplication ratio corresponding to the deduplication granularity within the deduplication granularity interval, and adjust the deduplication granularity interval according to the deduplication ratio to obtain an optimal deduplication granularity interval; and a deduplication module 300 to determine the optimal deduplication granularity from the optimal deduplication granularity interval, and delete duplicate data in the storage space according to the optimal deduplication granularity. This invention obtains the optimal deduplication granularity interval by adaptively adjusting the deduplication granularity interval based on the deduplication ratio, and further determines the optimal deduplication granularity from the optimal deduplication granularity interval, using the optimal deduplication granularity for data deduplication. This flexible and accurate deduplication granularity determination method not only ensures the normal operation of data deduplication but also significantly improves the deduplication ratio, saves storage space, further optimizes deduplication performance, and quickly persists data to disk. Furthermore, this embodiment uses write counts to classify data into hot and cold categories, and the statistics of hot and cold data facilitate rapid flushing of data to disk. Additionally, caching the local features (data fingerprints) of prefetched data reduces data interference, enabling fast and orderly writing of data to disk and improving the throughput of deduplicated data. Moreover, generating periodic reports allows for in-depth analysis of business and application data, facilitating the analysis, organization, and optimization of deduplication granularity. This enables larger-scale and more efficient data storage, further reducing costs and increasing efficiency.
[0126] Figure 6 is a schematic diagram of the structure of a data processing device provided in an embodiment of the present invention. As shown in Figure 6, the data processing device includes:
[0127] Memory 60 is used to store computer programs;
[0128] The processor 61 is used to implement the steps of the data processing method as described in the above embodiments when executing a computer program.
[0129] The data processing device provided in this embodiment may include, but is not limited to, smartphones, tablets, laptops, or desktop computers.
[0130] The processor 61 may include one or more processing cores, such as a quad-core processor or an octa-core processor. The processor 61 may be implemented using at least one hardware form selected from Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 61 may also include a main processor and a coprocessor. The main processor, also known as the Central Processing Unit (CPU), is used to process data in the wake-up state; the coprocessor is a low-power processor used to process data in the standby state. In some embodiments, the processor 61 may integrate a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be displayed on the screen. In some embodiments, the processor 61 may also include an Artificial Intelligence (AI) processor, which handles computational operations related to machine learning.
[0131] The memory 60 may include one or more computer-readable storage media, which may be non-transitory. The memory 60 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In this embodiment, the memory 60 is used to store at least the following computer program 601, which, after being loaded and executed by the processor 61, is capable of implementing the relevant steps of the data processing method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 60 may also include an operating system 602 and data 603, and the storage method may be temporary or permanent storage. The operating system 602 may include Windows, Unix, Linux, etc. The data 603 may include, but is not limited to, data related to the data processing method.
[0132] In some embodiments, the data processing device may further include a display screen 62, an input / output interface 63, a communication interface 64, a power supply 65, and a communication bus 66.
[0133] Those skilled in the art will understand that the structure shown in Figure 6 does not constitute a limitation on the data processing device, which may include more or fewer components than illustrated.
[0134] It is understood that if the data processing method in the above embodiments is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and, when run, executes all or part of the steps of the methods in the various embodiments of the present invention. The aforementioned storage medium includes USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, magnetic disks, optical disks, and other media capable of storing program code.
[0135] Based on this, embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the data processing method described above.
[0136] The following describes a computer program product provided by an embodiment of this application. The computer program product described below can be referred to in conjunction with other embodiments described herein.
[0137] A computer program product includes a computer program / instructions that, when executed by a processor, implement the steps of the aforementioned disclosed data processing method.
[0138] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to in the method section.
[0139] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0140] Finally, it should be noted that in this document, relationships such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.
[0141] The above provides a detailed description of a data processing method, apparatus, device, and computer-readable storage medium provided by the present invention. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The description of the above embodiments is only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, for those skilled in the art, there will be changes in specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A data processing method, characterized in that it comprises: obtaining a deduplication granularity interval; calculating the deduplication ratio corresponding to the deduplication granularity within the deduplication granularity interval, and adjusting the deduplication granularity interval according to the deduplication ratio to obtain an optimal deduplication granularity interval; and determining an optimal deduplication granularity from the optimal deduplication granularity interval, and deleting duplicate data in the storage space according to the optimal deduplication granularity; wherein calculating the deduplication ratio corresponding to the deduplication granularity within the deduplication granularity interval, and adjusting the deduplication granularity interval according to the deduplication ratio to obtain the optimal deduplication granularity interval, comprises: calculating the deduplication ratios corresponding to the minimum deduplication granularity, the maximum deduplication granularity, and the intermediate deduplication granularity in the deduplication granularity interval, respectively; updating the deduplication granularity interval according to the deduplication ratios; and repeating the steps of calculating the deduplication ratios at the minimum, maximum, and intermediate deduplication granularities in the deduplication granularity interval and updating the deduplication granularity interval according to the deduplication ratios, until the difference between the maximum deduplication granularity and the minimum deduplication granularity of the updated deduplication granularity interval meets a preset threshold, and then stopping the updating to obtain the optimal deduplication granularity interval; and wherein updating the deduplication granularity interval according to the deduplication ratios comprises: taking the deduplication granularity corresponding to the largest calculated deduplication ratio as the maximum deduplication granularity of the deduplication granularity interval, and taking the deduplication granularity corresponding to the second-largest calculated deduplication ratio as the minimum deduplication granularity of the deduplication granularity interval.
2. The data processing method according to claim 1, characterized in that obtaining the deduplication granularity interval comprises: determining the minimum deduplication granularity and the maximum deduplication granularity based on the current write granularity, the minimum deduplication granularity being smaller than the current write granularity; and determining the deduplication granularity interval based on the minimum deduplication granularity and the maximum deduplication granularity.
3. The data processing method according to claim 1, characterized in that it further comprises: obtaining the data fingerprint of the data written to the storage space, and caching the data fingerprint in the storage space, the data fingerprint being set according to the data segment structure; and writing the data remaining in the storage space after deletion to the hard disk according to the data fingerprint and a metadata management system.
4. The data processing method according to claim 3, characterized in that writing the data remaining in the storage space after deletion to the hard disk according to the data fingerprint and the metadata management system comprises: when the hard disk is empty, writing all of the remaining data to the hard disk according to the data fingerprint; and when the hard disk is not empty, comparing the remaining data with the data in the hard disk based on the data fingerprint to determine the same data and the different data in the remaining data, and writing the different data to the hard disk.
5. The data processing method according to claim 3, characterized in that writing the data remaining in the storage space after deletion to the hard disk according to the data fingerprint and the metadata management system comprises: obtaining the number of writes, and dividing the remaining data into hot data and cold data based on the number of writes; and writing the hot data to the center of the disk according to the data fingerprint, and writing the cold data to the edge of the disk according to the data fingerprint.
6. The data processing method according to claim 1, characterized in that determining the optimal deduplication granularity from the optimal deduplication granularity interval comprises: taking the median value of the optimal deduplication granularity interval as the optimal deduplication granularity.
7. A data processing apparatus, characterized in that it comprises: a deduplication ratio interval acquisition module, used to obtain a deduplication granularity interval; an optimal deduplication granularity interval determination module, used to calculate the deduplication ratio corresponding to the deduplication granularity within the deduplication granularity interval, and adjust the deduplication granularity interval according to the deduplication ratio to obtain an optimal deduplication granularity interval; and a deduplication module, used to determine an optimal deduplication granularity from the optimal deduplication granularity interval, and delete duplicate data in the storage space according to the optimal deduplication granularity; wherein the optimal deduplication granularity interval determination module comprises: a deduplication ratio calculation unit, used to calculate the deduplication ratios corresponding to the minimum deduplication granularity, the maximum deduplication granularity, and the intermediate deduplication granularity in the deduplication granularity interval, respectively; an interval update unit, used to update the deduplication granularity interval according to the deduplication ratios; and an iterative unit, used to repeatedly execute the steps of calculating the deduplication ratios at the minimum, maximum, and intermediate deduplication granularities in the deduplication granularity interval and updating the deduplication granularity interval according to the deduplication ratios, until the difference between the maximum deduplication granularity and the minimum deduplication granularity of the updated deduplication granularity interval meets a preset threshold, and then stop updating to obtain the optimal deduplication granularity interval; and wherein the interval update unit comprises: an interval update sub-unit, used to take the deduplication granularity corresponding to the largest calculated deduplication ratio as the maximum deduplication granularity of the deduplication granularity interval, and take the deduplication granularity corresponding to the second-largest calculated deduplication ratio as the minimum deduplication granularity of the deduplication granularity interval.
8. A data processing device, characterized in that it comprises: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the steps of the data processing method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when loaded and executed by a processor, implement the steps of the data processing method according to any one of claims 1 to 6.