File semantics and system real-time state based redundant data deduplication method

A technology combining redundant data deduplication with file semantics, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses the problems of read performance impact, read operation response delay, and data read performance loss, with the effects of saving storage space, maintaining efficient responsiveness, and increasing flexibility.

Active Publication Date: 2015-11-18
ZHEJIANG UNIV
3 Cites 8 Cited by

AI-Extracted Technical Summary

Problems solved by technology

Although these two solutions meet deduplication requirements to some extent, their designs do not weigh the application scenario of the primary storage system against other factors.
First, data deduplication degrades data read performance. GFD may cause the data accessed by the user to be transmitted from a remote server, which causes delay...

Abstract

The invention discloses a redundant data deduplication method based on file semantics and the real-time state of the system. The method is realized mainly by three functional modules: a deduplication priority calculation module based on multi-semantic dimension division (the MPD module), a hierarchical data deduplication module (the deduplicator), and a deduplication control module based on the real-time state of the system (the controller). Based on multi-dimensional file semantics, the MPD module outputs the file objects that should be deduplicated first. The deduplicator takes this output and executes the hierarchical deduplication policies in sequence, namely global file-level deduplication and local chunk-level deduplication. Meanwhile, while the deduplicator runs, the controller dynamically adjusts it according to the real-time state of the system, so as to save as much storage space as possible while preserving the read request response performance of the distributed primary storage system.

Application Domain

Input/output to record carriers; Special data processing applications

Technology Topic

Data deduplication; Functional module


Example Embodiment

[0068] In order to describe the present invention in more detail, the technical solution of the present invention will be described in detail below with reference to the drawings and specific embodiments.
[0069] As shown in Figure 1, in an actual deployment, the redundant data deduplication scheme of the present invention, based on file semantics and the real-time state of the system, runs in a general distributed primary storage system. The storage system mainly includes a metadata server and an object storage server cluster, where:
[0070] The metadata server is responsible for receiving user requests and directing them to the corresponding object storage server. It is also responsible for monitoring the running status of the entire distributed primary storage system and for maintaining a global index keyed by file name. The index contains the "signature", location information and metadata of each file. The signature uniquely identifies the content of each file and is calculated over the binary content of the file with the SHA-1 algorithm.
[0071] The object storage server cluster contains multiple independent object storage servers. Each server stores a certain number of files, and its local disk stores file data in units of data blocks. Each object storage server is responsible for maintaining an index, in units of data blocks, of the files stored locally. Similarly, this index contains the physical location of each data block on the local disk and the signature calculated with the SHA-1 algorithm over the binary content of the data block.
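The two kinds of signature described above can be illustrated with a minimal sketch. The function names and the 4096-byte block size are assumptions made for illustration; the source specifies only that SHA-1 is computed over the binary content of a file or of a data block.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed block size; the source does not specify one


def file_signature(data: bytes) -> str:
    """SHA-1 over the whole binary content, as kept in the global file index."""
    return hashlib.sha1(data).hexdigest()


def chunk_signatures(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """SHA-1 of each fixed-size block, as kept in a server's local block index."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]
```

Two files or blocks with identical content produce the same signature, which is what lets an index lookup stand in for a byte-by-byte comparison.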
[0072] This embodiment mainly includes three functional modules: a deduplication priority calculation module based on multi-semantic dimension division (the MPD module), a hierarchical data deduplication module (the deduplicator, also called the deduplication device), and a deduplication control module based on the real-time state of the system (the controller). These three functional modules run on the metadata server or on the object storage servers, and can interact with the global file index of the metadata server and the data block index on each object storage server. Specifically:
[0073] The MPD module runs on the metadata server. Using the file information and auxiliary information stored on the metadata server, it calculates a list of files with high deduplication priority. Based on multi-dimensional file semantics, the MPD module periodically outputs a list containing a fixed number of file location records; these files are the objects that the deduplicator should process first. The MPD module of this scheme calculates deduplication priority from three semantic dimensions of a file: its most recent access timestamp, its size, and its type. The MPD module assigns an explicit value to each value range in each dimension, so that every file's attribute in each of the three dimensions maps to a definite value. This value indicates the file's deduplication priority in that dimension: the larger the value, the earlier the file should be deduplicated. The three dimensions carry different weight coefficients, and the final deduplication priority of a file is the weighted sum of its values in the three dimensions. In practical applications, these assignments can be customized for specific service scenarios and requirements. The default division standard and corresponding assignments of this embodiment, together with the weights of the different dimensions, are shown in Table 1:
[0074] Table 1
[0075]
[0076] According to the values in Table 1, the final deduplication priority value of each file = most recent access time priority value * 0.5 + file size priority value * 0.3 + file type priority value * 0.2. The MPD module periodically scans the file index in the metadata server and places the information of the 50 files with the highest deduplication priority into a list, called the deduplication candidate list. Each file's information occupies one row of the list, and each row carries a flag bit. On initialization, every flag is "dirty", meaning that the file in that row has not yet been processed by the deduplicator; once the deduplicator has processed the file in a row, the flag is changed to "clean".
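The weighted-sum calculation and the candidate list can be sketched as follows. The weights (0.5, 0.3, 0.2), the list size of 50, and the "dirty" initial flag come from the text; the `FileRecord` fields and the per-dimension score values are hypothetical stand-ins, since the actual assignments are defined in Table 1.

```python
from dataclasses import dataclass

# Dimension weights and list size stated in the text.
W_ACCESS, W_SIZE, W_TYPE = 0.5, 0.3, 0.2
CANDIDATE_COUNT = 50


@dataclass
class FileRecord:
    location: str
    access_score: float  # hypothetical per-dimension values; Table 1 defines the real ones
    size_score: float
    type_score: float


def dedup_priority(f: FileRecord) -> float:
    """Weighted sum of the three per-dimension priority values."""
    return W_ACCESS * f.access_score + W_SIZE * f.size_score + W_TYPE * f.type_score


def build_candidate_list(files, n=CANDIDATE_COUNT):
    """Return the n highest-priority files, each row initially flagged 'dirty'."""
    ranked = sorted(files, key=dedup_priority, reverse=True)[:n]
    return [
        {"location": f.location, "priority": dedup_priority(f), "flag": "dirty"}
        for f in ranked
    ]
```

Sorting in descending order means the deduplicator can simply walk the list from the first row to start with the file that has the largest priority value.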
[0077] The deduplicator runs on each object storage server. It takes the list output by the MPD module as input and deduplicates the files in the list hierarchically. The deduplicator performs two layers of deduplication: global file-level deduplication (GFD) and local data block-level deduplication (LCD). The deduplicator periodically obtains the latest deduplication candidate list from the MPD module and then starts deduplicating from the file with the largest deduplication priority value. If the controller allows it, the deduplicator first performs GFD on the file. To find out whether redundant copies of the file exist globally, the local deduplicator sends a query to the metadata server; because the global file index is stored on the metadata server, the query reveals whether redundant copies of the file are distributed across different object storage servers. If a redundant copy is found, the current deduplicator sends a deduplication request to the object storage server holding the redundant file, and that server instructs its own local deduplicator to delete the redundant file from its local disk. Once it learns that the redundant file has been deleted, the deduplicator that initiated the request sets the flag of the file's record in the deduplication candidate list to "clean" and notifies the metadata server to update the file index, so that the location information of the file just deleted points to the object storage server on which the initiating deduplicator resides.
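The GFD flow above can be approximated with an in-memory sketch. All data structures here are assumptions for illustration: `global_index` maps a file name to a hypothetical entry with `signature`, `server` and `object` keys, and `servers` maps a server id to the objects it stores. Real deployments would exchange these steps as network requests between the deduplicators and the metadata server.

```python
def global_file_dedup(row, local_server, global_index, servers):
    """Remove copies of row's file held on other servers; return how many were removed.

    row:          candidate-list row with 'name' and 'flag' keys (hypothetical layout)
    global_index: file name -> {'signature', 'server', 'object'} (hypothetical layout)
    servers:      server id -> {object key -> bytes}
    """
    local_entry = global_index[row["name"]]
    removed = 0
    for name, entry in global_index.items():
        if name == row["name"] or entry["signature"] != local_entry["signature"]:
            continue  # not a redundant copy of this file
        if entry["server"] == local_server:
            continue  # assumption: GFD only targets copies on other servers
        # The remote deduplicator deletes its copy from its local disk.
        servers[entry["server"]].pop(entry["object"], None)
        # The metadata server repoints the deleted file at the surviving copy.
        entry["server"] = local_server
        entry["object"] = local_entry["object"]
        removed += 1
    if removed:
        row["flag"] = "clean"  # set once the redundant copies are confirmed deleted
    return removed
```

In this sketch a row stays "dirty" when no redundant copy is found, which follows the wording of the paragraph above; an implementation could equally mark every visited row as processed.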
[0078] After the deduplicator has traversed the deduplication candidate list in GFD mode, it traverses the list again from the beginning in LCD mode, if the controller allows it. The deduplicator implements LCD by searching the data block index on the local object storage server to determine whether any local data block is redundant with one or more blocks of the current file. If redundant data blocks exist, it deletes them, updates the index of the local object storage server accordingly, and marks the flag of the corresponding file entry in the deduplication candidate list as "clean". After the LCD pass has also traversed the whole deduplication candidate list, the local deduplicator notifies the metadata server that the deduplication candidate list for this period has been processed.
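A minimal sketch of the LCD step for one file follows. The layout is assumed for illustration: `disk` maps a block id to its bytes, `block_index` maps a SHA-1 signature to the block id that is kept, and a file is a list of block ids. The sketch also assumes that, before deduplication, each block id is referenced by one file only.

```python
import hashlib


def local_chunk_dedup(file_blocks, block_index, disk):
    """Deduplicate one file's blocks against the local block index.

    file_blocks: list of block ids (rewritten in place to reference shared blocks)
    block_index: SHA-1 signature -> block id kept on disk (hypothetical layout)
    disk:        block id -> bytes
    Returns the number of bytes freed.
    """
    freed = 0
    remap = {}  # block ids already removed in this pass -> surviving block id
    for i, block_id in enumerate(file_blocks):
        if block_id in remap:
            file_blocks[i] = remap[block_id]
            continue
        sig = hashlib.sha1(disk[block_id]).hexdigest()
        kept = block_index.setdefault(sig, block_id)  # register new content
        if kept != block_id:
            freed += len(disk.pop(block_id))  # delete the redundant block
            remap[block_id] = kept
            file_blocks[i] = kept
    return freed
```

The caller would mark the file's row "clean" when the returned value is non-zero and record that value as the volume of redundant data removed.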
[0079] While the deduplicator is running, the controller dynamically adjusts its deduplication strategy according to the real-time state of the system. The controller is a distributed component that runs on the metadata server and on each object storage server at the same time. The part on the metadata server monitors the read request response delay of the entire distributed primary storage system, and the part on each object storage server monitors the storage space occupancy of that server. The controller adjusts the deduplicator according to a set of prioritized demand pairs that the user can set in advance. We call this set of hierarchical, prioritized demand pairs a service level agreement (SLA). This embodiment provides a fixed SLA format, so the user can refine and set the SLA according to this format for use as the controller's input. An SLA in the example format is shown in Table 2:
[0080] Table 2
[0081]
[0082] When a read request from a user arrives at the metadata server, the metadata server takes the current timestamp as the start timestamp of the response to that request, inserts it into the request message, and then forwards the request to the corresponding object storage server. The object storage server that receives the read request reads the requested object in full; the moment the last data block starts to be sent to the user is taken as the end timestamp of the read request. The time between the start timestamp and the end timestamp is the response delay of the request. This delay is captured by the controller sub-component on each object storage server and sent to the controller component on the metadata server. The controller component on the metadata server collects the response delay of every read request reported by the controller components on the object storage servers and, at the end of each fixed period T, calculates the average of all read response delays in that period. This value serves as the read response delay reference value for the period.
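The per-period averaging can be sketched as below. The class and method names are hypothetical, and the sketch assumes the timestamps from different servers are directly comparable; how period T is scheduled is left to the caller.

```python
from statistics import mean


class DelayMonitor:
    """Collects read response delays and averages them once per period T."""

    def __init__(self):
        self._delays = []

    def record(self, start_ts: float, end_ts: float) -> None:
        # Delay = end timestamp (last block starts sending) - start timestamp.
        self._delays.append(end_ts - start_ts)

    def close_period(self):
        """Return the average delay for the period and reset; None if no reads occurred."""
        if not self._delays:
            return None
        avg = mean(self._delays)
        self._delays.clear()
        return avg
```

Returning `None` for an idle period is a design choice of this sketch; the source does not say how a period without read requests is treated.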
[0083] The deduplication ratio is the ratio of the volume of redundant data removed to the total volume of stored data before deduplication. After the deduplicator on an object storage server completes GFD or LCD for a file, it sets the file's flag in the deduplication candidate list to "clean" and records in that file's entry the volume of redundant data removed. When the metadata server reclaims each candidate list that has been fully traversed, it sums the size of the removed data recorded in the list and divides it by the total size of all files in the entire distributed primary storage system, which can be obtained directly from the global file index. The result is the real-time deduplication ratio.
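As a small worked sketch of this calculation, assuming each reclaimed list is a sequence of rows carrying a hypothetical `removed_bytes` field and that the total stored size has already been read from the global file index:

```python
def dedup_ratio(reclaimed_lists, total_stored_bytes):
    """Removed volume across all reclaimed candidate lists / total stored volume."""
    if total_stored_bytes <= 0:
        raise ValueError("total stored size must be positive")
    removed = sum(
        row.get("removed_bytes", 0)  # rows with nothing removed contribute 0
        for rows in reclaimed_lists
        for row in rows
    )
    return removed / total_stored_bytes
```

For example, 400 bytes removed out of 2000 bytes stored gives a ratio of 0.2.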
[0084] In each period T, the controller on the metadata server queries the read response delay value and the deduplication ratio of the entire distributed primary storage system, and then dynamically adjusts the deduplicator according to the specific requirements in the SLA. As shown in Figure 2, the main operations are as follows:
[0085] The controller starts from the SLA level with the lowest priority. If the current read response delay meets or is better than the requirement range of the current SLA level, and the real-time deduplication ratio meets or exceeds the upper limit of the deduplication ratio for the current level, the deduplicator is allowed to continue working normally and the current SLA priority level is raised by one;
[0086] If the current read response delay meets or is better than the requirement range of the current level, but the real-time deduplication ratio is below the lower limit of the deduplication ratio for the current level, the deduplicator is allowed to continue working normally and the current SLA level is retained;
[0087] If the current read response delay does not meet the current requirement range, there are two cases: a) if the current read response delay is greater than 1.1 times the upper limit of the read delay requirement range of the current level, the controller stops all operations of the deduplicator, and if the system's read response delay has not fallen below 1.1 times that upper limit for three consecutive cycles, the current SLA level is lowered by one; b) if the current read response delay is greater than the upper limit of the current requirement range but less than 1.1 times that upper limit, the controller stays at the current level and instructs the deduplicator to stop LCD and retain only GFD.
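The decision rules in [0085] to [0087] can be sketched as a small state machine that is stepped once per period T. The `SlaLevel` fields are a hypothetical stand-in for the SLA format of Table 2. Two cases are not spelled out in the text and are handled by assumption here: a deduplication ratio between the lower and upper limits keeps the current level, and a delay of exactly 1.1 times the upper limit is treated as case b).

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FULL = "GFD+LCD"
    GFD_ONLY = "GFD"
    STOPPED = "stopped"


@dataclass
class SlaLevel:          # hypothetical fields; Table 2 defines the real format
    max_delay: float     # upper limit of the read delay requirement range
    ratio_low: float     # lower limit of the deduplication ratio
    ratio_high: float    # upper limit of the deduplication ratio


class Controller:
    def __init__(self, levels):
        self.levels = levels   # index 0 is the lowest-priority level
        self.level = 0
        self.over_count = 0    # consecutive periods above 1.1 * max_delay

    def step(self, delay: float, ratio: float) -> Mode:
        sla = self.levels[self.level]
        if delay > 1.1 * sla.max_delay:            # case a): stop everything
            self.over_count += 1
            if self.over_count >= 3:               # three consecutive cycles
                self.level = max(self.level - 1, 0)
                self.over_count = 0
            return Mode.STOPPED
        self.over_count = 0
        if delay > sla.max_delay:                  # case b): keep GFD only
            return Mode.GFD_ONLY
        if ratio >= sla.ratio_high:                # delay met, ratio target reached
            self.level = min(self.level + 1, len(self.levels) - 1)
        return Mode.FULL                           # otherwise stay at this level
```

Clamping the level at both ends of the list is also an assumption of this sketch; the text does not say what happens at the highest or lowest SLA level.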
[0088] The foregoing description of the embodiments is intended to help those of ordinary skill in the art understand and apply the present invention. Those skilled in the art can readily make various modifications to the above embodiments and apply the general principles described here to other embodiments without creative work. Therefore, the present invention is not limited to the above embodiments, and improvements and modifications made by those skilled in the art according to the disclosure of the present invention should fall within the protection scope of the present invention.
