Method and apparatus for determining an image deduplication dictionary and for storing image files
By dividing the image file into data blocks and using the data block information for clustering, an image deduplication dictionary is constructed, which solves the problem of redundant data in container image storage and achieves efficient image file deduplication and storage.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
- Filing Date
- 2022-08-18
- Publication Date
- 2026-06-16
Figure CN115344532B_ABST
Abstract
Description
Technical Field
[0001] One or more embodiments of this specification relate to the field of computer technology, and in particular to a method and apparatus for determining an image deduplication dictionary and storing image files.
Background Technology
[0002] Deduplication refers to deleting or merging duplicate data. Deduplication reduces the amount of storage media required, thereby lowering storage and computing costs. Container technology can effectively divide the resources of a single operating system into isolated groups to better balance conflicting resource usage demands among these groups. For example, virtual machines and Docker are specific implementations of container technology. Container startup often depends on images. For instance, most virtual machine disk data is stored as images on local or network storage. Each image typically occupies tens of gigabytes or more of disk space. Most virtual machines may have the same operating system installed, or they may be clones of the same virtual machine. Therefore, most images contain the same data, and each image stores its own copy of that data, resulting in excessive redundant data in the storage system and severely impacting storage and usage efficiency.
Summary of the Invention
[0003] One or more embodiments of this specification describe a method and apparatus for determining an image deduplication dictionary and storing image files, in order to solve one or more of the problems mentioned in the background.
[0004] According to a first aspect, a method for determining an image deduplication dictionary is provided, wherein the image file corresponding to a single image is stored in the form of at least one data block, each data block corresponds to data block information uniquely determined according to a predetermined method, and the data block information is recorded in the data block information set of the corresponding image file. The method includes: obtaining the data block information sets corresponding to several images, including a first image; clustering the several images based on the similarity of the data block information sets to obtain at least one image category, wherein the first image corresponds to a first image category among the at least one image category; for the first image category, determining a corresponding first category deduplication set based on the occurrence frequency of each data block, wherein the occurrence frequency of a single data block is determined based on the data block information in the data block information sets, and the corresponding data block information is added to the first category deduplication set when the occurrence frequency of the single data block in the at least one image corresponding to the first image category meets a first predetermined condition; excluding the data blocks indicated by the first category deduplication set, determining a first predicted deduplication set for the first image according to the importance of the other data blocks; and constructing a data block deduplication dictionary using the first category deduplication set and the first predicted deduplication set, for deduplication during the image file update process of subsequent versions of the first image.
[0005] In one embodiment, the first image corresponds to multiple versions of image files, and the step of determining the first predictive deduplication set for the first image based on the importance of other data blocks includes: determining the importance scores of each data block according to the time order of each version of the image file based on the subset of data block information corresponding to each version of the image file; and selecting the corresponding data block information to be added to the corresponding predictive deduplication set based on the magnitude of each importance score.
[0006] In one embodiment, determining the importance score corresponding to each data block according to the time order of each version includes: for a single data block, based on the exponential smoothing formula, iteratively weighting its frequency of occurrence in each version according to the time order of the version to obtain the corresponding importance score. In the iterative weighting process, the initial score is 0, the weight of the accumulated score is a, the weight of the current version is 1-a, where a is a number greater than 0.5 and less than 1. If the data block appears in the current version of the image file, its frequency of occurrence is 1; otherwise, it is 0.
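The iterative weighting described in [0006] can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the embodiment: the function name `importance_score`, the default `a = 0.7`, and the sample version history are hypothetical, and the recurrence `score = a * score + (1 - a) * occurrence` is the usual exponential smoothing form that matches the weights given above.

```python
from typing import Iterable, Set


def importance_score(block_info: str, versions: Iterable[Set[str]], a: float = 0.7) -> float:
    """Exponentially smoothed occurrence of one data block across versions.

    `versions` holds one data block information set per version, ordered
    from oldest to newest. `a` weights the accumulated score and (1 - a)
    weights the current version, with 0.5 < a < 1.
    """
    if not 0.5 < a < 1:
        raise ValueError("a must be greater than 0.5 and less than 1")
    score = 0.0  # the initial score is 0
    for blocks in versions:
        # Occurrence is 1 if the block appears in this version, else 0.
        occurrence = 1.0 if block_info in blocks else 0.0
        score = a * score + (1 - a) * occurrence
    return score


# Hypothetical history of three versions, oldest first.
history = [{"h1", "h2"}, {"h1", "h3"}, {"h1", "h3"}]
scores = {h: importance_score(h, history) for h in ("h1", "h2", "h3")}
```

In this sample, `h1` (present in every version) scores highest, `h3` (present in the two newest versions) comes next, and `h2` (present only in the oldest version) scores lowest.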
[0007] In one embodiment, the predetermined method is: processing a single data block according to a predetermined hash algorithm, and the corresponding data block information is the hash value obtained after processing.
[0008] In one embodiment, when the first image corresponds to multiple versions of image files, obtaining the data block information sets corresponding to each image, including the first image, includes: obtaining the corresponding data block information sets for the latest multiple versions of the first image.
[0009] In one embodiment, the similarity between pairs of data block information sets is determined by the ratio of the number of identical data block information entries to the total number of data block information entries.
[0010] In one embodiment, clustering the several images based on the similarity of data block information sets to obtain at least one image category includes: determining the distance between each pair of images based on the similarity of their data block information sets, wherein the similarity between a pair of data block information sets is negatively correlated with the distance between the corresponding pair of images; and clustering the images according to the pairwise distances using a distance-related clustering method to obtain the at least one image category.
[0011] In one embodiment, the clustering method is density-based spatial clustering of applications with noise, implemented using one of the following algorithms: DBSCAN, OPTICS, or DENCLUE.
[0012] According to a second aspect, a method for storing an image file is provided, comprising: dividing a target image file to be stored into multiple data blocks and determining a corresponding first data block information set, wherein the first data block information set includes data block information that uniquely describes each data block according to a predetermined method; matching the first data block information set with elements in a data block deduplication dictionary of the target image, wherein the data block deduplication dictionary includes: a second category deduplication set based on a second image category corresponding to the target image, and a second predicted deduplication set determined based on historical versions of the target image; and storing the image file to be stored according to the matching result.
[0013] In one embodiment, storing the image file to be stored according to the matching result includes: if a matching element exists, filtering out the data block indicated by the matching element from the plurality of data blocks and storing the remaining data blocks; otherwise, storing all of the plurality of data blocks.
[0014] In one embodiment, storing the image file to be stored according to the matching result includes: storing the first data block information set corresponding to the image file of the latest version of the target image.
[0015] According to a third aspect, a device for determining an image deduplication dictionary is provided, wherein the image file corresponding to a single image is stored in the form of at least one data block, each data block corresponds to data block information uniquely determined according to a predetermined method, and the data block information is recorded in the data block information set of the corresponding image file, the device comprising:
[0016] An acquisition unit configured to acquire the data block information sets corresponding to several images, including a first image, wherein the image file corresponding to a single image is stored in the form of at least one data block, and each element in the corresponding data block information set uniquely describes one data block;
[0017] A clustering unit configured to cluster the several images based on the similarity of the data block information sets to obtain at least one image category, wherein the first image corresponds to a first image category among the at least one image category;
[0018] A mining unit configured to determine, for the first image category, a corresponding first category deduplication set based on the occurrence frequency of each data block, wherein the occurrence frequency of a single data block is determined based on the data block information in the data block information sets, and if the occurrence frequency of a single data block in the at least one image corresponding to the first image category meets a first predetermined condition, the corresponding data block information is added to the first category deduplication set;
[0019] A prediction unit configured to determine, excluding the data blocks indicated by the first category deduplication set, a first predicted deduplication set for the first image based on the importance of the other data blocks;
[0020] The construction unit is configured to construct a data block deduplication dictionary using the first category deduplication set and the first prediction deduplication set, for use in deduplication during the update process of the image file of subsequent versions of the first image.
[0021] According to the fourth aspect, an apparatus for storing image files is provided, comprising:
[0022] The processing unit is configured to divide the target image's image file to be stored into multiple data blocks and determine a corresponding first data block information set, wherein the first data block information set includes each data block information that uniquely describes each data block in a predetermined manner.
[0023] The matching unit is configured to match the first data block information set with elements in the data block deduplication dictionary of the target image, wherein the data block deduplication dictionary includes: a second category deduplication set based on the second image category corresponding to the target image, and a second predicted deduplication set determined based on the historical versions of the target image;
[0024] The storage unit is configured to store the image file to be stored based on the matching result.
[0025] According to a fifth aspect, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of the first or second aspect.
[0026] According to a sixth aspect, a computing device is provided, including a memory and a processor, characterized in that the memory stores executable code, and when the processor executes the executable code, it implements the method of the first aspect or the second aspect.
[0027] The methods and apparatus provided in the embodiments of this specification rely on storing image files in a format in which data blocks and data block information are stored separately. The data in the image file is divided into multiple data blocks, and the unique description information (data block information) corresponding to each data block can be stored as the image's metadata. Thus, in the process of determining the image deduplication dictionary, data block analysis can be performed on metadata with a relatively small data volume, reducing the amount of data processing. On the one hand, images are clustered using the data block information in the metadata, so that category deduplication sets can be mined per category. On the other hand, after the data block information corresponding to the category deduplication set is excluded, data blocks that may be reused in the future are predicted for a single image based on the data block information in its historical versions, and the corresponding data block information constitutes the predicted deduplication set. The category deduplication set and the predicted deduplication set together constitute the image's deduplication dictionary. When the image is updated, the data to be stored is processed into the data block and metadata format, and the data block information contained in the metadata is compared with the image's deduplication dictionary. This allows reused data blocks to be identified with less data processing, avoiding redundant storage of reused data blocks and improving the efficiency of image storage and usage.
Description of the Drawings
[0028] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0029] Figure 1 illustrates an implementation architecture of the technical concept described in this specification.
[0030] Figure 2 is a flowchart of a method for determining an image deduplication dictionary according to one embodiment.
[0031] Figure 3 is a flowchart of the process of determining the image deduplication dictionary in a specific example.
[0032] Figure 4 is a flowchart of a method for storing image files according to one embodiment.
[0033] Figure 5 is a schematic block diagram of an apparatus for determining an image deduplication dictionary according to one embodiment.
[0034] Figure 6 is a schematic block diagram of an apparatus for storing image files according to one embodiment.
Detailed Description
[0035] The technical solutions provided in this specification are described below with reference to the accompanying drawings.
[0036] Conventional container image deduplication includes deduplication at the image file level and deduplication of data blocks within the image.
[0037] In file-level, layer-based deduplication, modifying the metadata of a file within a layer creates a new layer, and even if the other file data remains unchanged, a duplicate copy of that data is stored in the image center. The image center is a data center used to store images. In Dockerfile-based builds, the degree of data duplication also depends largely on how well the user writes the Dockerfile; a poorly written Dockerfile can produce a large number of duplicate data layers, for instance by running `touch` on an existing file.
[0038] When deduplicating data blocks within an image, the container image file can be divided into multiple data blocks, and a digest of each data block can be determined based on hash operations or other methods. Duplicate parts can then be removed by comparing the digests.
[0039] In summary, conventional container image deduplication methods typically deduplicate existing data, which may require processing a large volume of data.
[0040] In view of this, this specification provides a prediction-based image deduplication scheme. Working from the image's metadata, which is relatively small, the scheme predicts data that has a high probability of being reused by the image, and this prediction is then used for deduplication when newly generated image file versions are stored. The image's metadata records information about the data within the image file, such as the image file size, storage location, number of data blocks, data block names, and unique data block information.
[0041] The technical concept of this specification is described below with reference to Figure 1. As shown in Figure 1, under the technical concept of this specification, the image file can first be stored in the form of metadata and data blocks. That is, the image file is divided into multiple data blocks for storage, and the corresponding metadata records the image version information and the data block description information. The data block description information (hereinafter referred to as data block information) uniquely describes a data block; for example, the hash value calculated from the data block according to a predetermined hash algorithm can be used as the corresponding description information. This description information can be part of the metadata of the corresponding image. Thus, when image data is deduplicated, deduplication can be performed at the data block level based on the metadata. Moreover, since the amount of metadata is much smaller than the image itself, the amount of data processing is greatly reduced.
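As an illustration of the metadata and data block format described above, the following sketch models the metadata of one image version. All class and field names (`BlockEntry`, `ImageVersionMetadata`, `image_id`, and so on) are hypothetical; the specification lists the kinds of information recorded but does not prescribe a schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class BlockEntry:
    name: str    # data block name
    digest: str  # data block information, e.g. a hash of the block's content


@dataclass
class ImageVersionMetadata:
    image_id: str          # hypothetical identifier of the image
    version: int           # image version information
    file_size: int         # image file size in bytes
    storage_location: str  # where the data blocks are stored
    blocks: List[BlockEntry] = field(default_factory=list)

    @property
    def block_count(self) -> int:
        """Number of data blocks recorded for this version."""
        return len(self.blocks)

    def block_info_set(self) -> Set[str]:
        """Data block information set used for comparison and deduplication."""
        return {entry.digest for entry in self.blocks}


meta = ImageVersionMetadata(
    image_id="image-1",
    version=3,
    file_size=8192,
    storage_location="/images/image-1/v3",
    blocks=[BlockEntry("block-0", "aa"), BlockEntry("block-1", "bb")],
)
```

Only `block_info_set()` is needed for the dictionary construction and matching steps, which is why those steps can operate on metadata alone.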
[0042] As shown in Figure 1, box 101 illustrates the process of constructing the deduplication dictionary. Specifically, under the technical concept of this specification, on the one hand, images can be divided into multiple categories through clustering, and a corresponding category deduplication set can be determined for each category; on the other hand, for a single image, a corresponding predicted deduplication set is determined from its existing versions as a prediction for future versions. For a single image (such as image 1), a personalized deduplication dictionary can then be constructed from the corresponding category deduplication set and the image's own predicted deduplication set.
[0043] Furthermore, each update of an image can correspond to a new version, so a single image can correspond to multiple iterative image file versions. When a single image is updated (e.g., image 1 is updated), the data block information set of the corresponding image file can be matched against the corresponding deduplication dictionary (e.g., deduplication dictionary 1 in Figure 1) to remove duplicates. Data blocks not covered by the deduplication dictionary are then stored for the corresponding new version, such as version N1+1 of image 1. In this way, not only can existing image data be deduplicated, but information on data blocks with a high reuse frequency in the image can also be predicted, reducing redundant storage of reused data blocks when new versions of the image are stored.
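The matching step for a new version can be sketched as below. This is an illustrative sketch only: the function name `store_new_version`, the sample data block information values, and the representation of blocks as a mapping from data block information to content are assumptions. It also assumes that blocks matched by the dictionary are already available in storage, so only unmatched blocks are written.

```python
from typing import Dict, List, Set, Tuple


def store_new_version(
    blocks: Dict[str, bytes],
    dedup_dictionary: Set[str],
) -> Tuple[Dict[str, bytes], List[str]]:
    """Return the blocks that still need to be written, plus the block
    information list to record in the new version's metadata."""
    # Filter out blocks whose information matches a dictionary element.
    to_write = {info: data for info, data in blocks.items()
                if info not in dedup_dictionary}
    # The metadata still records every block of the new version.
    metadata_block_info = list(blocks)
    return to_write, metadata_block_info


# Hypothetical dictionary: category deduplication set plus predicted set.
category_set = {"h_os", "h_libc"}
predicted_set = {"h_app_core"}
dedup_dictionary = category_set | predicted_set

new_version = {"h_os": b"...", "h_app_core": b"...", "h_new": b"new data"}
to_write, block_info = store_new_version(new_version, dedup_dictionary)
```

Here only the block described by `h_new` is written, while the metadata of the new version lists all three blocks.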
[0044] It is worth noting that, in practice, the deduplication dictionary construction scheme shown in box 101 can be executed incrementally or in full at a certain frequency to update the deduplication dictionary, while the storage of a single image relies on the current deduplication dictionary.
[0045] The technical concept of this specification is further described below with reference to the processes shown in Figures 2-4.
[0046] Figure 2 illustrates the process for determining the image deduplication dictionary, which corresponds, for example, to the portion shown in box 101 of Figure 1. The entity executing this process can be any computer, device, or server with computing capability, such as an image center that stores image files.
[0047] It should be noted beforehand that the process for determining the image deduplication dictionary relies on image files being stored as data blocks together with the corresponding data block information. A data block is the unit of data storage. It can have a fixed size, such as 4 kilobytes (4 KB), or its size can be determined by the actual size of the data. For example, if a data block describes information such as the container's startup location, historical startup speed, and historical usage frequency, then the size of the data block is determined by the number of bytes occupied by this information. An image file can be divided into multiple data blocks. Under the technical concept of this specification, in order to describe the image with lighter-weight data, each data block is described by one uniquely corresponding piece of data block information. To ensure that the data block information accurately describes the data contained in the data block, the data block itself can be processed in a predetermined way to obtain the data block information. For example, a predetermined hash method can be used to calculate a 64-byte value that describes the data block. In this way, data blocks with completely identical data yield identical data block information, while different data blocks yield different data block information.
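A minimal sketch of fixed-size splitting and hashing follows. The specification does not name a hash algorithm; SHA-512 is used here only as an assumption because its digest is 64 bytes long. The function name `split_and_describe` and the sample data are likewise hypothetical.

```python
import hashlib
from typing import List, Tuple

BLOCK_SIZE = 4 * 1024  # fixed-size 4 KB blocks


def split_and_describe(data: bytes, block_size: int = BLOCK_SIZE) -> List[Tuple[str, bytes]]:
    """Split data into fixed-size blocks and pair each block with its digest.

    The last block may be shorter than block_size.
    """
    described = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # SHA-512 is an assumed choice; its digest is 64 bytes (128 hex characters).
        described.append((hashlib.sha512(block).hexdigest(), block))
    return described


# Two identical 4 KB blocks followed by a short, different block.
sample = b"a" * 4096 + b"a" * 4096 + b"b" * 100
described = split_and_describe(sample)
block_info_set = {digest for digest, _ in described}
```

The two identical blocks produce the same digest, so the data block information set of this sample contains two elements even though the file has three blocks.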
[0048] Since a single image can correspond to multiple data blocks, the information for each data block can form a data block information set. It's important to note that "data block information set" doesn't necessarily mean storing the individual data block information as a collection. The data block information set can be stored in any format, such as through image identifiers, storage area partitions, arrays, vectors, tables, etc. This allows for comparison of data block information instead of comparing individual data blocks, reducing the amount of data to be processed.
[0049] As shown in Figure 2, the process for determining the image deduplication dictionary may include: Step 201, obtaining the data block information sets corresponding to several images, including a first image; Step 202, clustering the several images based on the similarity of the data block information sets to obtain at least one image category, wherein the first image corresponds to a first image category among the at least one image category; Step 203, for the first image category, determining a corresponding first category deduplication set based on the frequency of occurrence of each data block, wherein the frequency of occurrence of a single data block is determined based on the data block information in the data block information sets, and if the frequency of occurrence of a single data block in the at least one image corresponding to the first image category meets a first predetermined condition, the corresponding data block information is added to the first category deduplication set; Step 204, excluding the data blocks indicated by the first category deduplication set, determining a first predicted deduplication set of data block information for the first image according to the importance of the other data blocks; Step 205, constructing a data block deduplication dictionary using the first category deduplication set and the first predicted deduplication set, for deduplication during the image file update process of subsequent versions of the first image.
[0050] First, in step 201, the information sets of each data block corresponding to several images, including the first image, are obtained.
[0051] The first image can be any container image. In the process illustrated in Figure 2, for ease of description, any one of the images is referred to as the first image and is used as the example throughout. The data block information set corresponding to a single image stores the data block information of the data blocks in that image. Each piece of data block information uniquely describes one data block.
[0052] A single image (such as the first image) can correspond to multiple versions. Therefore, the data block information set corresponding to a single image can include data block information from multiple versions (such as 3, 5, or 10). Typically, the data block information of the versions closest to the current time is retrieved. The number of versions retrieved can be the same or different for each image. For example, when data block information is retrieved by looking back over a predetermined period of time, the number of versions for each image depends on how many versions were actually generated in that period. When the data block information set is instead retrieved for a predetermined number of versions, the number of versions can be the same for every image. Within a single data block information set, data block information can be distinguished according to the stored version of the image. According to an optional embodiment, if data block information is duplicated across multiple versions of a single image, the duplicates can be merged within the data block information set of that image.
[0053] In one embodiment, it can also be detected first whether each image is stored in a format that includes data and metadata (including data block information). If it is stored in this format, the data block information set of each image is obtained; if it is not stored in this format, the corresponding image can be stored and processed in this format first, and the data block information set corresponding to each image can be obtained.
[0054] Next, in step 202, the several images are clustered based on the similarity of their data block information sets to obtain at least one image category.
[0055] It's understandable that if the data block information sets corresponding to the images of two containers are similar, then the amount of duplicate data blocks in these two container images is relatively large, meaning that data blocks appearing in one image are likely to appear in another. Furthermore, data blocks that appear frequently in a class of images have a higher probability of appearing in subsequent versions of that class of images. Therefore, clustering images can uncover more data blocks that appear frequently in that class of images. For example, if images A, B, C, and D each have 40% identical data blocks, and images A, B, and C also have identical data blocks a and b, then a and b have a high probability of appearing in image D. Through clustering, these data blocks a and b can be extracted and used for deduplication queries in images A, B, C, and D. This avoids redundant storage of data blocks a, b, etc., in image D.
[0056] For data sets with the same number of elements, similarity can be measured using parameters such as cosine similarity, variance, Euclidean distance, information entropy, and correlation coefficient. For data sets with different numbers of elements, similarity can be measured using methods such as the Jaccard coefficient. Under the Jaccard coefficient, the similarity between set A and set B is |A ∩ B| / |A ∪ B|, where |A ∩ B| is the number of elements common to both sets and |A ∪ B| is the total number of elements in the two sets after identical elements are merged.
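The Jaccard coefficient described above can be computed directly on two data block information sets. The function name and the handling of two empty sets are assumptions made for this sketch.

```python
from typing import Set


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Number of shared elements divided by the size of the merged set."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Two of the four distinct elements are shared.
similarity = jaccard_similarity({"h1", "h2", "h3"}, {"h2", "h3", "h4"})
```

For the two sample sets, the intersection has 2 elements and the union has 4, so the similarity is 0.5.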
[0057] Based on the similarity results, the images can be clustered, with highly similar images grouped into one category. In one embodiment, clustering can be performed directly on similarity, for example by grouping two images into one category when the similarity of their data block information sets exceeds a predetermined threshold. In another embodiment, the images can be mapped to points in space and clustered according to the distance between points using density-based spatial clustering of applications with noise. Distance-based clustering places images with more similar data block information sets closer together, which helps achieve the expected clustering result. Examples of such clustering methods include the DBSCAN algorithm, the OPTICS algorithm, and the DENCLUE algorithm. Since images with more similar data block information sets are closer together, similarity is negatively correlated with distance. In an optional example, the distance between two images A and B can be 1-α, where α is the similarity between the two corresponding data block information sets.
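The threshold-based embodiment can be sketched with the distance 1-α and a union-find structure. Note that this is a simple threshold-linkage grouping, not DBSCAN, OPTICS, or DENCLUE; a density-based algorithm could consume the same pairwise distances. The function names, the `max_distance` value, and the sample images are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_images(block_sets: Dict[str, Set[str]], max_distance: float = 0.5) -> List[Set[str]]:
    """Group images whose pairwise distance (1 - similarity) is at most max_distance.

    Images linked through a chain of close pairs end up in the same category.
    """
    parent = {name: name for name in block_sets}

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(block_sets, 2):
        distance = 1.0 - jaccard(block_sets[u], block_sets[v])
        if distance <= max_distance:
            parent[find(u)] = find(v)  # merge the two groups

    groups: Dict[str, Set[str]] = {}
    for name in block_sets:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


images = {
    "img1": {"a", "b", "c"},
    "img2": {"a", "b", "d"},
    "img3": {"x", "y"},
    "img4": {"x", "y", "z"},
}
categories = cluster_images(images, max_distance=0.5)
```

With these sample sets, `img1` and `img2` form one category and `img3` and `img4` form another, since no block information is shared across the two pairs.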
[0058] In this way, images whose data block information sets are highly similar can be clustered into a single category. For example, in Figure 1, image 1 and image n can belong to the same category, while image 2 and image 3 can belong to another category. For ease of description below, the image category corresponding to the first image is denoted as the first image category. Because the first image is arbitrary, the first image category is also general; in other words, the subsequent processing for the first image category also applies to the other image categories.
[0059] Then, in step 203, for the first image category, the corresponding first category deduplication set is determined based on the frequency of occurrence of each data block.
[0060] Generally, among the images of one category, the more frequently a data block occurs, the more widely it is used; system data, for example, is more likely to be repeated in subsequent versions. Therefore, the frequency of a data block can be counted using the data block information sets corresponding to a category. For instance, each time a value appears in the data block information sets corresponding to the category, the frequency of the corresponding data block is incremented by 1. In possible implementations, the frequency can also be normalized, for example by dividing the count by the total number of values in the data block information sets.
[0061] Furthermore, for the first image category, based on the data block information sets of the images within this category, a portion of the data block information (such as hash values) can be selected in descending order of frequency of occurrence and added to the category deduplication set corresponding to the first image category. This set is denoted as the first category deduplication set. For example, data blocks with a frequency greater than a predetermined threshold (such as 5) can be selected as reference data blocks for deduplication in the first image category, and their data block information can be added to the first category deduplication set. As another example, a predetermined number (such as 100) of the most frequently occurring data blocks can be selected as reference data blocks for deduplication, and their data block information can be added to the first category deduplication set. The first category deduplication set can be used for deduplication of each image within the first image category. A similar method can be used to determine the corresponding category deduplication set for each of the other image categories.
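Both selection rules described above (a frequency threshold or a fixed number of top entries) can be sketched with a counter. The function name `category_dedup_set`, its parameters, and the sample sets are hypothetical; each image is assumed to contribute at most one count per data block.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def category_dedup_set(
    block_sets: Iterable[Set[str]],
    threshold: Optional[int] = None,
    top_k: Optional[int] = None,
) -> Set[str]:
    """Select data block information for one image category by frequency.

    Pass `threshold` to keep blocks whose count is greater than the threshold,
    or `top_k` to keep the top_k most frequent blocks.
    """
    counts: Counter = Counter()
    for blocks in block_sets:
        counts.update(blocks)  # adds 1 for every block in this image's set
    if threshold is not None:
        return {info for info, count in counts.items() if count > threshold}
    if top_k is not None:
        return {info for info, _ in counts.most_common(top_k)}
    raise ValueError("specify either threshold or top_k")


# Hypothetical category with three images.
category_images = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "e"}]
by_threshold = category_dedup_set(category_images, threshold=1)
by_rank = category_dedup_set(category_images, top_k=1)
```

In this sample, `a` occurs in three images and `b` in two, so the threshold rule with a threshold of 1 selects both, while the top-1 rule selects only `a`.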
[0062] On the other hand, in step 204, after the data blocks indicated by the category deduplication set corresponding to the first image category are filtered out, a first predicted deduplication set of data block information is determined for the first image according to the importance of the other data blocks.
[0063] A deduplication set determined per image category has a broader scope, and more comprehensive information on potentially reusable data blocks, than one determined from a single image. Data blocks corresponding to a category deduplication set have a higher probability of being reused within that category. However, some data blocks may be unique to a single image, so their frequency of occurrence may not meet the required criteria when the deduplication set is determined by category. It is therefore necessary to further predict the importance of the other data blocks within the image's own data block information set, after excluding the data blocks of the category deduplication set to reduce interference. The data block information corresponding to the more important data blocks is then selected and added to the predicted deduplication set specific to that single image. This step is again described using the first image as an example.
[0064] In one embodiment, for the data block information sets of the most recent multiple versions of the first image, after excluding the data block information in the first category deduplication set, the frequency of occurrence of the other data blocks can be counted using the remaining data block information to determine the corresponding importance. Data blocks with a frequency higher than a small predetermined threshold (such as 2), or ranked higher by frequency, are added to the corresponding predictive deduplication set, which is called the first predictive deduplication set.
[0065] In another embodiment, considering that some computer technologies are updated and replaced over time, data blocks that appear only in earlier versions have a small probability of appearing in later versions. Therefore, the importance score of each data block can be determined according to the time sequence of the versions, and the corresponding data block information can be selected and added to the first prediction deduplication set based on the importance scores.
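The frequency-based embodiment above might look like the following sketch. The function name `predicted_set_by_frequency`, the sample hash strings, and the default threshold are illustrative assumptions.

```python
from collections import Counter

def predicted_set_by_frequency(version_sets, category_set, min_freq=2):
    """Count per-version occurrences of blocks that are not in the category set."""
    counts = Counter()
    for info_set in version_sets:
        # Exclude information already covered by the category deduplication set.
        counts.update(set(info_set) - set(category_set))
    # Keep blocks whose frequency is higher than the small threshold.
    return {h for h, c in counts.items() if c > min_freq}

# Hypothetical information sets for the three most recent versions of one image.
versions = [{"h1", "u1", "u2"}, {"h1", "u1"}, {"h1", "u1", "u3"}]
first_predicted = predicted_set_by_frequency(versions, category_set={"h1"})
```

Here `h1` is ignored because the category set already covers it, and only `u1`, which appears in all three versions, passes the threshold.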
[0066] For example, the weights of data blocks can be determined according to the chronological order of each version, thus weighting the importance score of each data block. Generally, the earlier the version, the lower the importance weight of the corresponding data block, and the later the version, the higher the importance weight of the corresponding data block. For example, the weights of the frequency of a data block appearing in 10 versions are 0.01, 0.03, 0.08, 0.15, 0.25...0.4, and so on. By accumulating the weights of whether a data block appears in each version, the importance score of each data block can be obtained.
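A minimal sketch of this weight accumulation follows. The four weights used here are hypothetical and are not the values listed in the text; the only property carried over is that later versions receive larger weights.

```python
def weighted_importance(version_sets, weights):
    """Sum, for each block, the weights of the versions in which it appears.

    version_sets and weights are both ordered from the earliest to the latest version.
    """
    if len(version_sets) != len(weights):
        raise ValueError("one weight is required per version")
    scores = {}
    for info_set, weight in zip(version_sets, weights):
        for h in set(info_set):
            scores[h] = scores.get(h, 0.0) + weight
    return scores

# Hypothetical example: four versions, earliest first, with increasing weights.
versions = [{"a", "b"}, {"a"}, {"a", "c"}, {"c"}]
weights = [0.1, 0.2, 0.3, 0.4]
scores = weighted_importance(versions, weights)
```

In this example block `c` appears in only two versions but outranks `a`, which appears in three, because `c` appears in the two most recent versions.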
[0067] For example, based on the exponential smoothing formula, the frequency of occurrence of the corresponding data block information in each data block information set can be iteratively weighted according to version time order to obtain the importance score for each data block. Exponential smoothing is a special type of weighted moving average that gives greater weight to data appearing later in the sequence. For example: $S_t = aY_t + (1-a)S_{t-1}$, where $a$ is the smoothing coefficient, $Y_t$ represents the frequency of the data block appearing in the current version, $S_{t-1}$ represents the importance score accumulated prior to the current version, and $S_t$ represents the importance score accumulated through the current version. Typically, the value of $Y_t$ is either 0 or 1: if the data block appears in the current version, then $Y_t = 1$; otherwise, $Y_t = 0$. Additionally, for the earliest version, $S_{t-1} = S_0$ is initialized to 0, and the value of $a$ is usually greater than 0.5 and less than 1 (e.g., 0.8) to give higher weight to later versions. By shifting the weights according to the time sequence, the importance score of each data block can be obtained.
[0068] The frequency statistics and moving weighting mentioned above can all be completed using the data block information set, significantly reducing the amount of data processing compared to comparing individual data blocks. With the importance scores corresponding to each data block, the data block information of the data blocks with the highest importance scores can be added to the first prediction deduplication set. Specifically, data blocks with importance scores greater than the importance threshold can be selected, or data blocks ranked before a predetermined position (e.g., the 10th) in descending order can be selected; no specific limitation is imposed here.
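The exponential smoothing score and the subsequent selection can be sketched as below, using the recurrence S_t = a*Y_t + (1-a)*S_(t-1) with S_0 = 0. The function names, the sample version sets, and the tie-breaking rule are assumptions made for illustration.

```python
def smoothed_importance(version_sets, a=0.8, category_set=frozenset()):
    """Importance score per block; version_sets is ordered earliest first."""
    # Blocks covered by the category deduplication set are excluded up front.
    blocks = set().union(*version_sets) - set(category_set)
    scores = {}
    for h in blocks:
        s = 0.0  # S_0 is initialized to 0
        for info_set in version_sets:
            y = 1.0 if h in info_set else 0.0  # Y_t is 1 if the block appears
            s = a * y + (1 - a) * s
        scores[h] = s
    return scores

def select_top(scores, threshold=None, top_k=None):
    """Select blocks above a score threshold and/or the top_k highest scores."""
    ranked = sorted(scores, key=lambda h: (-scores[h], h))
    if threshold is not None:
        ranked = [h for h in ranked if scores[h] > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked

# Hypothetical per-version information sets, earliest first.
versions = [{"old", "keep"}, {"keep"}, {"keep", "new"}]
scores = smoothed_importance(versions, a=0.8)
predicted = select_top(scores, top_k=2)
```

With a = 0.8, the block that appears only in the earliest version decays to 0.032 after two later versions, while a block that first appears in the latest version already scores 0.8.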
[0069] Similarly, for each of the other mirror images, a corresponding prediction deduplication set can be determined.
[0070] In step 205, a data block deduplication dictionary is constructed using the first category deduplication set and the first predicted deduplication set, for use in deduplication during the update process of the image file in subsequent versions of the first image.
[0071] It can be understood that the first category deduplication set contains data block information of frequently occurring data blocks in the image files of all images in the first image category. Therefore, it applies to all images in the first image category, including the first image. The first prediction deduplication set, on the other hand, contains data block information for the first image, based on predictions of data blocks with a high probability of reuse after excluding data block information from the first category deduplication set. Thus, since the first image belongs to the first image category, the first category deduplication set and the first prediction deduplication set corresponding to the first image can be merged to form the data block deduplication dictionary corresponding to the first image, denoted as the first data block deduplication dictionary. Similarly, each image can have its own data block deduplication dictionary. In practice, the category deduplication set and the prediction deduplication set may not be merged, but rather bound to the corresponding image through corresponding identifiers. For example, the first image category or the first category deduplication set, as well as the first prediction deduplication set, can be identified in the metadata of the first image.
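Both options described above, merging the two sets or binding them to the image through identifiers in its metadata, can be sketched as follows. The `ImageMetadata` fields and the identifier strings are hypothetical and only illustrate the binding idea.

```python
from dataclasses import dataclass

def build_dedup_dictionary(category_set, predicted_set):
    """Option 1: merge the two sets into one dictionary for the image."""
    return set(category_set) | set(predicted_set)

@dataclass
class ImageMetadata:
    name: str
    category_set_id: str   # identifies the category deduplication set
    predicted_set_id: str  # identifies the image's own predicted set

def resolve_dictionary(meta, category_sets, predicted_sets):
    """Option 2: keep the sets separate and look them up by identifier."""
    return category_sets[meta.category_set_id] | predicted_sets[meta.predicted_set_id]

category_sets = {"cat-1": {"h1", "h2"}}
predicted_sets = {"img-A": {"u1"}}
meta = ImageMetadata("img-A", category_set_id="cat-1", predicted_set_id="img-A")

merged = build_dedup_dictionary(category_sets["cat-1"], predicted_sets["img-A"])
resolved = resolve_dictionary(meta, category_sets, predicted_sets)
```

Keeping the sets separate means that one category deduplication set can be shared by every image in the category instead of being copied into each image's dictionary.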
[0072] The process illustrated in Figure 2 describes a scheme for determining the global deduplication dictionary for mirrored data, which can be executed multiple times. By executing this process, based on existing history and future predictions, potentially reusable data blocks in the mirror can be identified, and their corresponding data block information can be stored in the appropriate dataset.
[0073] The following example illustrates how clustering before determining the prediction deduplication set for each mirror image, and how determining the category deduplication set based on the cluster category, can reduce interference. For example, mirror image A and mirror image B belong to the same category. If deduplication prediction is performed directly on each mirror image, the final data block information set may contain some overlap between mirror images A and B. Without clustering, there might also be mirror image C, which may also have overlaps with data blocks from mirror images A and B. If frequency filtering is performed on this mirror image category first, some of the duplicate data can be identified. Performing deduplication prediction at this point will reduce the amount of duplicate data, thereby reducing the amount of data processing.
[0074] In particular, because this specification records the image using a storage method that combines data blocks and data block information, it should be noted that when the process illustrated in Figure 2 for determining the image deduplication dictionary is used for the first time, it can also check whether the image is stored in this format. If not, the image is first stored according to the data block and metadata storage format. For example, Figure 3 illustrates a specific process for determining a mirror deduplication dictionary.
[0075] As shown in Figure 3, in this specific process, "Start" indicates that the process is triggered; the triggering condition can be the arrival of a predetermined period, the arrival of a predetermined time point, or manual triggering. After the process starts, it first checks whether each image has been stored according to the storage format of data blocks and metadata. If not, the image is first converted to this storage format. The metadata includes the data block information of each data block. Once each image has been stored according to this format, the most recent iterations of each image are selected to form the data block information set of that image. The data block information set of a single image can merge identical items; that is, if multiple versions contain data block 'a', the data block information set can contain the data block information of data block 'a' only once. Then, based on the data block information sets, the distance between each pair of images is calculated, and clustering is performed based on the distances to divide the images into multiple categories.
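The distance calculation and grouping step can be sketched as follows. Two assumptions are made here: similarity is taken as the number of shared entries divided by the number of distinct entries in both sets (a Jaccard-style ratio), and the clustering is a simplified stand-in that links any two images whose distance does not exceed a threshold. It is not a density-based algorithm such as DBSCAN; the names and the threshold are hypothetical.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Distance = 1 - similarity, so higher similarity gives a smaller distance."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def cluster_images(info_sets, max_distance=0.5):
    """Group images whose pairwise distance is at most max_distance (union-find)."""
    parent = {name: name for name in info_sets}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for x, y in combinations(info_sets, 2):
        if jaccard_distance(info_sets[x], info_sets[y]) <= max_distance:
            parent[find(x)] = find(y)

    groups = {}
    for name in info_sets:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical merged information sets (identical items across versions merged).
images = {
    "A": {"h1", "h2", "h3"},
    "B": {"h1", "h2", "h4"},
    "C": {"x1", "x2"},
}
categories = cluster_images(images, max_distance=0.5)
```

Images A and B share two of four distinct entries (distance 0.5) and fall into one category, while C shares nothing with either and forms its own category.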
[0076] On one hand, for images of the same category, the frequency of occurrence of each data block can be statistically analyzed using a data block information set to filter out the first batch of data blocks with higher frequency, and the corresponding data block information can be used as the category deduplication set. On the other hand, for a single image, the future importance of a corresponding data block can be predicted. The prediction process is based on historical versions, with newer versions having higher data block importance. Thus, the most recent iterations can be selected, and the corresponding data block information can be obtained for each iteration, from which data block information in the corresponding category deduplication set can be filtered out. It is worth noting that data block information from each version must be used here; therefore, identical data blocks from different versions cannot be merged. Based on the frequency of occurrence of data blocks in each version, each data block can be scored, such as determining an importance score. Then, the data block information with higher scores is selected as the predicted deduplication set. For a single image, the category deduplication set and the predicted deduplication set corresponding to the corresponding category can be used together as a data block deduplication dictionary for storage and conversion of subsequent version files of that image.
[0077] Figure 4 shows a process for storing image files, in which a new version of the image is stored based on the data block deduplication dictionary determined by the process of Figure 2. The execution entity of the process illustrated in Figure 4 can be a computer, device, or server with a certain computing power, and it may be the same as or different from the entity executing the process shown in Figure 2.
[0078] As shown in Figure 4, the method for updating the image includes the following steps:
[0079] Step 401: Divide the target image file to be stored into multiple data blocks and determine the corresponding first data block information set.
[0080] As mentioned earlier, each update to a single image can be recorded as a new version. The target image can be any image that has been updated and requires data storage. For example, if a container finishes using its data and generates an update, the corresponding data can be stored.
[0081] First, the target image file to be stored can be divided into multiple data blocks, stored in a format that separates data and data block information. Here, "data" refers to the individual data blocks. Data segmentation can follow predetermined rules, such as allocating system configuration files to a single data block. Data block information uniquely describes the information of a single data block. This can be stored, for example, as metadata. Metadata describes the data to be stored in the target image, such as storage time and size. The metadata can include data block information corresponding to each data block. All data block information constitutes the data block information set of the target image's data to be stored, referred to here as the first data block information set. A single data block information entry can be, for example, a hash value obtained by hashing the data block.
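A minimal sketch of step 401 follows, under two assumptions that the text does not fix: the file is split into fixed-size blocks (the specification only says that segmentation follows predetermined rules), and the data block information is a SHA-256 hash. The function names and the sample bytes are hypothetical.

```python
import hashlib

def split_into_blocks(data: bytes, block_size: int = 4096):
    """Split the image file content into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def block_info(block: bytes) -> str:
    """Data block information: here, the SHA-256 hex digest of the block."""
    return hashlib.sha256(block).hexdigest()

def first_block_info_set(data: bytes, block_size: int = 4096):
    """Map each distinct hash to its block; the keys form the information set."""
    return {block_info(b): b for b in split_into_blocks(data, block_size)}

# Hypothetical 16-byte "image file" split into 4-byte blocks.
data = b"A" * 8 + b"B" * 4 + b"A" * 4
info = first_block_info_set(data, block_size=4)
```

The four blocks collapse to two distinct hashes, which is why working on the information set is cheaper than comparing the blocks themselves.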
[0082] Step 402: Match the first data block information set with the elements in the data block deduplication dictionary of the target image.
[0083] The target image can be any container image. The data block deduplication dictionary for the target image can include a second-category deduplication set and a second-predicted deduplication set. Here, "second" corresponds to the target image and is distinguished from the first image, first-category deduplication set, and first-predicted deduplication set mentioned earlier, without any other substantial limitations. Due to the arbitrariness of the target image, it may also be the first image described earlier; in this case, "first..." and "second..." are consistent.
[0084] The second category deduplication set can be the category deduplication set corresponding to the second mirror category, that is, the category determined for the target mirror by clustering, while the second predicted deduplication set can be a predicted deduplication set obtained from the historical versions of the target mirror after excluding the data block information in the second category deduplication set. The second category deduplication set and the second predicted deduplication set can be predetermined through the process shown in Figure 2, which will not be described in detail here.
[0085] The process of matching the first data block information set with the elements in the target mirror's data block deduplication dictionary can also be viewed as the process of retrieving elements from the first data block information set from the data block deduplication dictionary. Its purpose is to discover reusable data blocks within the first data block information set. Specifically, if a corresponding element (such as 'a') from the first data block information set is matched in the data block deduplication dictionary, the corresponding data block (such as data block A corresponding to data block information 'a') is a reusable data block.
[0086] When the data block information is a hash value, the matching process can be a numerical comparison process. The matching result can be that all or part of the elements of the first data block information set are included in the data block deduplication dictionary, or that no elements of the first data block information set are included in the data block deduplication dictionary.
[0087] Step 403: Based on the matching results, store the image file to be stored.
[0088] Specifically: if some elements in the first data block information set match the data block deduplication dictionary, exclude the data blocks corresponding to the matching elements from the multiple data blocks obtained by splitting the image file to be stored, and store the other data blocks; if all elements in the first data block information set have matching elements in the data block deduplication dictionary, the multiple data blocks obtained by splitting the image file to be stored do not need to be stored; if none of the elements in the first data block information set are in the data block deduplication dictionary, all the multiple data blocks obtained by splitting the image file to be stored can be stored.
[0089] Understandably, in order to continuously adapt the data block deduplication dictionary to current deduplication requirements, the process for determining the image deduplication dictionary, such as that shown in Figure 2, can be executed for each mirror at predetermined time intervals. Therefore, to more accurately predict reusable data blocks based on historical versions, when storing new version data of the image, regardless of whether new data blocks need to be stored, the data block information of all data blocks obtained by splitting the image file to be stored can be stored in the metadata of the latest version of the target image.
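Steps 402 and 403, together with the metadata recording just described, can be sketched as follows. The in-memory `block_store` and `version_metadata` dictionaries are hypothetical stand-ins for real storage, and the sketch assumes that blocks matched in the dictionary are already available elsewhere in storage.

```python
def store_new_version(blocks_by_info, dedup_dictionary, block_store, version_metadata):
    """Store only unmatched blocks, but always record the full information set."""
    # Step 402: match the first data block information set against the dictionary.
    matched = set(blocks_by_info) & set(dedup_dictionary)
    # Step 403: the partial, full, and no-match cases all reduce to
    # "store every block whose information was not matched".
    for info, block in blocks_by_info.items():
        if info not in matched:
            block_store[info] = block
    # Record all block information for this version, whether or not blocks were stored.
    version_metadata["block_info"] = sorted(blocks_by_info)
    return matched

dictionary = {"a", "b"}
store, meta = {}, {}
matched = store_new_version({"a": b"AA", "c": b"CC"}, dictionary, store, meta)
```

Recording every hash in the version metadata keeps the history complete, so a later run of the dictionary determination process can still count blocks that were deduplicated at storage time.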
[0090] Reviewing the above process, the technical solution provided in this specification stores the image in a format that separates data and data block information. The data is divided into multiple data blocks, and the unique description information corresponding to each data block is stored in the metadata. Thus, in the process of determining the image deduplication dictionary, the metadata, which has a relatively small data volume, can be used to analyze data blocks, reducing the amount of data processing. Furthermore, on the one hand, the image is clustered using the data block information in the metadata, thereby mining category-based deduplication sets. On the other hand, after excluding the data block information corresponding to the category-based deduplication sets, for a single image, based on the data block information in historical versions, potentially reusable data blocks are predicted, forming a predictive deduplication set. Further, the category-based deduplication set and the predictive deduplication set constitute the image's deduplication dictionary. Therefore, when the image is updated, the data to be stored is processed according to the data block and metadata formats, and the data block information contained in the metadata is compared with the image's deduplication dictionary. This allows for the discovery of reusable data blocks with less data processing, avoiding redundant storage of reusable data blocks and improving the efficiency of image storage and usage.
[0091] According to another embodiment, a device for determining a mirror deduplication dictionary is also provided. This device can be installed on any device, apparatus, or server with a certain computing capability. Figure 5 shows an embodiment of a mirror deduplication dictionary determination apparatus 500. As shown in Figure 5, the device 500 includes:
[0092] The acquisition unit 501 is configured to acquire the data block information sets corresponding to several images, including the first image;
[0093] Clustering unit 502 is configured to cluster several mirrors based on the similarity of data block information sets to obtain at least one mirror category, wherein the first mirror corresponds to the first mirror category in the at least one mirror category;
[0094] Mining unit 503 is configured to determine the corresponding first category deduplication set based on the occurrence frequency of each data block for the first mirror category. The occurrence frequency of a single data block is determined based on the data block information in the data block information set. If the occurrence frequency of a single data block in at least one mirror corresponding to the first mirror category meets the first predetermined condition, the corresponding data block information is added to the first category deduplication set.
[0095] Prediction unit 504 is configured to determine a first prediction deduplication set for the first mirror image based on the importance of other data blocks, excluding the data blocks indicated by the first category deduplication set;
[0096] Construction unit 505 is configured to construct a data block deduplication dictionary using the first category deduplication set and the first predicted deduplication set, for use in deduplication during the image file update process of subsequent versions of the first image.
[0097] According to another embodiment, an apparatus for storing image files is also provided. This apparatus can be located in any device, apparatus, or server with a certain computing capability. For example, it can be located together with apparatus 500 in an image center. Figure 6 shows an embodiment of an image file storage device 600. As shown in Figure 6, the device 600 includes:
[0098] The processing unit 601 is configured to divide the target image's image file to be stored into multiple data blocks and determine a corresponding first data block information set, wherein the first data block information set includes each data block information that uniquely describes each data block in a predetermined manner.
[0099] The matching unit 602 is configured to match the first data block information set with the elements in the data block deduplication dictionary of the target image. The data block deduplication dictionary includes: a second category deduplication set based on the second image category corresponding to the target image, and a second predicted deduplication set determined based on the historical version of the target image.
[0100] Storage unit 603 is configured to store the image file to be stored based on the matching result.
[0101] It is worth noting that the devices 500 and 600 shown in Figure 5 and Figure 6 correspond to the methods described in Figure 2 and Figure 4, respectively. The corresponding descriptions in the method embodiments of Figure 2 and Figure 4 also apply to devices 500 and 600, and will not be repeated here.
[0102] According to another embodiment, a computer-readable storage medium is also provided, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method described in conjunction with Figure 2 or Figure 4.
[0103] According to another embodiment, a computing device is also provided, including a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, it implements the method described in conjunction with Figure 2 or Figure 4.
[0104] Those skilled in the art will recognize that the functions described in the embodiments of this specification in one or more of the above examples can be implemented using hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
[0105] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the technical concept in this specification. It should be understood that the above description is only a specific embodiment of the technical concept in this specification and is not intended to limit the scope of protection of the technical concept in this specification. Any modifications, equivalent substitutions, improvements, etc., made on the basis of the technical solutions of the embodiments in this specification should be included within the scope of protection of the technical concept in this specification.
Claims
1. A method for determining a mirror-based deduplication dictionary, wherein a single image file is stored in the form of at least one data block, each data block corresponds to data block information uniquely determined according to a predetermined method, and the data block information is recorded in the data block information set of the corresponding image file, the method comprising: obtaining the data block information sets respectively corresponding to several images, including a first image; clustering the several images based on the similarity of the data block information sets to obtain at least one image category, wherein the first image corresponds to a first image category among the at least one image category; for the first image category, determining a corresponding first category deduplication set based on the occurrence frequency of each data block, wherein the occurrence frequency of a single data block is determined based on the data block information in the data block information sets corresponding to the first image category, and if the occurrence frequency of a single data block in at least one image corresponding to the first image category meets a first predetermined condition, the corresponding data block information is added to the first category deduplication set; with the data blocks indicated by the first category deduplication set excluded, determining a first predictive deduplication set for the first image based on the importance of the other data blocks; and constructing a first data block deduplication dictionary using the first category deduplication set and the first predictive deduplication set, for use in deduplication during the update of the image file of subsequent versions of the first image.
2. The method according to claim 1, wherein the first image corresponds to multiple versions of image files, and determining the first predictive deduplication set for the first image based on the importance of other data blocks includes: determining the importance score of each data block according to the time order of the versions, based on the subset of data block information corresponding to each version of the image file; and selecting, based on the magnitude of each importance score, the corresponding data block information and adding it to the corresponding predictive deduplication set.
3. The method according to claim 2, wherein determining the importance score of each data block according to the time order of the versions includes: for a single data block, based on the exponential smoothing formula, iteratively weighting the frequency of occurrence of the corresponding data block information in each version according to the version time order to obtain the corresponding importance score, wherein, in the iterative weighting process, the initial score is 0, the weight of the accumulated score is 'a', and the weight of the current version is 1-a, where 'a' is a number greater than 0.5 and less than 1; and if the data block appears in the current version's image file, its frequency of occurrence is 1, and otherwise it is 0.
4. The method according to claim 1, wherein the predetermined method is as follows: a single data block is processed according to a predetermined hash algorithm, and the corresponding data block information is the hash value obtained after processing.
5. The method according to claim 1, wherein, when the first image corresponds to multiple versions of image files, obtaining the data block information sets corresponding to the several images, including the first image, includes: obtaining the data block information sets corresponding to the latest multiple versions of the first image.
6. The method according to claim 1, wherein the similarity between a pair of data block information sets is determined by the ratio of the number of identical data block information entries to the total number of data block information entries.
7. The method according to claim 1, wherein clustering the several mirror images based on the similarity of the data block information sets to obtain at least one mirror image category includes: determining the distance between each pair of mirror images based on the similarity of the data block information sets, wherein the similarity between each pair of data block information sets is negatively correlated with the distance between the corresponding pair of mirror images; and clustering the mirror images using a distance-related clustering method, based on the distance between each pair of mirror images, to obtain at least one mirror image category.
8. The method according to claim 7, wherein the clustering method is a density-based clustering method, implemented using one of the following algorithms: DBSCAN, OPTICS, or DENCLUE.
9. A method for storing an image file, comprising: dividing the image file to be stored of a target image into multiple data blocks, and determining a corresponding first data block information set, wherein the first data block information set includes the data block information that uniquely describes each data block in a predetermined manner; matching the first data block information set with the elements in the data block deduplication dictionary of the target image, wherein the data block deduplication dictionary includes: a second category deduplication set based on the second image category corresponding to the target image, and a second predicted deduplication set determined based on the historical versions of the target image; and storing the image file to be stored based on the matching result.
10. The method according to claim 9, wherein storing the image file to be stored based on the matching result includes: if a matching element exists, filtering out the data block indicated by the matching element from the plurality of data blocks and storing the remaining data blocks; otherwise, storing all of the data blocks.
11. The method according to claim 9, wherein storing the image file to be stored based on the matching result includes: storing the first data block information set in correspondence with the latest version of the target image file.
12. A device for determining a mirror-based deduplication dictionary, wherein a single image file is stored in the form of at least one data block, each data block corresponds to data block information uniquely determined according to a predetermined method, and the data block information is recorded in the data block information set of the corresponding image file, the apparatus comprising: an acquisition unit configured to acquire the data block information sets respectively corresponding to several images, including a first image; a clustering unit configured to cluster the several images based on the similarity of the data block information sets to obtain at least one image category, wherein the first image corresponds to a first image category among the at least one image category; a mining unit configured to determine, for the first image category, a corresponding first category deduplication set based on the occurrence frequency of each data block, wherein the occurrence frequency of a single data block is determined based on the data block information in the data block information sets corresponding to the first image category, and if the occurrence frequency of a single data block in at least one image corresponding to the first image category meets a first predetermined condition, the corresponding data block information is added to the first category deduplication set; a prediction unit configured to determine, with the data blocks indicated by the first category deduplication set excluded, a first prediction deduplication set for the first image based on the importance of the other data blocks; and a construction unit configured to construct a data block deduplication dictionary using the first category deduplication set and the first prediction deduplication set, for use in deduplication during the update of the image file of subsequent versions of the first image.
13. An apparatus for storing image files, comprising: a processing unit configured to divide the image file to be stored of a target image into multiple data blocks and determine a corresponding first data block information set, wherein the first data block information set includes the data block information that uniquely describes each data block in a predetermined manner; a matching unit configured to match the first data block information set with the elements in the data block deduplication dictionary of the target image, wherein the data block deduplication dictionary includes: a second category deduplication set based on the second image category corresponding to the target image, and a second predicted deduplication set determined based on the historical versions of the target image; and a storage unit configured to store the image file to be stored based on the matching result.
14. A computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of claims 1-11.
15. A computing device, comprising a memory and a processor, characterized in that the memory stores executable code, and when the processor executes the executable code, it implements the method of any one of claims 1-11.