Data compression method and apparatus, and computer device

The data compression method uses a recommendation record to apply existing coding rules directly or estimate new ones when necessary, addressing inefficiencies in database compression by reducing computational costs and enhancing resource utilization.

JP7883360B2 · Active · Publication Date: 2026-07-01 · BEIJING OCEANBASE TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: BEIJING OCEANBASE TECHNOLOGY CO LTD
Filing Date: 2021-06-29
Publication Date: 2026-07-01

AI Technical Summary

Technical Problem

Existing data compression methods in databases require significant computational resources to select appropriate compression coding rules, leading to inefficiencies in CPU and memory usage, especially in hybrid row and column coding scenarios.

Method used

A data compression method utilizing a recommendation record to store compression coding rules and their corresponding ratios for previously compressed objects, allowing direct application of suitable rules if available, or initiating a normal process to estimate rules if not, thereby reducing the need for repeated calculations.

Benefits of technology

This approach significantly improves compression efficiency by eliminating the need to recalibrate coding rules for each compression, optimizing resource usage and reducing computational overhead.

✦ Generated by Eureka AI based on patent content.


Abstract

To provide a data compression method and apparatus, and a computer device. SOLUTION: A data compression method comprises: a step 104 of searching a recommendation record to determine whether a recommended compression coding rule that satisfies a compression ratio condition exists; a step of compressing an object to be compressed by using the recommended compression coding rule if such a rule exists; and, if no recommended compression coding rule satisfies the compression ratio condition, a step of starting a normal compression coding process to obtain estimated compression ratios of a plurality of compression coding rules for the object to be compressed, selecting a target compression coding rule based on at least the estimated compression ratios, and compressing the object to be compressed by using the target compression coding rule. SELECTED DRAWING: Figure 1

Description

Technical Field

[0001] This specification relates to the field of data processing technology, and in particular, to data compression methods and apparatuses, as well as computer devices.

Background Art

[0002] Data compression can be understood as representing more information with less coding. When compressing data, it is necessary to select appropriate compression coding rules to obtain better compression efficiency. Currently, there are more and more compression coding rules, and a relatively large computational cost needs to be paid to select appropriate compression coding rules.

Summary of the Invention

Means for Solving the Problems

[0003] To mitigate problems in related technologies, this specification provides a data compression method and apparatus, as well as a computer device.

[0004] A data compression method is provided, according to a first aspect of the implementation of this specification, which includes the steps of: obtaining an object to be compressed; searching a recommendation record to determine whether there is a recommended compression coding rule that satisfies compression ratio conditions, wherein the recommendation record is used to record compression coding rules and corresponding compression ratio information for previously compressed objects, and the previously compressed objects are of the same type as the object to be compressed; compressing the object to be compressed by using the recommended compression coding rule if there is a recommended compression coding rule that satisfies compression ratio conditions; and, if there is no recommended compression coding rule that satisfies compression ratio conditions, initiating a normal compression coding process to obtain estimated compression ratios for multiple compression coding rules for the object to be compressed, selecting a target compression coding rule based on at least the estimated compression ratios, and compressing the object to be compressed by using the target compression coding rule.

[0005] Optionally, the object to be compressed includes a data unit to be compressed that is obtained by dividing the data to be compressed, and the previously compressed object includes other, previously compressed data units obtained by dividing the same data.

[0006] Optionally, the object to be compressed includes a data unit to be compressed that is obtained by dividing the data to be compressed within a data table, and the previously compressed object includes previously compressed data units corresponding to that data table.

[0007] Optionally, objects to be compressed include columns of data to be compressed in the data table, and previously compressed objects include previously compressed data in the same columns as the data to be compressed in the data table.

[0008] Optionally, the object to be compressed includes a copy of a data table, and the previously compressed object includes the main data table corresponding to that copy.

[0009] Optionally, the compression ratio condition includes at least the compression ratio information of the recommended compression coding rule being higher than the specified threshold.

[0010] Optionally, the method further includes the step of updating the recommendation record, after the object to be compressed has been compressed, based on the compression coding rule used for the object to be compressed.

[0011] Optionally, the step of updating the recommendation record based on the compression coding rules used for the object to be compressed includes the step of obtaining the actual compression ratio of the object to be compressed and the step of updating the recommendation record based on at least the actual compression ratio and the compression coding rules used for the object to be compressed.

[0012] Optionally, the method further includes the step of obtaining access requirement information for an object to be compressed, wherein the access requirement information relates to the decompression efficiency of the object to be compressed, and the compression ratio condition includes that the compression ratio information of a recommended compression coding rule matches the access requirement information.

[0013] Optionally, the step of selecting a target compression coding rule based on at least an estimated compression ratio includes the step of selecting a target compression coding rule based on an estimated compression ratio and access requirements information.

[0014] Optionally, access requirements information for objects to be compressed can be determined by retrieving historical access data for one or more objects to be compressed or objects that have been compressed in the past.

[0015] Optionally, past access data may include past access frequency.

[0016] Optionally, a step to update the recommendation record based on at least the actual compression ratio and the compression coding rules used for the object being compressed includes a step to update the recommendation record based on the actual compression ratio, access requirements information, and the compression coding rules used for the object being compressed.

[0017] Optionally, the compression ratio information for a compression coding rule includes a confidence level indicating the actual compression ratio of the compression coding rule, and the step of updating the recommendation record based on the actual compression ratio, the access requirement information, and the compression coding rule used for the object to be compressed includes: if the compression coding rule used for the object to be compressed is the recommended compression coding rule recorded in the recommendation record, increasing the confidence level of the recommended compression coding rule when the actual compression ratio of the object to be compressed matches the access requirement information, and decreasing the confidence level when it does not match; and, if the compression coding rule used for the object to be compressed differs from the recommended compression coding rule recorded in the recommendation record and the actual compression ratio of the object to be compressed matches the access requirement information, replacing the recommended compression coding rule in the recommendation record with the compression coding rule used for the object to be compressed.
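The three update cases in paragraph [0017] can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Recommendation` class, the integer confidence counter, the `step` size, and the choice to restart the confidence when a rule is replaced are all assumptions made for the example. The text does not say what happens when a different rule also fails to match, so the sketch leaves the record unchanged in that case.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    rule: str        # name of the recommended compression coding rule
    confidence: int  # hypothetical integer confidence counter


def update_recommendation(rec, used_rule, matches_requirement, step=1):
    """Apply the update cases of paragraph [0017] to one recommendation entry."""
    if used_rule == rec.rule:
        # The recommended rule was used: reward a match, penalize a mismatch.
        rec.confidence += step if matches_requirement else -step
    elif matches_requirement:
        # A different rule met the requirement: it replaces the recommendation.
        rec.rule = used_rule
        rec.confidence = step  # assumption: a replaced rule starts with fresh confidence
    # A different rule that does not match: no action is described, so nothing changes.
    return rec
```

For example, starting from `Recommendation("dictionary", 3)`, a matching compression with the same rule raises the confidence to 4, while a matching compression with a different rule swaps that rule into the record.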

[0018] According to a second aspect of the implementation of this specification, a data compression device is provided, the device including: an acquisition module configured to acquire an object to be compressed; a search module configured to search a recommendation record to determine whether there is a recommended compression coding rule that satisfies compression ratio conditions, wherein the recommendation record is used to record the compression coding rule and corresponding compression ratio information of a previously compressed object, and the previously compressed object is of the same type as the object to be compressed; a first compression module configured to compress the object to be compressed by using a recommended compression coding rule if there is a recommended compression coding rule that satisfies compression ratio conditions; and a second compression module configured to, if there is no recommended compression coding rule that satisfies compression ratio conditions, start a normal compression coding process to acquire the estimated compression ratios of multiple compression coding rules for the object to be compressed, select a target compression coding rule based on at least the estimated compression ratios, and compress the object to be compressed by using the target compression coding rule.

[0019] According to a third aspect of the implementation of this specification, a computer device is provided which includes memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above-described implementation of the data compression method when the program is executed.

[0020] The technical solutions provided in the implementations described herein can yield the following beneficial effects:

[0021] In the implementation form of this specification, the compression coding rules of previously compressed objects and the corresponding compression ratio information are recorded in the recommendation record, so the recommendation record can be searched, based on the compression ratio condition, to determine whether an appropriate compression coding rule exists. If there is no appropriate compression coding rule, a normal compression coding process is started to compress the object to be compressed. If there is an appropriate compression coding rule, it can be used directly for compression. In that case there is no need to spend time and resources calculating an appropriate compression coding rule for the current object to be compressed, which greatly improves compression efficiency.

[0022] It should be understood that the above general description and the following detailed description are only examples for explanation and do not limit this specification.

[0023] The accompanying drawings incorporated in this specification and constituting a part of this specification show implementation forms consistent with this specification and, together with this specification, serve to explain the principles of this specification.

Brief Description of the Drawings

[0024]
[Figure 1] A flowchart showing a data compression method according to an exemplary implementation form of this specification.
[Figure 2] A schematic diagram showing a data compression method according to an exemplary implementation form of this specification.
[Figure 3] A hardware structure diagram showing a computer device in which a data compression device according to an exemplary implementation form of this specification is located.
[Figure 4] A block diagram showing a data compression device according to an exemplary implementation form of this specification.

Modes for Carrying Out the Invention

[0025] Embodiments are described in detail in this specification, and examples thereof are presented in the accompanying drawings. When the following description refers to the accompanying drawings, unless otherwise specified, the same numbers in different accompanying drawings represent the same or similar elements. The implementation forms described in the following exemplary embodiments do not represent all implementation forms that conform to this specification. On the contrary, the implementation forms are merely examples of devices and methods that are described in detail in the appended claims and that conform to some aspects of this specification.

[0026] The terms used in this specification are for the purpose of merely describing specific implementation forms and are not intended to limit this specification. The singular terms "a" and "the" used in this specification and the appended claims are intended to include the plural as well, unless clearly specified otherwise in the context. The term "and / or" used in this specification indicates any or all possible combinations of one or more of the associated listed items and should be further understood to include them.

[0027] In this specification, terms such as "first", "second", and "third" may be used to describe various types of information, but it should be understood that the information is not limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, the first information can also be called the second information, and similarly, the second information can also be called the first information. Depending on the context, the word "if" used in this specification can be interpreted as "at the time of", "when", or "in response to determining".

[0028] In relational databases, data query processing and data storage are always two core elements. Generally, data query processing requires more central processing unit (CPU) and memory resources, while data storage primarily consumes storage hardware resources such as disks. However, currently, the development of storage hardware is not keeping pace with the development of other system hardware. The rate at which storage hardware space increases cannot always meet the storage requirements of users who want to store a wider range of data for longer periods. Furthermore, the access latency of storage hardware is currently at least an order of magnitude slower than that of the CPU and memory. In conclusion, space and access latency are two major problems hindering the rapid development of database storage.

[0029] To mitigate the above problems, database vendors across the industry are actively adopting general data compression techniques, compressing data at the page / block level according to each database's storage layout. This not only significantly reduces storage space usage but also improves storage access latency by decreasing the amount of data input / output (IO) during extensive data scans and queries. The cost is the additional CPU and memory consumed during compression and decompression. From a database-wide perspective, adopting compression improves the overall balance of resource usage, making it possible to meet more stringent service requirements under the same hardware conditions.

[0030] Compression can be understood as representing more information through less coding. A common compression method in hybrid compression of database rows and columns is to compress different data using different coding rules based on data distribution.

[0031] The theoretical limit of data compression is the information entropy of the data; that is, the more information about the data is available, the higher the final compression ratio. General data compression techniques, to support a wider range of use scenarios, typically only build a sliding dictionary over the byte stream for coding and use contextual information within a limited range. Relational databases, however, naturally have richer background knowledge about the data they store. For example, the type of a data column can be used to predict the range and distribution of its values, and data in the same column is more strongly clustered. Therefore, more and more database vendors divide data into compression units, use a column storage format within each compression unit, and use contextual relationships within columns and data relationships between columns to find the best-matching coding rule, such as dictionary (lexicographical) coding, run-length coding, and delta (incremental) coding. Compared with general compression, these coding rules can provide a higher compression ratio by using the information embedded in the data, and the resulting storage formats also make it possible for the database to query the coded data directly. In particular, in databases with a log-structured merge tree (LSM tree) architecture, stored data needs to be continuously compressed, and since the compression operation involves rewriting the entire data, it is natural to add hybrid row and column coding compression to the data reshuffling process.

[0032] While databases that internally support hybrid row and column coding compression can achieve higher compression ratios, this comes with the complex problem of selecting and discovering compression coding rules. The database needs to traverse column data across all compression units to extract summary data, evaluate the final compression size based on different compression rules, and ultimately determine the selected coding rule. The entire coding rule discovery process requires significant pre-computation and matching comparisons. Furthermore, currently, compression coding rules are typically only discovered for all data contexts within the same compression unit. As the scope expands, the complexity of discovery increases exponentially.

[0033] Coding compression in each database is essentially implemented on a hybrid row and column storage model. A certain number of data record tuples are selected as one compression unit, and all the data within the compression unit is stored column by column. The database performs a traversal scan of each column to obtain the basic distribution and characteristics of the data, and then selects an appropriate coding rule based on those characteristics. Different databases implement different coding rules; these generally include dictionary (lexicographical) coding, run-length coding, delta (incremental) coding, numeric coding, and some open-source general compression coding. In this approach, selecting a compression coding rule is relatively slow. Before the database settles on the rule for a data column, it must go through several stages: data scan analysis, pre-calculation of the compression ratios of different coding rules, selection of a coding rule, and data compression. Each stage incurs a relatively large computational cost.
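To make the cost of this conventional selection process concrete, the sketch below scans one integer column, pre-calculates an estimated encoded size for run-length, dictionary (lexicographical), and delta (incremental) coding, and selects the rule with the best estimated ratio. The size formulas, the 64-bit value width, and the 32-bit run counter are simplifying assumptions for illustration; a real database would use its own storage formats and estimators.

```python
from itertools import groupby

VALUE_BITS = 64  # assumption: each raw integer is stored in 64 bits
COUNT_BITS = 32  # assumption: each run length is stored in 32 bits


def estimate_bits(column):
    """Scan one integer column and estimate the encoded size in bits for each rule."""
    if not column:
        raise ValueError("column must not be empty")
    n = len(column)
    runs = sum(1 for _ in groupby(column))           # runs of equal adjacent values
    distinct = len(set(column))                      # dictionary entries needed
    code_bits = max(1, (distinct - 1).bit_length())  # bits per dictionary code
    deltas = [b - a for a, b in zip(column, column[1:])]
    # Width of the largest delta magnitude, plus one sign bit.
    delta_bits = max((abs(d) for d in deltas), default=0).bit_length() + 1
    return {
        "run_length": runs * (VALUE_BITS + COUNT_BITS),
        "dictionary": distinct * VALUE_BITS + n * code_bits,
        "delta": VALUE_BITS + (n - 1) * delta_bits,
    }


def select_rule(column):
    """Pick the rule with the highest estimated ratio (raw bits / encoded bits)."""
    raw_bits = len(column) * VALUE_BITS
    estimates = {rule: raw_bits / bits for rule, bits in estimate_bits(column).items()}
    best = max(estimates, key=estimates.get)
    return best, estimates[best]
```

Under these assumptions, a strictly increasing column such as `list(range(1000, 1100))` favors delta coding, a constant column favors run-length coding, and a column alternating between two distant values favors dictionary coding. Every call repeats the full scan and all three estimates, which is the cost the recommendation record is meant to avoid.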

[0034] With this in mind, the implementation described herein provides a data compression solution built around a recommendation record. The recommendation record is used to record the compression coding rules and corresponding compression ratio information of previously compressed objects, which are of the same type as the object to be compressed. When an object is to be compressed, the recommendation record can first be searched, based on the compression ratio conditions, to determine whether a suitable compression coding rule exists. If no suitable rule exists, the normal compression coding process is initiated to compress the object. If a suitable rule exists, it can be used directly for compression. In that case there is no need to spend time and resources calculating a suitable compression coding rule for the current object to be compressed, which significantly improves compression efficiency.

[0035] Implementations of this specification are described in detail below. Figure 1 is a flowchart of a method according to an exemplary implementation of this specification. The method includes the following steps.

[0036] Step 102: Obtain the object to be compressed.

[0037] Step 104: Search the recommendation record to determine whether there is a recommended compression coding rule that satisfies the compression ratio conditions. The recommendation record is used to record the compression coding rules and corresponding compression ratio information of previously compressed objects, and the previously compressed objects are of the same type as the object to be compressed.

[0038] Step 106: If a recommended compression coding rule exists that satisfies the compression ratio conditions, compress the object to be compressed using that recommended compression coding rule.

[0039] Step 108: If no recommended compression coding rule exists that satisfies the compression ratio conditions, start the normal compression coding process to obtain the estimated compression ratios of multiple compression coding rules for the object to be compressed, select a target compression coding rule based on at least the estimated compression ratios, and compress the object to be compressed by using the target compression coding rule.
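Steps 102 to 108 can be sketched as a single function. This is a minimal illustration under stated assumptions: the general-purpose codecs from the Python standard library stand in for the compression coding rules, the recommendation record is a plain dictionary keyed by a hypothetical object type string, the ratio information is expressed as the fraction of space saved so that a higher value is better, and the normal process estimates each rule on a leading sample. None of these choices come from the specification.

```python
import bz2
import lzma
import zlib

# Stand-in "compression coding rules": general-purpose codecs, used only for illustration.
RULES = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def space_saving(original, compressed):
    """Ratio information assumed here: fraction of space saved (higher is better)."""
    return 1.0 - len(compressed) / len(original) if original else 0.0


def estimate_savings(data, sample_size=4096):
    """Normal process: estimate every rule by compressing a leading sample."""
    sample = data[:sample_size]
    return {name: space_saving(sample, fn(sample)) for name, fn in RULES.items()}


def compress_object(data, object_type, record, threshold=0.5):
    """Steps 102-108 for one object; `record` maps an object type to its last rule."""
    entry = record.get(object_type)                      # Step 104: search the record
    if entry is not None and entry["saving"] > threshold:
        rule, source = entry["rule"], "recommended"      # Step 106: reuse directly
    else:
        estimates = estimate_savings(data)               # Step 108: normal process
        rule, source = max(estimates, key=estimates.get), "estimated"
    compressed = RULES[rule](data)
    # Optional update from the summary: store the rule and its actual result.
    record[object_type] = {"rule": rule, "saving": space_saving(data, compressed)}
    return compressed, rule, source
```

Calling `compress_object` twice with the same `object_type` and an initially empty `record` runs the estimation only on the first call, provided the first result clears the threshold; the second call reuses the recorded rule without trying the other codecs.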

[0040] This implementation is applicable to various data compression scenarios, and the object to be compressed can be of various types, such as data tables, video files, audio files, or images. No restrictions are imposed in this implementation. A previously compressed object is an object of the same type as the object to be compressed that was compressed before it. What "of the same type" means can vary with the object to be compressed, and multiple implementations are possible. For example, the object to be compressed and the previously compressed object may be data generated at different times and of different versions, such as a program source file. Alternatively, the object to be compressed and the previously compressed object may be data of the same format, such as video format data. Alternatively, the object to be compressed and the previously compressed object may be child data belonging to the same parent data, such as multiple child data obtained by splitting the original data, or multiple data units belonging to the same data table.

[0041] For example, suppose Data-A1 needs to be compressed now. During the initial compression, no previously compressed objects of the same type as the object to be compressed can be referenced, the recommendation record is empty, and there are no recommended compression coding rules that satisfy the compression ratio conditions. In this case, the normal compression coding process is initiated to obtain estimated compression ratios for multiple compression coding rules for the object to be compressed, a target compression coding rule is selected based on at least the estimated compression ratios, and the object to be compressed is compressed by using the selected target compression coding rule.

[0042] After the initial compression is complete, the recommendation record is updated based on the compression coding rule used for Data-A1. For example, the current compression can be recorded in the recommendation record. Specifically, the target compression coding rule and the compression ratio information for the target compression coding rule can be written to the recommendation record. In some cases, the estimated compression ratio of the compression coding rule for the object to be compressed is not significantly different from the actual compression ratio, so the compression ratio information may be information about the estimated compression ratio. In other cases, the estimated compression ratio may differ to some extent from the actual compression ratio, which may be lower or higher than the estimate. Therefore, in this implementation, the actual compression ratio of the object to be compressed can also be obtained, and the recommendation record is updated based on at least the actual compression ratio and the compression coding rule used for the object to be compressed. In this case, the compression ratio information may be information about the actual compression ratio. The actual compression ratio can be calculated after Data-A1 has been compressed; for example, it is determined from the ratio of the actual size of the compressed object to the original size of the object to be compressed. The compression ratio information can be the compression ratio value itself, or related information obtained by converting the compression ratio using a specified method as needed.
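As a small worked example of this paragraph, the sketch below computes the actual compression ratio as compressed size over original size, as the text describes, and then converts it into a "space saved" figure before writing it to the record. The conversion is only one possible "specified method" and is an assumption here, chosen so that a larger value means better compression; the function names and the dictionary layout of the record are likewise hypothetical.

```python
def actual_compression_ratio(original_size, compressed_size):
    """Compressed size divided by original size, as in the text (lower means smaller output)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return compressed_size / original_size


def to_space_saving(ratio):
    """One possible conversion: fraction of space saved, so that higher is better."""
    return 1.0 - ratio


def record_compression(record, object_type, rule, original_size, compressed_size):
    """Write the rule and its converted ratio information to the recommendation record."""
    ratio = actual_compression_ratio(original_size, compressed_size)
    record[object_type] = {"rule": rule, "ratio_info": to_space_saving(ratio)}
    return record[object_type]
```

For an object of 1000 bytes compressed to 250 bytes, the ratio is 0.25 and the stored ratio information is 0.75, which can then be compared against a "higher than threshold" condition.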

[0043] Subsequently, new data, Data-A2, is generated from Data-A1, for example because new data is inserted into Data-A1. In the current compression process, Data-A2 is the new object to be compressed, and it is of the same type as Data-A1. Data-A1 serves as the previously compressed object; its compression is described above.

[0044] In the compression process for Data-A2, a compression coding rule can be selected by referencing the compression of Data-A1. For example, a compression coding rule for Data-A1 and its corresponding compression ratio information are recorded in a recommendation record, and based on the compression ratio conditions, it can be decided whether to select the compression coding rule for Data-A1 for coding. Once a compression coding rule for Data-A1 is selected, Data-A2 can be directly compressed by using that compression coding rule. The compression ratio conditions in this implementation can be flexibly set as needed. For example, some objects to be compressed have specific compression ratio requirements, and different compression coding rules will achieve different compression effects on the compressed objects. In this implementation, compression ratio conditions can be set to select an appropriate recommended compression coding rule. For example, the compression ratio condition may include that the compression ratio information of the recommended compression coding rule is higher than a specified threshold. Specific thresholds can be flexibly configured based on actual needs. No restrictions are imposed in this implementation.

[0045] In the Data-A2 compression process, the compression coding rules of previously compressed objects can be selected for compression. This eliminates the need to spend time and resources calculating appropriate compression coding rules for the current object being compressed, thereby significantly improving compression efficiency.

[0046] Of course, if the compression coding rule for Data-A1 does not satisfy the compression ratio conditions for Data-A2, the normal compression coding process can be initiated for Data-A2 as well: estimated compression ratios of multiple compression coding rules are obtained for the object to be compressed, a target compression coding rule is selected based on at least the estimated compression ratios, and the object to be compressed is compressed by using the target compression coding rule.

[0047] After Data-A2 is compressed, the recommendation record can be further updated based on the compression of Data-A2. For example, in a real service, new data has been inserted into Data-A1, so the compression coding rule for the original Data-A1 may not be applicable to Data-A2. Therefore, in this implementation, the actual compression ratio of Data-A2 can be obtained, and the recommendation record is updated based on at least the actual compression ratio and the compression coding rule used for Data-A2. For example, the recommendation record is updated with the compression coding rule used for Data-A2, and the compression ratio information is updated based on the actual compression ratio of Data-A2.

[0048] In actual service scenarios, the amount of data to be compressed may be large, so the data to be compressed can be divided into multiple data units as needed. In such scenarios, the data compression method of this implementation can be used to improve compression efficiency. For example, the object to be compressed may include a data unit obtained by dividing the data to be compressed, and the previously compressed object may include other, previously compressed data units obtained by dividing the same data.

[0049] For example, as shown in Figure 2, data-B can be divided into a total of four data units to be compressed, from data-b1 to data-b4.

[0050] For example, in the case of data-b1, during the initial compression, previously compressed objects cannot be referenced, the recommendation record is empty, and there are no recommended compression coding rules that satisfy the compression ratio conditions. Thus, the normal compression coding process is initiated to obtain estimated compression ratios for multiple compression coding rules for the object to be compressed, a target compression coding rule is selected based on at least the estimated compression ratios, and the object to be compressed is compressed by using the selected target compression coding rule.

[0051] Once the initial compression is complete, the current compression can be recorded in the recommendation record. For example, the target compression coding rule and the compression ratio information for the target compression coding rule are recorded in the recommendation record. Subsequently, in the compression process for data-b2, the compression coding rule can be selected by referring to the compression of data-b1. For example, the compression coding rule for compressing data-b1 and its corresponding compression ratio information are recorded in the recommendation record, and based on the compression ratio conditions, it can be decided whether to select the compression coding rule for data-b1 for coding. If the compression coding rule for data-b1 is selected, data-b2 can be compressed directly by using the compression coding rule. After data-b2 is compressed, it is decided whether to update the recommendation record based on the compression. Then, the same process can be used to implement fast compression for data-b3 and data-b4 and update the recommendation record.
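The data-b1 to data-b4 walk-through can be sketched as a loop over data units that share one recommendation record. As before, this is an illustration under assumptions: standard-library codecs stand in for the coding rules, the unit size and threshold are arbitrary, the ratio information is the fraction of space saved, and the record is updated after every unit, whereas the text leaves open whether each compression updates it.

```python
import bz2
import lzma
import zlib

# Stand-in rules for illustration only.
RULES = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def split_units(data, unit_size):
    """Divide the data to be compressed into fixed-size data units."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def compress_units(data, unit_size, threshold=0.5):
    """Compress each unit, reusing the rule recorded for earlier units when it qualifies."""
    record = None  # recommendation record shared by all units; empty at first
    log = []
    for unit in split_units(data, unit_size):
        if record is not None and record["saving"] > threshold:
            rule, source = record["rule"], "recommended"
        else:
            # Normal process: try every rule on this unit and keep the smallest result.
            rule = min(RULES, key=lambda name: len(RULES[name](unit)))
            source = "estimated"
        compressed = RULES[rule](unit)
        # Assumption: the record is refreshed with the actual result after every unit.
        record = {"rule": rule, "saving": 1.0 - len(compressed) / len(unit)}
        log.append((rule, source, len(compressed)))
    return log
```

With highly repetitive input split into four units, only the first unit goes through the normal process; the remaining three reuse the recorded rule, so the other codecs are never run on them.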

[0052] In several other examples, data table compression scenarios are used as illustrations. In distributed relational databases with LSM tree structures, data tables may be compressed multiple times, and each time new data is retrieved, it needs to be compressed. The solution in this implementation eliminates the need to repeatedly estimate the compression ratio of multiple compression coding rules, and can build a recommendation record by using compression knowledge of previously compressed objects. In subsequent other compression processes, compression coding rules can be quickly retrieved based on the recommendation record.

[0053] For example, the data in table C currently needs to be compressed. Because table C contains a large amount of data to be compressed, the data in table C is divided into a total of five data units to be compressed, from table-c1 to table-c5, based on the specified data unit size.

[0054] In the case of table-c1, during the initial compression, no previously compressed objects can be referenced, and the recommended record is empty. Thus, the normal compression coding process is initiated to obtain estimated compression ratios for multiple compression coding rules for the object to be compressed. A target compression coding rule is then selected based on at least the estimated compression ratio, and the object to be compressed is compressed using the selected target compression coding rule.

[0055] Once the initial compression is complete, the current compression can be recorded in the recommendation record. For example, the target compression coding rule and the compression ratio information for the target compression coding rule are recorded in the recommendation record. Subsequently, in the compression process for table-c2, a compression coding rule can be selected by referring to the compression of table-c1. For example, the compression coding rule for compressing table-c1 and its corresponding compression ratio information are recorded in the recommendation record, and based on the compression ratio conditions, it can be decided whether to select the compression coding rule for table-c1 for coding. If the compression coding rule for table-c1 is selected, table-c2 can be compressed directly by using that compression coding rule. After table-c2 is compressed, it is decided whether to update the recommendation record based on the compression. Then, the same process can be used to implement fast compression for table-c3 through table-c5 and to update the recommendation record.

[0056] In some other examples, a data table typically contains multiple data columns with different attributes, and these data columns can be very different, allowing for the use of different compression coding rules for different data columns within the data table. For example, in a user data table, the data column for the user's age is integer data, while the data column for the username is string data. An appropriate compression coding rule can be selected for each of these two data columns. Based on this, the object to be compressed in this implementation can include a column of data to be compressed within the data table, and the previously compressed object can include previously compressed data in the same column of the data table. In other words, this implementation distinguishes data by column, and each compression coding rule recorded in the recommended record corresponds to the data of a particular column. If a particular column of data in the data table is compressed, then updated, and subsequently needs to be compressed again, the column can be compressed based on the compression coding rule of the previously compressed data in the same column of the data table, thereby improving compression efficiency.
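A column-level recommendation record of the kind described above could be keyed by table and column. The sketch below is only illustrative: the rule names `"delta"` and `"dictionary"`, the function names, and the threshold value are hypothetical and do not come from the specification.

```python
from typing import Dict, Optional, Tuple

# (table name, column name) -> (rule name, last actual compression ratio)
ColumnKey = Tuple[str, str]
column_record: Dict[ColumnKey, Tuple[str, float]] = {}


def remember(table: str, column: str, rule: str, ratio: float) -> None:
    """Store the rule used for a column together with the ratio it achieved."""
    column_record[(table, column)] = (rule, ratio)


def recommend(table: str, column: str, min_ratio: float) -> Optional[str]:
    """Return the recorded rule for the same column if its ratio is high enough."""
    entry = column_record.get((table, column))
    if entry is None:
        return None  # this column has never been compressed before
    rule, ratio = entry
    return rule if ratio >= min_ratio else None


# Integer and string columns of the same table keep separate recommendations.
remember("user", "age", "delta", 4.0)
remember("user", "username", "dictionary", 1.2)
print(recommend("user", "age", min_ratio=2.0))
print(recommend("user", "username", min_ratio=2.0))
```

Keying by column means that a weak result for one column never blocks the reuse of a good rule for another column of the same table.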

[0057] In some other examples, the object being compressed contains a copy of the data table, and the previously compressed object contains the main data table corresponding to the copy of the data table.

[0058] As in the examples above, the copy of the data table is identical in content to the main data table and is kept consistent with it. Typically, after the main data table is updated, the copy is updated based on the update operation of the main data table. If the main data table is compressed after an update, the updated copy must also be compressed and stored. Based on this, the copy can be compressed directly by using the compression coding rules of the main data table. Optionally, when the copy of the data table is to be compressed, the compression coding rules of the main data table and the corresponding compression ratio information are retrieved from the recommended record, and the compression coding rules of the main data table are used directly for compression. In actual service, however, the actual compression ratio of the compression coding rules selected for the main data table may not be high. In that case, after the recommended record is retrieved, it can be determined that the record contains no recommended compression coding rule that satisfies the compression ratio conditions, and the normal compression coding process can be initiated to compress the copy of the data table.

[0059] In actual service, compression coding rules are often evaluated by a single criterion. They are selected primarily based on the final actual compression ratio, without comprehensively considering the additional overhead that different compression coding rules introduce for different data access modes under actual database service load. For example, assume a relatively high data compression ratio. Since compression ratio and decompression efficiency are negatively correlated, the corresponding data decompression process will take longer. If data with a relatively high compression ratio is accessed frequently and must be decompressed on each access, a relatively large additional decompression overhead is unavoidable. Based on this, in this implementation, access requirement information for the object to be compressed is obtained. Access requirement information is correlated with decompression efficiency. For example, higher access requirements call for higher decompression efficiency. Access requirement information can be configured by service stakeholders or data accessors, or determined by collecting historical access data for one or more of the object to be compressed and previously compressed objects. For example, assume that previously compressed objects are accessed frequently. In this case, the data access requirements will be relatively high, and the corresponding decompression efficiency requirements will be relatively high. Based on this, relatively high access requirements can be set. If previously compressed objects are rarely accessed, the data access requirements will be relatively low, and the corresponding decompression efficiency requirements will also be relatively low. Based on this, relatively low access requirements can be set. Optionally, historical access information for the object to be compressed can be obtained based on previously compressed data of the same type as the object to be compressed. Optionally, historical access information can include the type of data operation, such as adding, deleting, modifying, or querying, the number of operations performed on the data within a specific period, or the frequency of past access.
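One way to turn historical access data into access requirement information is to bucket the observed operation rate. The sketch below is an assumption-laden illustration: the `AccessHistory` fields, the three levels, and the per-hour thresholds are hypothetical choices, since the specification does not fix a concrete scale.

```python
from dataclasses import dataclass


@dataclass
class AccessHistory:
    queries: int         # query operations seen in the observation window
    writes: int          # add, delete, and modify operations in the window
    window_hours: float  # length of the observation window


def access_requirement(history: AccessHistory,
                       high_per_hour: float = 100.0,
                       low_per_hour: float = 1.0) -> str:
    """Map past access frequency to a coarse access requirement level."""
    rate = (history.queries + history.writes) / history.window_hours
    if rate >= high_per_hour:
        return "high"    # frequent access: decompression must be cheap
    if rate <= low_per_hour:
        return "low"     # rare access: a denser encoding is acceptable
    return "medium"


print(access_requirement(AccessHistory(queries=5000, writes=100, window_hours=24)))
print(access_requirement(AccessHistory(queries=10, writes=0, window_hours=24)))
```

A configured value supplied by service stakeholders could simply override the computed level.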

[0060] Accordingly, the compression ratio requirement includes that the compression ratio information of the recommended compression coding rule matches the access requirement information. This allows an appropriate compression coding rule to be selected based not on a single factor such as compression ratio alone, but on both the compression ratio requirement and the access requirement information, so that the final compressed data can better meet the service's access requirements.

[0061] Based on this, once access requirements information is obtained, the recommendation record can be further updated based on the actual compression ratio, the access requirements information, and the compression coding rules used for the object being compressed. For example, in the initial stages of compression, the compression coding rules used for the object being compressed begin to be written to the recommendation record. Subsequently, if the content of the data being compressed changes or access to the data changes, the compression coding rules used for the object being compressed may change. Based on this, the recommendation record needs to be updated dynamically. Optionally, in this implementation, the compression ratio information for the compression coding rules is implemented using confidence, where confidence represents the actual compression ratio of the compression coding rule, allowing the confidence of the compression coding rule to be dynamically adjusted in the ongoing compression process. Optionally, in this implementation, updating the compression coding rules and corresponding compression ratio information in the recommendation record based on the actual compression ratio and access requirements information may include the following. If the compression coding rule used for the object to be compressed is the recommended compression coding rule recorded in the recommendation record and the actual compression ratio of the object to be compressed matches the access requirements information, the confidence of the recommended compression coding rule is increased; if they do not match, the confidence of the recommended compression coding rule is decreased. Alternatively, if the compression coding rule used for the object to be compressed is different from the recommended compression coding rule recorded in the recommendation record and the actual compression ratio of the object to be compressed matches the access requirements information, the recommended compression coding rule in the recommendation record is replaced with the compression coding rule used for the object to be compressed.

[0062] From the above implementation, it can be seen that when an object to be compressed is compressed using the recommended compression coding rule recorded in the recommendation record, after the compression is complete, whether the recommended compression coding rule is appropriate is checked using the actual compression ratio and access requirement information. If the actual compression ratio of the object to be compressed matches the access requirement information, the confidence of the recommended compression coding rule increases, and the compression coding rule can continue to be used in subsequent compressions. If they do not match, the confidence of the recommended compression coding rule decreases, and if the confidence falls below a threshold, the normal compression coding process can be started to obtain a new compression coding rule. If there is no suitable compression coding rule in the recommendation record, the normal compression coding process is likewise started to obtain a new compression coding rule. The new compression coding rule will be different from the recommended compression coding rule recorded in the recommendation record. Similarly, after the compression is complete, whether the new compression coding rule is appropriate is checked using the actual compression ratio and access requirement information. If the actual compression ratio of the object to be compressed matches the access requirements information, the recommended compression coding rule in the recommendation record can be replaced with the compression coding rule used for the object to be compressed, so that the latest appropriate compression coding rule is recorded in the recommendation record for use in subsequent compressions. If the actual compression ratio of the object to be compressed does not match the access requirements information, the new compression coding rule need not be recorded, and the recommendation record need not be updated.
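The three update cases above (raise confidence, lower confidence, replace the rule) can be sketched as a single function. This is a hedged illustration: the numeric confidence scale from 0 to 1, the `step` and `initial` values, and the reduction of "matches the access requirement information" to a boolean input are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recommendation:
    rule: str
    confidence: float  # assumed to lie in [0, 1]


def update_record(rec: Optional[Recommendation], used_rule: str, matches: bool,
                  step: float = 0.1, initial: float = 0.5) -> Optional[Recommendation]:
    """Return the recommendation to keep after one object has been compressed."""
    if rec is not None and used_rule == rec.rule:
        # The recommended rule was used: reward or penalize it.
        delta = step if matches else -step
        return Recommendation(rec.rule, min(1.0, max(0.0, rec.confidence + delta)))
    if matches:
        # A different rule was used and it met the requirement: replace.
        return Recommendation(used_rule, initial)
    # A different rule was used and it did not match: leave the record as is.
    return rec


def needs_normal_process(rec: Optional[Recommendation], threshold: float) -> bool:
    """The normal process runs when the record is empty or confidence is too low."""
    return rec is None or rec.confidence < threshold
```

Repeated mismatches drive the confidence below the threshold, which is what eventually triggers a fresh run of the normal compression coding process.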

[0063] The data compression methods described herein will be explained again using implementation examples. Distributed databases, such as those with an LSM tree architecture, have a relatively large number of data compression requirements. As an example, a relational database with an LSM tree architecture is used. Like a tree structure, an LSM tree has a multi-layer structure in which the data size of the upper layers is smaller than that of the lower layers. First, the C0 layer, which resides in memory, stores all recently written data tuple records. The memory structure is ordered, in-place updates can be implemented, and queries can be performed at any time. The remaining C1 to Cn layers all reside on disk.

[0064] The data writing process is as follows. When a data write operation occurs, it is first appended to the write-ahead log (i.e., the log recorded before the actual write) and then written to the C0 layer. When the data in the C0 layer reaches a certain size, the C0 and C1 layers are merged in a manner similar to merge sort; this process is compaction. The data in the new C1 layer obtained through compaction is written sequentially to disk, replacing the old C1 layer. When the C1 layer reaches a certain size, it is in turn compacted with the lower layers. After compaction, all old files can be deleted, leaving only the new files.
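The merge-sort-like combination of the C0 and C1 layers can be illustrated with a small sketch. It assumes that each layer is a list of `(key, value)` pairs sorted by key with unique keys, and that an entry in C0 is newer than an entry with the same key in C1; the function name is hypothetical.

```python
import heapq


def merge_layers(c0, c1):
    """Merge two key-sorted layers; on duplicate keys the C0 (newer) entry wins."""
    # Tag entries so that, for equal keys, the C0 entry (tag 0) sorts first.
    tagged0 = ((key, 0, value) for key, value in c0)
    tagged1 = ((key, 1, value) for key, value in c1)
    merged = []
    for key, _, value in heapq.merge(tagged0, tagged1):
        if merged and merged[-1][0] == key:
            continue  # older version of a key that was already emitted
        merged.append((key, value))
    return merged


c0 = [(1, "a-new"), (3, "c")]
c1 = [(1, "a-old"), (2, "b")]
print(merge_layers(c0, c1))
```

Because both inputs are already sorted, the merge is a single sequential pass, which is what allows the new C1 layer to be written to disk sequentially.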

[0065] When this merge is performed, the data needs to be reordered and rewritten to a new storage file, and this new storage file needs to be compressed.

[0066] Between compression and writing, the data is divided into independent compression units based on a fixed size specified by the user when creating the table, and row and column hybrid coding compression is performed on the data in each compression unit.
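Dividing data into independent compression units of a fixed size can be sketched as below. The sketch assumes rows are already serialized to bytes and that a row is never split across units; `unit_bytes` stands for the user-specified size and is a hypothetical parameter name.

```python
from typing import List


def split_into_units(rows: List[bytes], unit_bytes: int) -> List[List[bytes]]:
    """Group serialized rows into units of at most unit_bytes each.

    A single row larger than unit_bytes still forms a unit of its own.
    """
    units: List[List[bytes]] = []
    current: List[bytes] = []
    size = 0
    for row in rows:
        # Close the current unit if adding this row would exceed the limit.
        if current and size + len(row) > unit_bytes:
            units.append(current)
            current, size = [], 0
        current.append(row)
        size += len(row)
    if current:
        units.append(current)
    return units


print(split_into_units([b"aaaa", b"bbbb", b"cccc", b"dd"], unit_bytes=8))
```

Each resulting unit can then be encoded on its own, which is what makes it possible to reuse a rule learned on one unit for the next.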

[0067] Furthermore, in a distributed database scenario, each data table has multiple data copies. To ensure the availability of service access and user stability, the compression times of the main data table and the data table copies need to be staggered to provide services alternately.

[0068] For example, a statistics module configured in the database can collect access data for data tables. Based on this access data, the access requirements for a data table can be quickly determined from past access operation types and past access frequency. For instance, a large number of point query operations and frequent table accesses indicate strict access latency requirements. In this case, the storage resources saved by data compression may not compensate for the additional computing overhead, so compression coding rules with relatively low compression ratios are used to ensure user access speed. Conversely, a small number of point query operations and infrequent table accesses indicate that users are not executing queries frequently and that access latency requirements are low. In this case, compression coding rules with relatively high compression ratios can be used.

[0069] In the data table compression process, a large number of data units need to be compressed, and a recommended record containing the current best compression coding rule is maintained. Each time, after the compression coding rule for the object to be compressed is selected, the compression coding rule is updated in the recommended record, and the recommended record maintains the current and most up-to-date compression coding rule.

[0070] Before each data unit to be compressed is compressed, the data table accepts two pieces of input information: access requirements information for the data table and the most recent recommended record.

[0071] If the recommended record is empty, or if the confidence level of the compression coding rule in the recommended record does not reach the specified threshold, the normal compression coding process is initiated to detect rules for the data unit to be compressed. In the normal compression coding process, data feature analysis is performed on each column of data, and the compression coding rules are sorted by their estimated compression ratios. Based on the access requirements information of the data table, the compression coding rule that satisfies the access requirements of the data table is ultimately selected from among the compression coding rules with different estimated compression ratios as the compression coding rule to be used for the current compression. A data unit to be compressed may contain multiple columns of different types of data. A corresponding compression coding rule can be selected for each column of data; that is, a data unit to be compressed can support two or more compression coding rules.
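The normal process for one column (estimate, sort by estimated ratio, then filter by access requirement) might look like the sketch below. Everything concrete here is an assumption for illustration: the rule names, the use of `zlib` and `lzma` as stand-ins, the relative decode costs (assigned by hand, not measured), the `MAX_COST` mapping, and estimating on a prefix sample of the column.

```python
import lzma
import zlib

# Hypothetical rules: name -> (compress function, assumed relative decode cost).
RULES = {
    "fast": (lambda d: zlib.compress(d, 1), 1),
    "balanced": (lambda d: zlib.compress(d, 9), 2),
    "dense": (lambda d: lzma.compress(d), 3),
}
# Assumed mapping from access requirement level to the highest tolerable cost.
MAX_COST = {"high": 1, "medium": 2, "low": 3}


def estimate_ratios(column: bytes, sample_size: int = 4096) -> dict:
    """Estimate each rule's ratio by compressing a prefix sample of the column."""
    sample = column[:sample_size]
    return {name: len(sample) / len(fn(sample)) for name, (fn, _) in RULES.items()}


def select_rule(column: bytes, access_requirement: str):
    """Pick the best-ratio rule whose decode cost fits the access requirement."""
    estimates = estimate_ratios(column)
    ranked = sorted(estimates, key=estimates.get, reverse=True)
    limit = MAX_COST[access_requirement]
    for name in ranked:
        if RULES[name][1] <= limit:
            return name, estimates[name]
    raise ValueError("no rule satisfies the access requirement")


column = b"2021-06-29,alice,42\n" * 400
print(select_rule(column, "high"))
print(select_rule(column, "low"))
```

With a strict access requirement only the cheapest rule qualifies, whereas a relaxed requirement lets the highest estimated ratio win.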

[0072] If the confidence level of the compression coding rule in the recommended record exceeds a specified threshold, the compression coding rule in the recommended record can be used directly as the compression coding rule for the current data unit to be compressed. The data unit to be compressed may contain multiple columns of different types of data, and a corresponding compression coding rule can be selected for each data column. In some examples, a compression coding rule can be selected for some data columns from the recommended record, but not for others. In this case, the normal compression coding process can be initiated for detection and selection.
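The mixed case, in which some columns take their rule from the recommended record and others fall back to the normal process, can be sketched as follows. The record layout (a dictionary per column with `rule` and `confidence`), the function names, and the `detect` callback standing in for the normal compression coding process are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def plan_unit(columns: List[str],
              record: Dict[str, dict],
              threshold: float,
              detect: Callable[[str], str]) -> Dict[str, Tuple[str, str]]:
    """Choose a rule for every column of a data unit.

    Returns column -> (rule name, source), where source is "recommended"
    or "normal" depending on how the rule was obtained.
    """
    plan = {}
    for column in columns:
        entry = record.get(column)
        if entry is not None and entry["confidence"] > threshold:
            plan[column] = (entry["rule"], "recommended")
        else:
            # No entry, or confidence too low: run detection for this column only.
            plan[column] = (detect(column), "normal")
    return plan


record = {
    "age": {"rule": "delta", "confidence": 0.9},
    "username": {"rule": "dictionary", "confidence": 0.2},
}
print(plan_unit(["age", "username", "city"], record, 0.6, lambda col: "rle"))
```

Detection cost is then paid only for the columns that lack a trusted recommendation.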

[0073] After one or more compression coding rules are selected for the current data unit to be compressed, the data within the data unit is scanned column by column, and compression is initiated using the selected compression coding rules. After the compression of each data column is complete, the actual compression ratio information for each data column is obtained.

[0074] Once the overall compression of the data unit is complete, the recommended record is updated based on the actual compression ratio information for each data column. If the compression coding rule used for the object to be compressed is the recommended compression coding rule recorded in the recommendation record, and the actual compression ratio of the object to be compressed matches the access requirements information, the confidence level of the recommended compression coding rule is increased; otherwise, the confidence level of the recommended compression coding rule is decreased. Alternatively, if the compression coding rule used for the object to be compressed differs from the recommended compression coding rule recorded in the recommendation record, and the actual compression ratio of the object to be compressed matches the access requirements information, the recommended compression coding rule in the recommendation record is replaced with the compression coding rule used for the object to be compressed.

[0075] When a copy of a data table is compressed, the copy can be compressed directly by using the same compression coding rules as the main data table, which contains identical data.

[0076] From the above implementation, it can be seen that the data compression method in this implementation uses a semi-supervised learning method, and therefore, more historical contextual information is fully utilized to adjust the compression coding rules in the recommended records in a timely manner.

[0077] For the main data table, the compression process requires starting the normal compression coding process on the data unit to be compressed in the pre-compression stage to detect each compression coding rule in order to obtain the latest recommended record through training. After the confidence level exceeds a specified threshold, subsequent data units to be compressed can start compression directly by using the compression coding rule in the recommended record, eliminating the need for complex rule detection. Since the data in a copy of the data table is exactly the same as the data in the main data table, the copy of the data table can be compressed directly by using the final recommended coding rule of the main data table, thereby eliminating the overhead of detecting all compression coding rules.

[0078] In this implementation, during the selection of a compression coding rule for each data unit to be compressed, the access requirements of the data table to which the data unit belongs are optimized, and the final selection of the compression coding rule is adjusted based on access delay requirements. Furthermore, the selection of a compression coding rule for each data unit to be compressed is not simply based on the data of the current unit. Instead, the optimal compression coding rule is considered comprehensively by using the actual compression ratio results of a larger number of compression units.

[0079] The above implementation proposes a data compression method for distributed relational databases with an LSM tree structure. A semi-supervised multilevel feedback mechanism is implemented in combination with hybrid compression coding rules for database rows and columns. By using access requirement information from data tables, the specific selection of compression coding rules for each data unit is modified, and the compression process can continuously learn appropriate compression rules to update recommended records, thereby accelerating the selection of coding rules throughout the compression process. In the case of data tables, the speed of detecting compression coding rules can be greatly improved based on other data replicas within the distributed database cluster and knowledge of local historical compression data, and the compression ratio is improved while maximizing data access speed by adaptively selecting appropriate compression coding rules based on the data access requirement characteristics.

[0080] In particular, when duplicating data tables, recommended records from the main data table can be used directly, thus skipping the resource-intensive rule recommendation and training process. Generally, more contextual information can be fully utilized to accelerate and optimize the detection of compressed coding rules.

[0081] In accordance with the implementation forms of the data compression method, this specification further provides implementation forms of the data compression device and the computer device to which the device is applied.

[0082] The implementations of the data compression device described herein can be applied to computer devices such as servers and terminal devices. The device can be implemented using software, hardware, or a combination of hardware and software. The software implementation is used as an example. As a logical device, the device is formed by the processor of the computer device on which it is located reading corresponding computer program instructions from non-volatile memory into memory. Regarding hardware, Figure 3 is a hardware structure diagram showing a computer device on which the data compression device described herein is located. In addition to the processor 310, memory 330, network interface 320, and non-volatile memory 340 shown in Figure 3, the server or electronic device on which the device 331 is located may typically include other hardware based on the actual functionality of the computer device. For simplicity, details are omitted here.

[0083] Figure 4 is a block diagram showing a data compression device according to an exemplary implementation of this specification. The device includes: an acquisition module 41 configured to acquire an object to be compressed; a search module 42 configured to search a recommended record to determine whether there is a recommended compression coding rule that satisfies the compression ratio conditions, wherein the recommended record is used to record the compression coding rules and corresponding compression ratio information of previously compressed objects, and the previously compressed objects are of the same type as the object to be compressed; a first compression module 43 configured to compress the object to be compressed by using a recommended compression coding rule that satisfies the compression ratio conditions, if such a recommended compression coding rule exists; and a second compression module 44 configured to, if no recommended compression coding rule exists that satisfies the compression ratio conditions, initiate a normal compression coding process to obtain estimated compression ratios for multiple compression coding rules for the object to be compressed, select a target compression coding rule based on at least the estimated compression ratios, and compress the object to be compressed by using the target compression coding rule.

[0084] Optionally, the object to be compressed includes a data unit to be compressed obtained by dividing the data to be compressed, and the previously compressed object includes another, previously compressed data unit obtained by dividing the data to be compressed.

[0085] Optionally, the object to be compressed includes a data unit to be compressed obtained by dividing the data to be compressed within a data table, and the previously compressed object includes a previously compressed data unit corresponding to the data table.

[0086] Optionally, objects to be compressed include columns of data to be compressed in the data table, and previously compressed objects include previously compressed data in the same columns as the data to be compressed in the data table.

[0087] Optionally, the object to be compressed includes a copy of a data table, and the previously compressed object includes the main data table corresponding to the copy of the data table.

[0088] Optionally, the compression ratio condition includes at least that the compression ratio information of the recommended compression coding rule is higher than the specified threshold.

[0089] Optionally, the device may further include an update module configured to update a recommended record based on the compression coding rules used for the object after the object has been compressed.

[0090] Optionally, the update module is further configured to obtain the actual compression ratio of the object to be compressed and to update the recommendation record based on at least the actual compression ratio and the compression coding rules used for the object to be compressed.

[0091] Optionally, the acquisition module may be further configured to obtain access requirement information for the object to be compressed, the access requirement information relating to the decompression efficiency of the object to be compressed. The compression ratio requirement includes ensuring that the compression ratio information of the recommended compression coding rules matches the access requirement information.

[0092] Optionally, the second compression module is further configured to select a target compression coding rule based on the estimated compression ratio and the access requirements information.

[0093] Optionally, access requirements information for objects to be compressed can be determined by retrieving historical access data for one or more objects to be compressed or objects that have been compressed in the past.

[0094] Optionally, past access data may include past access frequency.

[0095] Optionally, the update module is further configured to update the recommended record based on the actual compression ratio, access requirements information, and the compression coding rules used for the objects to be compressed.

[0096] Optionally, the compression ratio information for the compression coding rule includes a confidence level indicating the actual compression ratio of the compression coding rule. The update module is further configured to: increase the confidence level of the recommended compression coding rule if the compression coding rule used for the object to be compressed is the recommended compression coding rule recorded in the recommendation record and the actual compression ratio of the object to be compressed matches the access requirements information, and otherwise decrease the confidence level of the recommended compression coding rule; or replace the recommended compression coding rule in the recommendation record with the compression coding rule used for the object to be compressed if the compression coding rule used for the object differs from the recommended compression coding rule recorded in the recommendation record and the actual compression ratio of the object to be compressed matches the access requirements information.

[0097] Accordingly, this specification further provides a computer device comprising memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the data compression method of the foregoing implementations when executing the program.

[0098] For the implementation process of the functions and roles of each module in a data compression device, refer to the implementation process of the corresponding steps in the data compression method. For simplicity, details are omitted here.

[0099] Since the implementation configurations of the data compression device correspond to the implementation configurations of the data compression method, relevant parts can be referred to in the relevant descriptions in the implementation configurations of the method. The above device implementation configurations are merely examples. Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, may be located in one place, or may be distributed across multiple network modules. Some or all of the modules may be selected based on the actual needs to achieve the objectives of the solutions herein. Those skilled in the art will be able to understand and implement the implementation configurations herein without creative effort.

[0100] Specific implementations of this specification are described above. Other implementations are within the scope of the appended claims. In some cases, the actions or steps described in the claims may be performed in an order different from that in the implementation, and the desired results may still be achieved. Furthermore, the processes shown in the appended drawings do not necessarily require a specific execution order to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.

[0101] After considering this specification and practicing what is disclosed herein, those skilled in the art will readily conceive of other implementations of this specification. This specification is intended to cover any modifications, uses, or adaptations of this specification that follow its general principles and include common general knowledge or customary technical means in the art that are not disclosed herein. This specification and its implementations are to be considered merely examples, and the actual scope and intent of this specification are indicated by the following claims.

[0102] This specification is not limited to the exact structures shown above and in the drawings, and it should be understood that various modifications and changes may be made without departing from the scope of this specification. The scope of this specification is limited only by the appended claims.

[0103] The above description is merely a preferred implementation of this specification and is not intended to limit this specification. Any modifications, equivalent substitutions, or improvements made without departing from the spirit and principles of this specification shall be protected within the scope of this specification.

[Explanation of Symbols]

[0104]
41 Acquisition module
42 Search module
43 First compression module
44 Second compression module
310 Processor
320 Network interface
330 Memory
331 Device
340 Non-volatile memory

Claims

1. A data compression method performed by a computer, comprising: a step of obtaining an object to be compressed; a step of searching a recommendation record to determine whether there exists a recommended compression coding rule corresponding to compression ratio information that satisfies the compression ratio conditions set for the object to be compressed, wherein the recommendation record is used to record the compression coding rule of a previously compressed object and the compression ratio information corresponding to the compression coding rule, and the previously compressed object is of the same type as the object to be compressed; a step of compressing the object to be compressed by using the recommended compression coding rule if a recommended compression coding rule exists that corresponds to compression ratio information that satisfies the compression ratio conditions; and a step of, if there is no recommended compression coding rule that corresponds to compression ratio information that satisfies the compression ratio conditions, initiating a normal compression coding process to obtain the estimated compression ratios of multiple compression coding rules for the object to be compressed, selecting a target compression coding rule based on at least the estimated compression ratios, and compressing the object to be compressed by using the target compression coding rule.

2. The data compression method according to claim 1, wherein the object to be compressed is a data unit to be compressed from among a plurality of data units obtained by dividing the data to be compressed, and the previously compressed object is another previously compressed data unit from among the plurality of data units obtained by dividing the data to be compressed.

3. The data compression method according to claim 1, wherein the object to be compressed is a data unit to be compressed from among a plurality of data units obtained by dividing the data to be compressed in a data table, and the previously compressed object is a previously compressed data unit corresponding to the data table.

4. The data compression method according to claim 1, wherein the object to be compressed is a column of data to be compressed in a data table, and the previously compressed object is previously compressed data in the same column of the data table as the column of data to be compressed.

5. The data compression method according to claim 1, wherein the object to be compressed is a copy of a data table, and the previously compressed object is the main data table corresponding to the copy of the data table.

6. The data compression method according to claim 1, wherein the compression ratio condition includes at least the compression ratio information of the recommended compression coding rule being higher than a specified threshold.

7. The data compression method according to claim 1, further comprising a step of, after compressing the object to be compressed, updating the recommendation record based on the compression coding rule used for the object to be compressed and the actual compression ratio of the object to be compressed.

8. The data compression method according to claim 7, wherein the step of updating the recommendation record based on the compression coding rule used for the object to be compressed comprises the steps of: obtaining the actual compression ratio of the object to be compressed; and updating the recommendation record based on at least the actual compression ratio and the compression coding rule used for the object to be compressed.

9. The data compression method according to claim 1, 6, or 8, further comprising a step of obtaining access requirement information for the object to be compressed, wherein the access requirement information is set in relation to the decompression efficiency of the object to be compressed and is used to select the compression coding rule, and wherein the compression ratio condition includes that the compression ratio indicated in the compression ratio information corresponding to the recommended compression coding rule matches the access requirement information.

10. The data compression method according to claim 9, wherein the step of selecting the target compression coding rule based on at least the estimated compression ratio includes a step of selecting the target compression coding rule based on the estimated compression ratio and the access requirement information.

11. The data compression method according to claim 9, wherein the access requirement information of the object to be compressed is determined by obtaining one or more pieces of past access data of the object to be compressed or of the previously compressed object.

12. The data compression method according to claim 11, wherein the past access data includes past access frequency.

13. The data compression method according to claim 9 as dependent on claim 8, wherein the step of updating the recommendation record based on at least the actual compression ratio and the compression coding rule used for the object to be compressed includes a step of updating the recommendation record based on the actual compression ratio, the access requirement information, and the compression coding rule used for the object to be compressed; the compression ratio information of the previously compressed object that used the compression coding rule includes a confidence level indicating the actual compression ratio of the compression coding rule; and the step of updating the recommendation record based on the actual compression ratio, the access requirement information, and the compression coding rule used for the object to be compressed includes: a step of, if the compression coding rule used for the object to be compressed is the recommended compression coding rule recorded in the recommendation record, increasing the confidence level of the recommended compression coding rule when the actual compression ratio of the object to be compressed matches the access requirement information, and otherwise decreasing the confidence level of the recommended compression coding rule; and a step of, if the compression coding rule used for the object to be compressed was selected by the normal compression coding process and differs from the recommended compression coding rule recorded in the recommendation record, and the actual compression ratio of the object to be compressed matches the access requirement information, replacing the compression coding rule recorded in the recommendation record for the previously compressed object of the same type as the object to be compressed with the compression coding rule used for the object to be compressed.

14. A data compression device comprising: an acquisition module configured to acquire an object to be compressed; a search module configured to search a recommendation record to determine whether there exists a recommended compression coding rule that satisfies a compression ratio condition set for the object to be compressed, wherein the recommendation record is used to record the compression coding rule of a previously compressed object and the compression ratio information corresponding to that compression coding rule, and the previously compressed object is of the same type as the object to be compressed; a first compression module configured to, if a recommended compression coding rule that satisfies the compression ratio condition exists, compress the object to be compressed by using the recommended compression coding rule; and a second compression module configured to, if no recommended compression coding rule satisfies the compression ratio condition, initiate a normal compression coding process to obtain estimated compression ratios of a plurality of compression coding rules for the object to be compressed, select a target compression coding rule based on at least the estimated compression ratios, and compress the object to be compressed by using the target compression coding rule.

15. A computer device comprising: a processor; and a memory storing a computer program for causing the processor to execute the data compression method according to any one of claims 1 to 13.
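
As a non-limiting illustration, the flow recited in claims 1, 6 to 8 and 13 may be sketched in Python as follows. Everything named here is an assumption made for the example and does not appear in this specification: the three standard-library codecs standing in for "compression coding rules", the names `Recommendation`, `compress` and `update_record`, the threshold value 2.0, the estimation of compression ratios on a 4096-byte prefix, and the requirement that the confidence level be positive before a recommendation is used. The "matches the access requirement information" test of claim 13 is simplified here to the threshold check of claim 6.

```python
import bz2
import lzma
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-ins for compression coding rules; the claims name no codec.
RULES: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


@dataclass
class Recommendation:
    rule: str            # compression coding rule of a previously compressed object
    ratio: float         # compression ratio information (original size / compressed size)
    confidence: int = 1  # confidence level, adjusted after each use


def ratio_of(raw: bytes, packed: bytes) -> float:
    return len(raw) / max(len(packed), 1)


def update_record(record: Dict[str, Recommendation], obj_type: str, rule: str,
                  actual: float, threshold: float, used_recommendation: bool) -> None:
    """Update the recommendation record from the actual compression ratio."""
    ok = actual > threshold  # simplified stand-in for "matches the access requirement"
    rec: Optional[Recommendation] = record.get(obj_type)
    if used_recommendation and rec is not None:
        # The recommended rule was used: raise or lower its confidence level.
        rec.ratio = actual
        rec.confidence += 1 if ok else -1
    elif ok or rec is None:
        # A rule chosen by the normal process replaces (or creates) the entry.
        record[obj_type] = Recommendation(rule, actual)


def compress(obj: bytes, obj_type: str, record: Dict[str, Recommendation],
             threshold: float = 2.0, sample_size: int = 4096) -> Tuple[str, bytes, bool]:
    """Return (rule used, compressed bytes, whether the recommendation was used)."""
    rec = record.get(obj_type)
    if rec is not None and rec.confidence > 0 and rec.ratio > threshold:
        # A recommended rule satisfies the condition: apply it directly.
        rule, used_recommendation = rec.rule, True
    else:
        # Normal process: estimate the ratio of every rule on a sample prefix.
        sample = obj[:sample_size]
        estimates = {name: ratio_of(sample, fn(sample)) for name, fn in RULES.items()}
        rule, used_recommendation = max(estimates, key=estimates.get), False
    packed = RULES[rule](obj)
    update_record(record, obj_type, rule, ratio_of(obj, packed),
                  threshold, used_recommendation)
    return rule, packed, used_recommendation
```

In this sketch, the first call for a given object type runs the estimation over all rules, while later calls for the same type reuse the recorded rule for as long as its recorded ratio stays above the threshold and its confidence level remains positive.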