Data lake-based data read / write methods, data read / write devices, and storage media

By introducing primary key index information and optimizing the format in the log files of the data lake, the IO overhead and storage space issues during data ingestion are resolved, resulting in faster data ingestion speed and more efficient storage management.

CN116521641B · Active · Publication Date: 2026-06-30 · ZHEJIANG DAHUA TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHEJIANG DAHUA TECH CO LTD
Filing Date: 2023-01-18
Publication Date: 2026-06-30


IPC: G06F16/182; G06F16/18; G06F16/13



AI Technical Summary

Technical Problem

In existing technologies, data lakes suffer from high IO overhead and severe write amplification when storing and ingesting data, which leads to slower data ingestion speed and increased storage space consumption. In particular, when adding new records in the MOR table format, the entire base file needs to be rewritten.

Method used

By adding primary key index information, including Bloom filters and primary key lists, to the log files, the log file format is optimized so that new records can be directly appended to the log files without rewriting the base file, and the log files are periodically compressed into the base file.

Benefits of technology

It reduces storage space usage and I/O overhead during data ingestion, improves the data ingestion speed of the data lake, achieves near real-time data ingestion capabilities, and reduces the pressure on the distributed storage system due to the number and size of files.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a data read / write method, data read / write device, and computer storage medium based on a data lake. The data read / write method includes: acquiring record data and acquiring the primary key of the record data; reading the first primary key list of base files for all filegroups in the data lake; determining whether the primary key of the record data exists in the first primary key list of any base file; if not, reading the second primary key list of log files for all filegroups in the data lake; determining whether the primary key of the record data exists in the second primary key list of any log file; if yes, writing the record data to the log file corresponding to the primary key in the second primary key list where the record data exists. This data read / write method can, by changing the log file format, give the primary keys of records in the log file the ability to be indexed, support writing new records to the log file, reduce IO overhead during data ingestion, and accelerate the ingestion speed of the data lake when ingesting new data.


Description

Technical Field

[0001] This application relates to the field of data query technology, and in particular to a data reading and writing method, a data reading and writing device, and a computer storage medium based on a data lake.

Background Technology

[0002] Data lakes play a crucial role in modern big data storage systems. They can store massive amounts of data of any type, including structured, semi-structured, and unstructured data. This data is continuously fed from different data sources into the DFS (Distributed File System) through the data lake. Hudi, as a data lake storage system, brings the ability for rapid data ingestion and deletion to the DFS.

[0003] However, index information can currently be stored only in the base file. When the table format is a MOR table, a newly added record cannot be appended to the log file. Instead, the entire base file must be rewritten, which incurs higher IO overhead and write amplification. This slows down data ingestion and generates more files that occupy more storage space.

Summary of the Invention

[0004] This application provides a data read / write method, a data read / write device, and a computer storage medium based on a data lake.

[0005] One technical solution adopted in this application is to provide a data read / write method based on a data lake, the data read / write method comprising:

[0006] Obtain the recorded data, and obtain the primary key of the recorded data;

[0007] Read the first primary key list of the base files of all filegroups in the data lake;

[0008] Determine whether the primary key of the recorded data exists in the first primary key list of any base file;

[0009] If not, read the second primary key list of log files for all filegroups in the data lake;

[0010] Determine whether the primary key of the recorded data exists in the second primary key list of any log file;

[0011] If so, write the recorded data to the log file whose second primary key list contains the primary key of the recorded data.

[0012] The data read / write method further includes:

[0013] When the primary key of the recorded data exists in the first primary key list of any base file, the recorded data is written to the log file of the file group where the corresponding base file is located;

[0014] Alternatively, if the primary key of the recorded data does not exist in the first primary key list of any base file and does not exist in the second primary key list of any log file, the recorded data is written to the log file of the smallest filegroup in the data lake.

[0015] The step of determining whether the primary key of the recorded data exists in the second primary key list of any log file includes:

[0016] Get the range of primary key values for each log file;

[0017] A first log file that meets the primary key value condition is obtained based on the value of the primary key of the recorded data, wherein the primary key value condition is that the value of the primary key of the recorded data is within the range of the primary key value of the first log file;

[0018] Determine whether the primary key of the recorded data exists in the second primary key list of any first log file.

[0019] The step of determining whether the primary key of the recorded data exists in the second primary key list of any first log file includes:

[0020] Get the preset Bloom filter for each first log file;

[0021] A second log file that meets the filtering condition is obtained based on the primary key of the recorded data, wherein the filtering condition is that the Bloom filter of the second log file indicates that the primary key of the recorded data may exist in that log file;

[0022] Determine whether the primary key of the recorded data exists in the second primary key list of any second log file.

[0023] The step of writing the record data into a log file corresponding to the primary key list containing the record data includes:

[0024] Write the current data block to the end of the log file corresponding to the primary key list containing the primary key of the record data;

[0025] The footer information of the current data block is set according to the primary key of the recorded data, wherein the footer information includes a Bloom filter and a list of primary keys for the current log file.

[0026] The step of setting the footer information of the current data block according to the primary key of the recorded data includes:

[0027] Retrieve the footer information of the previous data block of the current data block in the current log file;

[0028] The primary key list in the footer information of the previous data block is updated using the primary key of the current data block to generate the footer information of the current data block.

[0029] The step of updating the primary key list in the footer information of the previous data block using the primary key of the current data block to generate the footer information of the current data block includes:

[0030] Retrieve the footer information of the previous data block;

[0031] If the footer information of the previous data block is missing or empty, obtain all data blocks of the current log file and extract the primary key of all data blocks;

[0032] Generate a primary key list for the current log based on the primary keys of all data blocks and the primary key of the current data block, and generate a Bloom filter based on the primary key list for the current log.

[0033] Generate the footer information for the current data block based on the primary key list of the current log and the Bloom filter.

[0034] The data read / write method further includes:

[0035] The log data of all file groups is compressed into their respective base files according to a preset cycle.

[0036] Another technical solution adopted in this application is to provide a data read / write device, which includes a memory and a processor coupled to the memory;

[0037] The memory is used to store program data, and the processor is used to execute the program data to implement the data read / write method described above.

[0038] Another technical solution adopted in this application is to provide a computer storage medium for storing program data, which, when executed by a computer, is used to implement the data read / write method described above.

[0039] The beneficial effects of this application are as follows: The data read / write device acquires record data and obtains the primary key of the record data; reads the first primary key list of the base files of all filegroups in the data lake; determines whether the primary key of the record data exists in the first primary key list of any base file; if not, reads the second primary key list of the log files of all filegroups in the data lake; determines whether the primary key of the record data exists in the second primary key list of any log file; if so, writes the record data to the log file corresponding to the second primary key list where the primary key of the record data exists. The data read / write method of this application can, by changing the log file format, give the primary keys of records in the log file the ability to be indexed, so that when adding a record, the base file does not need to be rewritten, reducing the storage space occupied during data ingestion. Adding records supports writing to the log file, reducing IO overhead during data ingestion and accelerating the ingestion speed of the data lake when ingesting new data.

Attached Figure Description

[0040] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0041] Figure 1 is a flowchart illustrating an embodiment of the data read / write method provided in this application;

[0042] Figure 2 is a schematic diagram of the overall flow of the data read / write method provided in this application;

[0043] Figure 3 is a schematic diagram of the structure of a file group in the data lake provided in this application;

[0044] Figure 4 is a flowchart of the specific sub-steps of step S15 of the data read / write method shown in Figure 1;

[0045] Figure 5 is a flowchart of the specific sub-steps of step S16 of the data read / write method shown in Figure 1;

[0046] Figure 6 is a schematic diagram of an embodiment of the data read / write device provided in this application;

[0047] Figure 7 is a schematic diagram of the structure of an embodiment of the computer storage medium provided in this application.

Detailed Implementation

[0048] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0049] In current data lake implementations, a MOR table approach is used to achieve near real-time data retrieval capabilities. This table approach appends modification and deletion operations to a log file and writes newly added records to a new base file. The log file uses row-oriented storage similar to Avro, achieving high write performance due to its append-only capability, but poor query performance. The base file uses column-oriented storage such as Parquet. For newly added records, this file format does not support appending data to the file; the entire file must be rewritten to complete the record addition.

[0050] The drawbacks of rewriting the entire file include poor real-time write performance and severe write amplification. Currently, new data cannot be directly appended to the log file because whether a record already exists can only be determined by checking the primary key index information stored in the footer of the base file. However, the log file does not record primary key index information, making it impossible to determine whether a record is added or modified. This also leads to the data lake checking all base files to determine if a new record exists. If it doesn't exist, the data lake selects the base file with the smallest file size, reads its data into memory, and writes it along with the new data to a completely new base file.

[0051] Compared to directly appending files, this process has higher I / O overhead and write amplification, which slows down data ingestion and generates more files that occupy more storage space.
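As a rough illustration of the write amplification described above, the following Python sketch compares the bytes written when a new record forces a base file rewrite with the bytes written when the record is simply appended to a log file. The file and record sizes are assumed numbers chosen for illustration, not figures from this application, and footer or metadata overhead is ignored.

```python
def bytes_written_rewrite(base_size: int, record_size: int) -> int:
    """Rewriting the base file writes the old contents plus the new record."""
    return base_size + record_size


def bytes_written_append(record_size: int) -> int:
    """Appending to a log file writes only the new record."""
    return record_size


def write_amplification(bytes_written: int, record_size: int) -> float:
    """Ratio of bytes physically written to bytes of new data."""
    return bytes_written / record_size


# Assumed sizes: a 128 MiB base file and a 1 KiB record.
BASE_SIZE = 128 * 1024 * 1024
RECORD_SIZE = 1024

rewrite_amp = write_amplification(bytes_written_rewrite(BASE_SIZE, RECORD_SIZE), RECORD_SIZE)
append_amp = write_amplification(bytes_written_append(RECORD_SIZE), RECORD_SIZE)
print(rewrite_amp, append_amp)
```

Under these assumed sizes the rewrite path writes about five orders of magnitude more bytes than the append path for the same single record, which is the gap the method below aims to close.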

[0052] To address this, this application provides a data read / write method. This method involves adding record index information to the footer of the log file, allowing newly added records to be appended to the log file without rewriting the entire base file. This further improves the data ingestion speed of the data lake, enabling near real-time data ingestion capabilities when using MOR tables. Simultaneously, this design allows for better control over file size and quantity, thereby reducing the pressure on the distributed storage system from a large number of small files.

[0053] The following is a brief introduction to the technical terms used in this application:

[0054] 1. COW: Copy-on-Write, a table format in Hudi (a data lake storage system). It has high data ingestion latency and low query latency, and uses a columnar file format to store data.

[0055] 2. MOR: Merge-on-Read, a table format in Hudi, with low data ingestion latency but high query latency, using both columnar and row-based file formats to store data.

[0056] 3. Hudi: A data lake implementation.

[0057] 4. Parquet: A columnar file format, commonly used as the base file storage format in data lakes.

[0058] 5. Avro: A row-based file format, commonly used as the log file storage format in data lakes.

[0059] 6. DFS: Distributed File System.

[0060] 7. Record: A single piece of data in Hudi.

[0061] 8. Index: A mapping between the primary key of a record and the file path.

[0062] 9. Filegroup: Consists of a base file and several log files.

[0063] 10. Bloom filter: Returns false if the value definitely does not exist, and returns true if the value may exist.
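To make the Bloom filter behaviour in item 10 concrete, here is a minimal Python sketch. The class name, bit count, hash count, and the use of salted SHA-256 are all assumptions made for illustration; the application does not specify how its Bloom filter is built.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


bf = BloomFilter()
for pk in ("order-1", "order-2", "order-3"):
    bf.add(pk)
```

Every key that was added is reported as possibly present, while a key that was never added is usually, but not always, reported as absent. That asymmetry is why a "true" answer still has to be confirmed against the primary key list.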

[0064] With the above terms in mind, please refer to Figure 1 and Figure 2. Figure 1 is a flowchart illustrating an embodiment of the data read / write method provided in this application, and Figure 2 is a schematic diagram of the overall flow of that method.

[0065] The data read / write method of this application is applied to a data read / write device, which can be a server or a system in which the server and the data read / write device cooperate with each other. Accordingly, the various parts of the data read / write device, such as various units, sub-units, modules, and sub-modules, can all be set in the server, or they can be set in the server and the data read / write device respectively.

[0066] Furthermore, the aforementioned server can be either hardware or software. When the server is hardware, it can be implemented as a distributed server cluster consisting of multiple servers, or as a single server. When the server is software, it can be implemented as multiple software programs or software modules, such as software or software modules used to provide distributed servers, or as a single software program or software module; no specific limitation is made here. In some possible implementations, the data read / write method of this application embodiment can be implemented by a processor calling computer-readable instructions stored in memory.

[0067] Specifically, as shown in Figure 1, the data read / write method of this application embodiment includes the following steps:

[0068] Step S11: Obtain the record data and obtain the primary key of the record data.

[0069] In the embodiments of this application, as shown in Figure 2, when the data lake ingests a new record, it extracts the primary key of the record and the primary key index information stored in the footers of all base files.

[0070] Step S12: Read the first primary key list of the base files of all filegroups in the data lake.

[0071] Step S13: Determine whether the primary key of the recorded data exists in the first primary key list of any base file.

[0072] In this embodiment, the data read / write device compares the primary key of the record data with the primary keys in the primary key lists stored in the footers of all base files in the data lake, and determines whether the primary key of the record data exists in the primary key list of any base file footer. From the result, the device determines whether the record data represents a modification operation or a new record operation.

[0073] Specifically, as shown in Figure 2, if the primary key of the recorded data exists in the primary key list of a base file, the data read / write device assigns the recorded data to the filegroup of that base file and appends the modification operation of the recorded data to the log file of that filegroup. If the primary key of the recorded data does not exist in the primary key list of any base file, the method proceeds to step S14.

[0074] Step S14: Read the second primary key list of the log files of all filegroups in the data lake.

[0075] In this embodiment of the application, when the primary key of the recorded data does not exist in the primary key list of all base files, the data read / write device continues to extract the primary key index information, i.e., the primary key list, stored in the footer of the last Avro data block of all log files in the data lake.

[0076] Step S15: Determine whether the primary key of the recorded data exists in the second primary key list of any log file.

[0077] In the embodiments of this application, as shown in Figure 2, if the primary key of a record does not exist in the first primary key list of any base file and does not exist in the second primary key list of any log file, the data read / write device treats the record as a newly added record. In this case, the data read / write device selects the filegroup with the smallest base file in the data lake and then appends the operation record of the record to the log file of that filegroup.

[0078] If the primary key of the recorded data exists in the primary key list of a certain log file, the data read / write device still analyzes the recorded data as a modified record and proceeds to step S16.
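The three-way decision described in steps S12 to S16 can be sketched in Python as follows. This is only an in-memory illustration under stated assumptions: `FileGroup`, `route_record`, and the use of plain sets for the primary key lists are hypothetical names and simplifications, not structures defined by this application or by Hudi.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class FileGroup:
    name: str
    base_size: int                                        # base file size in bytes
    base_keys: Set[str] = field(default_factory=set)      # first primary key list
    log_keys: Set[str] = field(default_factory=set)       # second primary key list
    log: List[Tuple[str, str]] = field(default_factory=list)  # appended (op, key)


def route_record(groups: List[FileGroup], pk: str) -> FileGroup:
    """Append the record to some log file; no base file is ever rewritten."""
    # Steps S12/S13: key found in a base file, so this is a modification.
    for group in groups:
        if pk in group.base_keys:
            group.log.append(("update", pk))
            group.log_keys.add(pk)
            return group
    # Steps S14 to S16: key found in a log file, still a modification.
    for group in groups:
        if pk in group.log_keys:
            group.log.append(("update", pk))
            return group
    # Key found nowhere: new record, goes to the group with the smallest base file.
    target = min(groups, key=lambda group: group.base_size)
    target.log.append(("insert", pk))
    target.log_keys.add(pk)
    return target
```

For example, with two file groups whose base files hold keys `k1` and `k2`, a record with key `k1` is appended as an update to the first group's log, while a record with an unseen key is appended as an insert to whichever group has the smaller base file, and any later record with that same key is then routed to that same log as an update.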

[0079] Furthermore, the log file of this application adds a sorted list of primary keys to the content portion of the Avro data block to record the primary key information of the current data block. In addition, the log file of this application adds a Bloom filter and the minimum and maximum values of the primary keys within the current log file range to the footer to improve the efficiency of primary key queries.

[0080] Please refer to Figure 3 and Figure 4. Figure 3 is a schematic diagram of the structure of a file group in the data lake provided in this application; Figure 4 is a flowchart of the specific sub-steps of step S15 of the data read / write method shown in Figure 1.

[0081] As shown in Figure 3, this application improves the Avro data block in the log file: it adds the function of recording the primary keys of the record data in the Avro data block, and adds to the footer of the Avro data block a Bloom filter together with the minimum and maximum primary key values of the log file.

[0082] It should be noted that the primary key index information stored in the footer of the base file and the log file is of the same type: a Bloom filter, the maximum and minimum values of the primary key in the file.

[0083] Specifically, as shown in Figure 4, the data read / write method of this application embodiment includes the following steps:

[0084] Step S151: Obtain the range of primary key values for each log file.

[0085] In the embodiments of this application, as shown in Figure 3, the data read / write device can obtain the primary key value range of the log file from the footer of the last Avro data block in the log file, wherein the primary key value range is determined by the numerical range between the maximum and minimum primary key values.

[0086] Step S152: Obtain the first log file that meets the primary key value condition based on the value of the primary key of the record data, wherein the primary key value condition is that the value of the primary key of the record data is within the range of the primary key value of the first log file.

[0087] In this embodiment, the data read / write device filters out a first log file that meets the primary key value condition from all log files based on the value of the primary key of the recorded data. Specifically, by using the primary key value condition, the data read / write device can filter out some log files whose primary key value range does not include the value of the primary key of the recorded data, thereby reducing the workload of traversing log files and improving record query efficiency.

[0088] Step S153: Determine whether the primary key of the recorded data exists in the second primary key list of any first log file.

[0089] In this embodiment of the application, the data read / write device continues to query whether the primary key of the recorded data exists in the primary key list of a certain log file after filtering by primary key value conditions.

[0090] Furthermore, the data read / write device can also use the Bloom filter in the Avro data block footer to further filter the log files after filtering by primary key numerical conditions, thereby further improving the efficiency of primary key queries.

[0091] Specifically, the data read / write device inputs the primary key of the recorded data into a Bloom filter for each log file and obtains the output of the Bloom filter. If the Bloom filter returns "false", it means that the primary key of the recorded data definitely does not exist in the corresponding log file, thus filtering out that log file. If the Bloom filter returns "true", it means that the primary key of the recorded data may exist in the log file, and a primary key lookup is then performed on that log file.

[0092] In this embodiment of the application, the primary key value of a record is extracted and compared against the maximum and minimum values and the Bloom filter stored in the footer. Base files and log files that definitely do not contain the record can thus be excluded directly, reducing the number of records that must be compared one by one.

[0093] Because the Bloom filter can produce false positives, when it returns "true" it is still necessary to read all Avro data blocks in the log file and compare the primary key of the record one by one with the primary key information recorded in each block.
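The lookup in steps S151 to S153, followed by the Bloom filter check and the final confirmation, can be sketched in Python as below. Integer keys are assumed so that the minimum and maximum comparison is well defined, and `LogFooter`, `LogFile`, and `find_log_file` are hypothetical names. The Bloom filter is represented by a plain callable so that a false positive can be simulated.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LogFooter:
    min_key: int
    max_key: int
    might_contain: Callable[[int], bool]   # stand-in for the footer's Bloom filter


@dataclass
class LogFile:
    name: str
    footer: LogFooter
    block_keys: List[List[int]]            # sorted primary key list of each block


def find_log_file(log_files: List[LogFile], pk: int) -> Optional[LogFile]:
    """Return the log file that really contains pk, or None."""
    for lf in log_files:
        # S151/S152: skip files whose [min, max] range cannot contain the key.
        if not (lf.footer.min_key <= pk <= lf.footer.max_key):
            continue
        # Bloom filter: a False answer means the key is definitely absent.
        if not lf.footer.might_contain(pk):
            continue
        # A True answer may be a false positive, so confirm against the key lists.
        if any(pk in keys for keys in lf.block_keys):
            return lf
    return None


# a.log uses an always-True filter to simulate false positives.
a_log = LogFile("a.log", LogFooter(1, 50, lambda k: True), [[1, 7], [20, 50]])
b_log = LogFile("b.log", LogFooter(40, 90, lambda k: k in {40, 90}), [[40], [90]])
```

In this sketch a lookup for key 30 passes the range check and the simulated Bloom filter of `a.log` but fails the final confirmation, so it is correctly reported as absent.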

[0094] Step S16: Write the record data to the log file whose second primary key list contains the primary key of the record data.

[0095] In this embodiment of the application, if the primary key of the recorded data exists in the primary key list of a log file, the data read / write device selects the file group of the corresponding base file and appends the recorded data to the log file in that file group.

[0096] Thus, during the entire data ingestion process, for the processing of newly added records, the data read / write device no longer needs to rewrite the entire base file; instead, it uses the operation of appending to the log file.

[0097] Specifically, although the indexing speed of the record information in the log file is slower than that of the base file, the data read and write device can periodically compress the record information of the log file into the base file through the data lake's compression service, which can effectively control the time overhead of this part.

[0098] Furthermore, the data read / write device stores the primary keys of the recorded data by appending Avro data blocks to the end of the log file. Specifically, this application redesigns the Avro data block format of the log file, adding a sorted list of primary keys to the new data block content, which records the primary key information of the current data block. The footer includes a Bloom filter and the minimum and maximum values of the primary keys within the current log file range.

[0099] Please refer to Figure 5, which is a flowchart of the specific sub-steps of step S16 of the data read / write method shown in Figure 1.

[0100] Specifically, as shown in Figure 5, the data read / write method of this application embodiment includes the following steps:

[0101] Step S161: Write the current data block to the end of the log file whose second primary key list contains the primary key of the record data, to store the record data.

[0102] Step S162: Set the footer information of the current data block according to the primary key of the recorded data, wherein the footer information includes a Bloom filter and a list of primary keys of the current log file.

[0103] In this embodiment, when appending an Avro data block, the footer information of the previous Avro data block is checked, its maximum and minimum values and the sorted primary key list are extracted, and combined with the sorted primary key list of the current Avro data block, the updated maximum and minimum values are written to the footer, and the Bloom filter generated according to the sorted primary key list is written to the footer.

[0104] In this application, the new version of the Avro data block in the log file is designed to be compatible with the old version. When a new version of the Avro data block is added to an old version of the log file, if the footer data of the previous Avro data block is empty, all Avro data blocks in the file will be traversed, the primary key information will be extracted, and after sorting with the primary key information of the newly added Avro data block, the maximum value, minimum value, and Bloom filter generated based on the primary key information will be written to the footer of the Avro data block. If the footer data of the previous Avro data block is not empty, the subsequently appended Avro data block can calculate the Bloom filter and the maximum and minimum values of the primary key that it needs to set based on the information of the previous data block, without having to traverse the records in the entire file again.
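The two footer-update paths described above (incremental when the previous block has a footer, full traversal when appending to an old-format log file) can be sketched in Python as follows. `Footer` and `next_footer` are hypothetical names, integer keys are assumed, and a plain set stands in for the Bloom filter to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Footer:
    min_key: int
    max_key: int
    bloom: Set[int]   # a plain set stands in for the Bloom filter in this sketch


def next_footer(prev: Optional[Footer],
                earlier_block_keys: List[List[int]],
                new_keys: List[int]) -> Footer:
    """Build the footer for a newly appended data block."""
    if not new_keys:
        raise ValueError("a data block must carry at least one primary key")
    if prev is None:
        # Old-format log file: earlier blocks have no footer, so traverse
        # every block once and rebuild the index from all of their keys.
        seen = [k for keys in earlier_block_keys for k in keys] + list(new_keys)
        return Footer(min(seen), max(seen), set(seen))
    # New-format log file: extend the previous footer without rereading blocks.
    return Footer(min(prev.min_key, min(new_keys)),
                  max(prev.max_key, max(new_keys)),
                  prev.bloom | set(new_keys))


# First append to a legacy file pays the traversal cost once...
legacy = next_footer(None, [[5, 9], [2]], [7])
# ...and later appends only need the previous footer.
incremental = next_footer(legacy, [], [11])
```

The design point this illustrates is that the full traversal happens only on the first append to an old-format log file; every later append reuses the footer written by the block before it.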

[0105] It's important to note that when a new version of the Avro data block is not present in the log file, all Avro data blocks in the log file need to be traversed to determine if a record exists. The additional time overhead incurred here can be mitigated by running a data lake compression service once. Furthermore, the compatibility design for older log files means that the additional time overhead only occurs when a new version of the Avro data block is appended to the older log file for the first time; subsequent writes will not incur this overhead. In other words, this additional time overhead is a one-time event and can be largely ignored.

[0106] Thanks to the new version of Avro's data block design, which adds the function of separately storing the primary key content, during the comparison process, the data read / write device does not need to read the entire Avro data block. It can directly read the information of the primary key part in the block, reducing the IO overhead during the comparison process.
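To show why a separately stored primary key section reduces IO, the Python sketch below uses a made-up block layout: an 8-byte header with two lengths, then the key section, then the record payload. This layout, the JSON encoding of keys, and the function names are assumptions for illustration only; they are not the actual Avro or Hudi log block format.

```python
import io
import json
import struct
from typing import BinaryIO, List


def write_block(buf: BinaryIO, keys: List[str], payload: bytes) -> None:
    """Write one block: [key_len][payload_len][sorted keys][payload]."""
    key_bytes = json.dumps(sorted(keys)).encode("utf-8")
    buf.write(struct.pack(">II", len(key_bytes), len(payload)))
    buf.write(key_bytes)
    buf.write(payload)


def read_block_keys(buf: BinaryIO) -> List[str]:
    """Read only the key section of the block at the current offset."""
    key_len, payload_len = struct.unpack(">II", buf.read(8))
    keys = json.loads(buf.read(key_len).decode("utf-8"))
    buf.seek(payload_len, io.SEEK_CUR)   # skip the payload without reading it
    return keys


log = io.BytesIO()
write_block(log, ["b", "a"], b"x" * 1000)
write_block(log, ["c"], b"y" * 10)
log.seek(0)
first_keys = read_block_keys(log)
second_keys = read_block_keys(log)
```

Here the 1000-byte payload of the first block is never read during the key comparison; only the header and the key section are.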

[0107] In summary, this application proposes a data ingestion optimization method based on a data lake, which can solve the above-mentioned problems, provide near real-time data ingestion capabilities, and alleviate write amplification to a certain extent, while controlling file size. In the data ingestion optimization method of this application, a sorted primary key list is added to the content portion of the Avro data block in the log file, recording the primary key information of the current data block; a Bloom filter and the minimum and maximum values of the primary keys within the current log file range are added to the footer portion.

[0108] The purpose of this is to enable the log files to be indexed, allowing the data lake to quickly determine whether an inserted record is a new addition or a modification through the log file footer. When the data lake performs data ingestion, the inserted records can be written directly to the log files without rewriting the entire base file. The data in the log files is subsequently integrated into the base files through the data lake's compression operations, further enhancing the real-time performance of data ingestion.
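The integration of log data into the base file can be pictured with the following Python sketch, which replays log entries over an in-memory copy of the base file. The `compact` function, the dictionary representation of a base file, and the use of `None` to mark a deletion are assumptions for illustration, not the behaviour of any particular data lake service.

```python
from typing import Dict, List, Optional, Tuple

LogEntry = Tuple[str, Optional[dict]]   # (primary key, row) or (primary key, None) for a delete


def compact(base: Dict[str, dict], log: List[LogEntry]) -> Dict[str, dict]:
    """Merge log entries into a new base image; the input base is not modified."""
    merged = dict(base)
    # Replay in append order so that the latest entry for a key wins.
    for pk, row in log:
        if row is None:
            merged.pop(pk, None)
        else:
            merged[pk] = row
    return merged


base = {"k1": {"v": 1}, "k2": {"v": 2}}
log = [("k3", {"v": 3}), ("k1", {"v": 10}), ("k2", None)]
new_base = compact(base, log)
```

After this merge the new base image holds the inserted `k3`, the updated `k1`, and no `k2`, and the log entries that were merged no longer need to be scanned at query time.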

[0109] In this embodiment, the data read / write device acquires record data and its primary key; reads the first primary key list of the base files of all filegroups in the data lake; determines whether the primary key of the record data exists in the first primary key list of any base file; if not, reads the second primary key list of the log files of all filegroups in the data lake; determines whether the primary key of the record data exists in the second primary key list of any log file; if so, writes the record data to the log file corresponding to the primary key in the second primary key list where the record data exists. This data read / write method, by changing the log file format, enables the primary keys recorded in the log file to be indexable, eliminating the need to rewrite the base file when adding new records, reducing storage space usage during data ingestion, supporting writing of new records to the log file, reducing IO overhead during data ingestion, and accelerating the data lake's ingestion speed when ingesting new data.

[0110] The data read / write method proposed in this application addresses the slow ingestion of new data into a data lake. Appending new records to a log file improves the ingestion speed. At the same time, the method avoids rewriting the base file for new records, which gives better control over file size and file count and reduces storage overhead.

[0111] Furthermore, the data read / write method in this application redesigns the existing log file format of the data lake so that the primary keys of the records in a log file can be indexed. The format change adds the required data to the Avro data block while remaining compatible with the old version of the log file format, so the transition is smooth and users do not need to be aware of the modification details. The method does not introduce external third-party components, which prevents consistency issues. For newly added records, the base file does not need to be read and rewritten, which reduces storage space usage. Newly added records are appended to the log file, which reduces IO overhead during data ingestion and improves the real-time performance of data ingestion in the data lake.
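Claim 1 describes how the footer of a new data block is derived from the footer of the previous block, with a fallback that rescans all data blocks when the previous footer is missing or empty. The Python sketch below illustrates that step under assumptions: `Block`, `Footer`, `append_block`, and the Bloom filter parameters are hypothetical, and a block with `footer=None` is used to model a block written in the old log file format, which is one plausible reading of the compatibility statement in paragraph [0111] rather than something the text states outright.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

NUM_BITS = 256   # assumed Bloom filter size
NUM_HASHES = 3   # assumed number of hash functions


def _positions(key: str):
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def build_bloom(keys) -> int:
    """Build a Bloom filter, stored as an int bitmask, from a key list."""
    bits = 0
    for key in keys:
        for pos in _positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, key: str) -> bool:
    return all((bits >> pos) & 1 for pos in _positions(key))


@dataclass
class Footer:
    keys: List[str]  # sorted primary key list of the whole log file so far
    bloom: int       # Bloom filter built from that list


@dataclass
class Block:
    keys: List[str]                   # primary keys stored in this block
    footer: Optional[Footer] = None   # None models an old-format block


def append_block(log: List[Block], new_keys) -> Block:
    """Append a block and derive its footer from the previous block's footer."""
    prev_footer = log[-1].footer if log else None
    if prev_footer is not None and prev_footer.keys:
        known = prev_footer.keys  # fast path: reuse the previous footer
    else:
        # Fallback: footer missing or empty, so rescan every existing block.
        known = [k for block in log for k in block.keys]
    all_keys = sorted(set(known) | set(new_keys))
    block = Block(keys=sorted(new_keys),
                  footer=Footer(keys=all_keys, bloom=build_bloom(all_keys)))
    log.append(block)
    return block
```

With this layout only the footer of the last block has to be read to index the whole log file, and the full rescan is paid once, on the first append to a log file that has no usable footer.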

[0112] The above embodiments are merely one common example of this application and do not constitute any limitation on the technical scope of this application. Therefore, any minor modifications, equivalent changes, or alterations made to the above content based on the substance of the solution of this application shall still fall within the scope of the technical solution of this application.

[0113] Please refer to Figure 6, which is a schematic diagram of an embodiment of the data read / write device provided in this application. The data read / write device 500 of this embodiment includes a processor 51, a memory 52, an input / output device 53, and a bus 54.

[0114] The processor 51, memory 52, and input / output device 53 are respectively connected to the bus 54. The memory 52 stores program data, and the processor 51 is used to execute the program data to implement the data read / write method described in the above embodiments.

[0115] In this embodiment, processor 51 can also be referred to as a CPU (Central Processing Unit). Processor 51 may be an integrated circuit chip with signal processing capabilities. Processor 51 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The general-purpose processor can be a microprocessor, or processor 51 can be any conventional processor.

[0116] This application also provides a computer storage medium. Please refer to Figure 7, which is a schematic diagram of a computer storage medium according to an embodiment of this application. The computer storage medium 600 stores program data 61, which, when executed by a processor, implements the data read / write method of the above embodiments.

[0117] When the embodiments of this application are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0118] The above description is merely an embodiment of this application and does not limit the patent scope of this application. Equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.

Claims

1. A data read / write method based on a data lake, characterized in that the data read / write method includes:
obtaining record data and the primary key of the record data;
reading the first primary key list of the base files of all file groups in the data lake;
determining whether the primary key of the record data exists in the first primary key list of any base file;
if not, reading the second primary key list of the log files of all file groups in the data lake;
determining whether the primary key of the record data exists in the second primary key list of any log file;
if so, writing the record data into the log file whose second primary key list contains the primary key of the record data;
wherein the step of determining whether the primary key of the record data exists in the second primary key list of any log file includes:
obtaining the primary key value range of each log file;
obtaining, based on the value of the primary key of the record data, a first log file that meets the primary key value condition, wherein the primary key value condition is that the value of the primary key of the record data is within the primary key value range of the first log file;
determining whether the primary key of the record data exists in the second primary key list of any first log file;
wherein the step of determining whether the primary key of the record data exists in the second primary key list of any first log file includes:
obtaining the preset Bloom filter of each first log file;
obtaining, based on the primary key of the record data, a second log file that meets the filtering condition, wherein the filtering condition is that the output of the Bloom filter of the second log file indicates that the primary key of the record data may exist;
determining whether the primary key of the record data exists in the second primary key list of any second log file;
wherein the step of writing the record data into the log file whose second primary key list contains the primary key of the record data includes:
writing the current data block to the end of the log file whose primary key list contains the primary key of the record data;
setting the footer information of the current data block according to the primary key of the record data, wherein the footer information includes a Bloom filter and a primary key list of the current log file;
wherein the step of setting the footer information of the current data block according to the primary key of the record data includes:
retrieving the footer information of the previous data block of the current data block in the current log file;
updating the primary key list in the footer information of the previous data block using the primary key of the current data block to generate the footer information of the current data block;
wherein the step of updating the primary key list in the footer information of the previous data block using the primary key of the current data block to generate the footer information of the current data block includes:
retrieving the footer information of the previous data block;
if the footer information of the previous data block is missing or empty, obtaining all data blocks of the current log file and extracting the primary keys of all data blocks;
generating a primary key list of the current log file based on the primary keys of all data blocks and the primary key of the current data block, and generating a Bloom filter based on the primary key list of the current log file;
generating the footer information of the current data block based on the primary key list of the current log file and the Bloom filter.

2. The data read / write method according to claim 1, characterized in that the data read / write method further includes: when the primary key of the record data exists in the first primary key list of any base file, writing the record data to the log file of the file group in which the corresponding base file is located; or, when the primary key of the record data exists neither in the first primary key list of any base file nor in the second primary key list of any log file, writing the record data to the log file of the smallest file group in the data lake.

3. The data read / write method according to claim 1, characterized in that the data read / write method further includes: compacting the log data of all file groups into their respective base files according to a preset cycle.

4. A data read / write device, characterized in that the data read / write device includes a memory and a processor coupled to the memory; the memory is used to store program data, and the processor is used to execute the program data to implement the data read / write method as described in any one of claims 1 to 3.

5. A computer storage medium, characterized in that the computer storage medium is used to store program data which, when executed by a computer, implements the data read / write method as described in any one of claims 1 to 3.