A method and system for Hudi asynchronous compression based on hotspot prediction

By using an LSTM model based on hotspot prediction to predict non-hotspot time periods and perform asynchronous compression during these periods, the problem of Hudi's asynchronous compression mechanism consuming computing resources is solved, thus improving the read performance and resource utilization of the data lake.

CN117370288B · Active · Publication Date: 2026-06-26 · SHANDONG COMP SCI CENTNAT SUPERCOMP CENT IN JINAN +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANDONG COMP SCI CENTNAT SUPERCOMP CENT IN JINAN
Filing Date
2023-09-21
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In Hudi's Read Optimized Query mode, improper selection of the execution time of the asynchronous compression mechanism can affect the efficiency of data writing and reading, and the default synchronous compression consumes a lot of computing resources, affecting the write efficiency of the data lake.

Method used

By using a hotspot prediction method, an LSTM model is used to predict non-hotspot time periods, and asynchronous compression is performed multiple times during these time periods to avoid frequent merging of .parquet and .log files, thereby improving the freshness of data files and query efficiency.

Benefits of technology

This enables reading fresher data files in Read Optimized Query mode, avoiding write amplification issues, improving resource utilization and query efficiency, and reducing computational resource waste.



Abstract

The application relates to a method and system for Hudi asynchronous compression based on hotspot prediction, which comprises the following steps: step one, inputting original data into a lake and inputting data after an update operation into the lake; step two, obtaining a data set after the data is input into the lake; step three, based on the obtained timestamp and the number of data operations, a trained LSTM model is used to predict a hotspot time period and a non-hotspot time period; and step four, based on the predicted hotspot time period and the non-hotspot time period output by the trained LSTM model, data asynchronous compression is performed. The application considers both improving query efficiency and obtaining newer data. The problem that synchronous compression of the MOR table by default causes waste of computing resources is solved, so that the load balancing of computing resources is realized, and the resource utilization rate is improved.

Description

Technical Field

[0001] This invention relates to the field of big data computing technology, and in particular to a method and system for Hudi asynchronous compression based on hotspot prediction.

Background Technology

[0002] A data lake is a collection of large-scale structured and unstructured data that can accommodate data from various data sources, including sensor data, log file data, database data, social media data, and more. Unlike traditional data warehouses, data lakes do not require data to undergo predefined structure and schema transformations before entering the lake; instead, they preserve the original format during storage.

[0003] Hudi (Hadoop Upserts, Deletes, and Incrementals) is an open-source data lake solution designed to simplify data management and processing in large-scale data lakes. Originally developed and open-sourced by Uber, it is now a top-level project of the Apache Software Foundation. Hudi's primary goal is to provide an efficient and scalable way to handle data updates, deletions, and incremental changes in data lakes. Hudi features time-travel queries, incremental fetching and change capture, data indexing, compatibility with multiple data processing engines, and data consistency, making it suitable for enterprises handling large-scale data lakes, especially those requiring frequent data updates and queries. It can help simplify data management, maintenance, and querying operations in data lakes, improving data processing efficiency and performance.

[0004] Hudi's read / write performance is closely related to its two table storage structures. MOR (Merge On Read) tables are a table storage model in data lakes where data is first written incrementally and then merged upon read. This approach offers good performance when handling a large number of write operations because data writes are append-only. However, since data merging occurs during reads, it can impact read performance to some extent. COW (Copy On Write) tables are another table storage model in data lakes where new copies are created as data is written, ensuring each version is complete and independent. This approach has advantages in ensuring data consistency and read performance because each version of the data is independent, and read operations do not require merging. However, the need to create copies can incur some storage overhead.

[0005] For MOR tables, data is stored using columnar .parquet files and row-based Avro files. Updates are recorded in incremental files, and synchronous or asynchronous compaction is then performed to generate a new version of the columnar file. MOR tables reduce data ingestion latency and have a write-performance advantage over the merge-at-write-time strategy of COW tables. Therefore, most data lake construction solutions with strict real-time requirements currently choose the MOR pattern.

[0006] In Hudi, MOR tables support three read modes: Snapshot, Incremental Query, and Read Optimized Query. In Snapshot mode, Hudi MOR tables maintain data by creating a snapshot of each write operation. Each data file contains a complete snapshot of the table's data at a specific point in time or version. Therefore, in Snapshot mode, queries run directly on these data files, providing a consistent view of the data, but potentially incurring some additional storage overhead. In Incremental Query mode, Hudi MOR tables maintain an incremental file containing data added or updated with each write operation. These incremental files are typically much smaller than the full snapshot file. During a query, Hudi first applies these incremental files and then applies the previous snapshot file to build the complete view of the data required by the query. In Read Optimized Query mode, Hudi MOR tables merge the incremental and snapshot files to create an optimized view for real-time queries. This mode balances query performance and storage overhead, offering the best read performance.

[0007] However, since the Read Optimized Query mode reads only the .parquet files and does not merge them with the .log files, the data it returns may differ from the actual, most recent data. To address this issue, Hudi offers both synchronous and asynchronous data compression mechanisms. However, the default synchronous compression consumes significant computational resources, impacting data write efficiency. Asynchronous compression also has its limitations: if its execution time is poorly chosen, queries and data updates that run while compression is in progress compete with it for computational resources, affecting both write and read efficiency.

[0008] Therefore, in Read Optimized Query mode, knowing when to perform compression to reduce merging operations when reading .parquet files and to ensure that the .log file contains the latest version of the data can undoubtedly greatly improve the read performance of MOR tables.

Summary of the Invention

[0009] To address the aforementioned issues, this invention provides a Hudi asynchronous compression method based on hotspot prediction. The invention reads all real-time update streams from the tables of a source MySQL database into Hudi, reads the Timeline data for each table, aggregates the .deltacommit files, and extracts the rows for "numWrites", "numDeletes", "numUpdateWrites", "numInserts", and "schema". The sums of these operation counts are fed into an LSTM model, which predicts operation times and the corresponding number of operations, and outputs the number of operations within future time periods. Time periods whose predicted number of operations is below a threshold are designated as non-hotspot periods, while the remaining time periods are designated as hotspot periods. Finally, asynchronous compression is triggered multiple times by a scheduled task during the non-hotspot periods. With this invention, the .parquet files read from Hudi's MOR tables in Read Optimized Query mode are fresher, which fully leverages the performance advantages of Read Optimized Query. Furthermore, it avoids the write amplification problem caused by frequent merging of .parquet and .log files in Hudi's automatic compression mechanism.

[0010] The present invention also provides a system for Hudi asynchronous data compression based on hotspot prediction, for implementing the above-mentioned Hudi asynchronous data compression method based on hotspot prediction.

[0011] Terminology Explanation:

[0012] 1. Hudi: Short for Hadoop Upserts Deletes and Incrementals, an open-source data lake designed to simplify data management and processing in large-scale data lakes.

[0013] 2. MOR Table: The "Merge On Read" table is a data table structure in Apache Hudi used to implement incremental updates and merging operations on data.

[0014] 3. Read Optimized Query Mode: A query optimization mode for the Apache Hudi data lake management framework.

[0015] 4. LSTM Model: LSTM (Long Short-Term Memory Network) is a deep learning model widely used in sequence data analysis.

[0016] 5. MySQL: A relational database management system.

[0017] 6. Flink CDC: CDC stands for Change Data Capture, which is a technology used to capture database changes.

[0018] 7. HDFS: Hadoop Distributed File System (HDFS) is a distributed file system.

[0019] 8. .hoodie folder: One of the file layouts under Hudi. The hoodie directory corresponds to the table's metadata information, including the table's version management (Timeline) and archive directory (which stores outdated instants, i.e., versions).

[0020] 9. .deltacommit file: The file in Hudi Timeline that records incremental commits (DELTA_COMMIT). Incremental commit refers to atomically writing a batch of data into a Merge On Read type table.

[0021] The technical solution of this invention is as follows:

[0022] A method for Hudi asynchronous compression based on hotspot prediction includes:

[0023] Step 1: Input the original data into the lake and the data after the update operation into the lake;

[0024] Step 2: After the data is fed into the lake, obtain the dataset, including: Hudi's Timeline generates the corresponding Instant, which records the specific type, timestamp, and status of this operation and saves it in the .deltacommit file; obtain the .deltacommit file and get the timestamp through the .deltacommit file name;

[0025] Step 3: Based on the acquired timestamps and the number of data operations, use the trained LSTM model to predict hot and non-hot time periods;

[0026] Step 4: Based on the predicted output of the trained LSTM model, perform asynchronous data compression multiple times for each of the non-hotspot time periods.

[0027] According to a preferred embodiment of the present invention, in this method, data ingestion into the lake includes:

[0028] Incremental acquisition is used to import raw data into the lake, including reading data stored in the source database MySQL and importing it into the lake. All tables in the source database MySQL that are imported into the lake have separate directories.

[0029] The data is fed into the data lake using a stream processing approach. This includes reading and feeding the data into the lake, with each table in the MySQL source database having its own directory. When a record with a changed field value is re-entered into the lake, the data is updated by first deleting the original record and then inserting a new record.
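The delete-then-insert handling of updates described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the event dictionaries, the `op` values, and the function name `expand_change` are hypothetical and do not correspond to Flink CDC's actual record types.

```python
def expand_change(event):
    """Turn one change event into the row-level operations applied to the lake.

    `event` is a hypothetical dict of the form
    {"op": "insert" | "update" | "delete", "before": row or None, "after": row or None}.
    """
    op = event["op"]
    if op == "insert":
        return [("insert", event["after"])]
    if op == "delete":
        return [("delete", event["before"])]
    if op == "update":
        # An update is applied as: delete the original record, then insert the new one.
        return [("delete", event["before"]), ("insert", event["after"])]
    raise ValueError(f"unknown op: {op!r}")


# Hypothetical update event for a record whose "name" field changed.
update = {
    "op": "update",
    "before": {"id": 1, "name": "old"},
    "after": {"id": 1, "name": "new"},
}
operations = expand_change(update)
```

Here `operations` holds a delete of the old row followed by an insert of the new row, which is the order the paragraph above describes.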

[0030] According to a preferred embodiment of the present invention, in this method, dataset acquisition includes:

[0031] First, connect to Hudi's underlying storage HDFS and automatically read and download all .deltacommit files in the .hoodie folder;

[0032] Secondly, the downloaded .deltacommit files are aggregated, and the rows recording the number of data operations, including "numWrites", "numDeletes", "numUpdateWrites", "numInserts", and the metadata row "schema", are extracted. "numWrites" records the number of records written in a single ingestion; "numDeletes" records the number of records deleted in a single ingestion; "numUpdateWrites" records the number of records written in a single ingestion, excluding records written for the first time; "numInserts" records the number of records written for the first time. A "CommitTime" column is added to indicate the time of the data operation, with the value being the timestamp from the corresponding .deltacommit file.

[0033] Finally, convert and save the rows of records showing the number of operations and the CommitTime column of the summarized data into CSV format.
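The aggregation and CSV export steps above could look roughly like the following Python sketch. It assumes each .deltacommit file has already been downloaded and contains JSON with a `partitionToWriteStats` mapping whose entries carry the four counters; that layout, the sample file name, and the helper names `summarize_deltacommit` and `rows_to_csv` are assumptions for illustration, and the layout may differ between Hudi versions.

```python
import csv
import io
import json

COUNT_FIELDS = ("numWrites", "numDeletes", "numUpdateWrites", "numInserts")


def summarize_deltacommit(file_name, json_text):
    """Return one row of summed operation counts for a single .deltacommit file."""
    # The timestamp is taken from the file name, i.e. the part before the first dot.
    row = {"CommitTime": file_name.split(".")[0]}
    for field in COUNT_FIELDS:
        row[field] = 0
    metadata = json.loads(json_text)
    # Assumed layout: {"partitionToWriteStats": {partition: [write_stat, ...]}}
    for stats in metadata.get("partitionToWriteStats", {}).values():
        for stat in stats:
            for field in COUNT_FIELDS:
                row[field] += stat.get(field, 0)
    return row


def rows_to_csv(rows):
    """Serialize the summary rows, including the CommitTime column, as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=("CommitTime",) + COUNT_FIELDS, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical file content with two partitions.
sample = json.dumps({
    "partitionToWriteStats": {
        "p1": [{"numWrites": 5, "numDeletes": 1, "numUpdateWrites": 2, "numInserts": 3}],
        "p2": [{"numWrites": 4, "numDeletes": 0, "numUpdateWrites": 4, "numInserts": 0}],
    }
})
row = summarize_deltacommit("20230921103000.deltacommit", sample)
csv_text = rows_to_csv([row])
```

Reading the files from HDFS is left out on purpose; any HDFS client could supply the `file_name` and `json_text` arguments.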

[0034] According to a preferred embodiment of the present invention, based on the acquired timestamps and the number of data operations (i.e., the rows recording the number of data operations), a trained LSTM model is used to predict hot and non-hot time periods, including:

[0035] The LSTM model consists of three LSTM network layers and one fully connected layer. The three LSTM network layers include the first LSTM network, the second LSTM network, and the third LSTM network, with different numbers of neurons set in the three LSTM network layers.

[0036] The first layer of the LSTM network is used to capture short-term dependencies in the input sequence and learn local patterns and features of the input sequence.

[0037] The input to the second-layer LSTM network is the output of the first-layer LSTM network, which is used to further capture the moderate long-term dependencies of the input sequence based on the first-layer LSTM network.

[0038] The input to the third LSTM network is the output of the second LSTM network. The third LSTM network is used to learn higher-level sequence features and patterns.

[0039] The fully connected layer is used to transform the output of the third-layer LSTM network into the final prediction or output; the fully connected layer also includes the ReLU activation function.
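To make the layer structure concrete, the following is a dependency-free Python sketch of the forward pass through three stacked LSTM layers followed by a fully connected layer with ReLU. It is a toy with random, untrained weights, not the patented model: the hidden sizes (8, 4, 2), the weight initialization, and the class names `LSTMLayer` and `StackedLSTM` are assumptions, and a real implementation would normally use a deep learning framework and include training.

```python
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMLayer:
    """One LSTM layer with randomly initialized (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        width = input_size + hidden_size
        # One weight matrix and bias vector per gate:
        # 0 = input gate, 1 = forget gate, 2 = cell candidate, 3 = output gate.
        self.weights = [[[rng.uniform(-0.5, 0.5) for _ in range(width)]
                         for _ in range(hidden_size)] for _ in range(4)]
        self.biases = [[0.0] * hidden_size for _ in range(4)]

    def forward(self, sequence):
        """Map a list of input vectors to the list of hidden states, one per time step."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in sequence:
            z = list(x) + h  # current input concatenated with previous hidden state
            pre = [[sum(w * v for w, v in zip(row, z)) + b
                    for row, b in zip(self.weights[g], self.biases[g])]
                   for g in range(4)]
            i_gate = [_sigmoid(v) for v in pre[0]]
            f_gate = [_sigmoid(v) for v in pre[1]]
            cand = [math.tanh(v) for v in pre[2]]
            o_gate = [_sigmoid(v) for v in pre[3]]
            # Forget part of the old cell state, then add the gated candidate.
            c = [f * c_prev + i * g for f, c_prev, i, g in zip(f_gate, c, i_gate, cand)]
            h = [o * math.tanh(c_k) for o, c_k in zip(o_gate, c)]
            outputs.append(h)
        return outputs


class StackedLSTM:
    """Three LSTM layers with different sizes, then a dense layer with ReLU."""

    def __init__(self, hidden_sizes=(8, 4, 2), seed=0):
        rng = random.Random(seed)
        sizes = (1,) + tuple(hidden_sizes)  # one input feature: the operation count
        self.layers = [LSTMLayer(sizes[k], sizes[k + 1], rng)
                       for k in range(len(hidden_sizes))]
        self.dense_w = [rng.uniform(-0.5, 0.5) for _ in range(hidden_sizes[-1])]
        self.dense_b = 0.0

    def predict(self, window):
        sequence = [[value] for value in window]
        for layer in self.layers:  # each layer consumes the previous layer's outputs
            sequence = layer.forward(sequence)
        last_hidden = sequence[-1]
        linear = sum(w * h for w, h in zip(self.dense_w, last_hidden)) + self.dense_b
        return max(0.0, linear)  # ReLU activation


model = StackedLSTM()
prediction = model.predict([0.1, 0.5, 0.9])
```

The ReLU at the end keeps the predicted operation count non-negative, which is one plausible reason for pairing it with the output layer here.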

[0040] More preferably, in this method, the LSTM model construction includes:

[0041] First, convert the acquired timestamps to the required format. Then, sum the number of times recorded in the "numWrites", "numDeletes", "numUpdateWrites", and "numInserts" columns according to the time interval, and normalize the data. Finally, perform a sliding window operation on the normalized data.

[0042] Secondly, the LSTM model is used as the prediction model. The number of operations within each time interval is fed into the trained LSTM model, which outputs the predicted number of operations at i future time points. The predictions are sorted in ascending order of the number of operations, and the time intervals with fewer operations than G(x) are defined as non-hotspot time intervals. The formula for calculating G(x) is shown in Equation (I).

[0043]

[0044] In formula (I), $X_i$ represents the number of operations corresponding to the i-th time point, and $X_{min}$ and $X_{max}$ represent the minimum and maximum number of operations, respectively.

[0045] Finally, the LSTM model is persisted.
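The preprocessing in the first step above (timestamp conversion, per-interval summation, normalization, sliding window) might be sketched in Python as follows. The text does not name the normalization method, the interval length, or the window size, so min-max scaling, a one-hour interval, and a window of two are assumptions, as are all function names.

```python
from datetime import datetime, timedelta

COUNT_FIELDS = ("numWrites", "numDeletes", "numUpdateWrites", "numInserts")


def parse_commit_time(text):
    """Parse a commit timestamp; assumes a yyyyMMddHHmmss prefix, extra digits are ignored."""
    return datetime.strptime(text[:14], "%Y%m%d%H%M%S")


def sum_per_interval(rows, interval=timedelta(hours=1)):
    """Sum the four operation counters of all commits falling into the same interval."""
    totals = {}
    for row in rows:
        t = parse_commit_time(row["CommitTime"])
        # Round the commit time down to the start of its interval.
        bucket = datetime.min + ((t - datetime.min) // interval) * interval
        totals[bucket] = totals.get(bucket, 0) + sum(int(row[f]) for f in COUNT_FIELDS)
    return sorted(totals.items())


def min_max_normalize(values):
    """Scale values to [0, 1]; min-max scaling is an assumed choice of normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sliding_windows(values, window_size):
    """Build (input window, next value) training pairs."""
    return [(values[k:k + window_size], values[k + window_size])
            for k in range(len(values) - window_size)]


# Hypothetical rows in the shape of the CSV described earlier.
rows = [
    {"CommitTime": "20230921100500", "numWrites": 1, "numDeletes": 0, "numUpdateWrites": 0, "numInserts": 1},
    {"CommitTime": "20230921103000", "numWrites": 2, "numDeletes": 1, "numUpdateWrites": 1, "numInserts": 0},
    {"CommitTime": "20230921111500", "numWrites": 5, "numDeletes": 0, "numUpdateWrites": 5, "numInserts": 0},
    {"CommitTime": "20230921120000", "numWrites": 1, "numDeletes": 1, "numUpdateWrites": 0, "numInserts": 0},
]
series = sum_per_interval(rows)
normalized = min_max_normalize([total for _, total in series])
pairs = sliding_windows(normalized, window_size=2)
```

Each pair in `pairs` holds a window of consecutive normalized counts and the value that follows it, which is the usual shape for training a sequence predictor.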

[0046] In a further preferred embodiment, the MSE is used as the loss function in the LSTM model, as shown in Equation (II):

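[0047] $$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 \qquad \text{(II)}$$

[0048] In equation (II), $y_i$ represents the actual observed value, $\hat{y}_i$ represents the predicted value, and $m$ represents the number of samples.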
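As a small worked example, mean squared error averages the squared differences between observed and predicted values over the m samples. The Python sketch below uses made-up numbers and a hypothetical function name.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / m


# Errors are 1 and 3, so the squared errors are 1 and 9 and their mean is 5.
loss = mean_squared_error([0.0, 0.0], [1.0, 3.0])
```

Because the errors are squared, a single large miss (the 3 above) dominates the loss, so training under MSE penalizes large deviations in predicted operation counts more heavily than small ones.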

[0049] According to a preferred embodiment of the present invention, based on multiple non-hotspot time periods output by the trained LSTM model, asynchronous data compression is performed multiple times automatically, including:

[0050] First, modify the relevant configurable parameters of Flink to meet the requirements of asynchronous compression;

[0051] Secondly, the system automatically acquires multiple non-hotspot time periods from the model output and uses these time periods as a basis to perform automated asynchronous compression operations multiple times.

[0052] Finally, the .parquet file in the MOR table is read using the Read Optimized Query mode to obtain more recent data for table data reading and analysis.

[0053] Further optimization involves modifying the relevant configurable parameters of Flink, including:

[0054] Set the compaction.async.enabled configuration parameter to false to disable, for all tables, the compaction that would otherwise run inside the write job;

[0055] With the compaction.async.enabled configuration parameter set to false, compaction for all tables is instead executed asynchronously as a separate task;

[0056] Set the compaction.schedule.enabled configuration parameter to true to enable the compression schedule to be triggered synchronously by write tasks;

[0057] Set the compaction.trigger.strategy configuration parameter to time_elapsed to make the compression plan generation condition based on time;

[0058] Set the compaction.delta_seconds configuration parameter to 60 to make the compression plan generate once every 60 seconds;

[0059] Setting the hoodie.datasource.query.type configuration parameter to read_optimized changes the query strategy from the default snapshot query mode to read-optimized mode for querying data.
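The parameter settings listed above can be collected in one place and rendered as the `WITH (...)` clause of a Flink SQL table definition. The Python sketch below is only a convenience illustration: the five compaction and query options come from the text, while `connector`, `path`, and `table.type` are additional Hudi Flink options assumed here to complete the clause, and the HDFS path and function names are hypothetical.

```python
# Options taken from the parameter list above.
COMPACTION_OPTIONS = {
    "compaction.async.enabled": "false",
    "compaction.schedule.enabled": "true",
    "compaction.trigger.strategy": "time_elapsed",
    "compaction.delta_seconds": "60",
    "hoodie.datasource.query.type": "read_optimized",
}


def hudi_table_options(path):
    """Combine assumed base options for a MOR table with the compaction options."""
    options = {"connector": "hudi", "path": path, "table.type": "MERGE_ON_READ"}
    options.update(COMPACTION_OPTIONS)
    return options


def render_with_clause(options):
    """Render options as the WITH clause of a Flink SQL CREATE TABLE statement."""
    lines = [f"  '{key}' = '{value}'" for key, value in options.items()]
    return "WITH (\n" + ",\n".join(lines) + "\n)"


# Hypothetical table location.
clause = render_with_clause(hudi_table_options("hdfs:///tmp/hudi/demo_table"))
```

Keeping the options in a dictionary makes it easy to apply the same settings to every table that is ingested into the lake.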

[0060] Further preferred methods include: performing asynchronous compression-related operations multiple times automatically, including:

[0061] Automatically obtain the non-hot time periods output by the LSTM model;

[0062] Connecting a local Python program to a Linux virtual machine;

[0063] Connecting a local Python program to a MySQL library in a Linux virtual machine;

[0064] The time difference between the non-hotspot period and the current time is calculated as the waiting time. When the waiting time ends, Python automatically triggers the execution of a Linux command. The Linux command is a pre-set asynchronous compression command that executes the first asynchronous compression task according to the first compression plan, generating a .parquet file.

[0065] Five minutes are added to the non-hotspot time, and the difference between this new time and the current time is calculated as the waiting time. When the waiting time ends, Python automatically triggers the execution of an SQL command. The SQL command is a pre-set INSERT statement that inserts a marker record indicating that the compression task has been performed.

[0066] Ten minutes are added to the non-hotspot time, the second compression plan is read, and the second asynchronous compression task is executed to generate a .parquet file, thus obtaining all the latest data within the period spanning the generation of the first compression plan and the execution of the first asynchronous compression task.

[0067] Perform the above steps multiple times during off-peak hours to ensure that each query yields more recent data.
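The timing logic of these steps can be sketched in Python as follows. This is a simplified illustration under assumptions: the action names, `build_schedule`, and `run_schedule` are hypothetical, the actual Linux compaction command and SQL INSERT are represented by injected callables rather than spelled out, and the time an action takes to run is ignored.

```python
from datetime import datetime, timedelta


def build_schedule(non_hot_start, now):
    """Return (wait_seconds, action_name) pairs for one non-hotspot period."""
    steps = [
        (timedelta(0), "run_first_compaction"),            # at the non-hotspot time
        (timedelta(minutes=5), "insert_marker_row"),       # five minutes later
        (timedelta(minutes=10), "run_second_compaction"),  # ten minutes later
    ]
    schedule = []
    for offset, action in steps:
        # Waiting time = target time minus current time, never negative.
        wait = (non_hot_start + offset - now).total_seconds()
        schedule.append((max(0.0, wait), action))
    return schedule


def run_schedule(schedule, actions, sleep):
    """Run the steps in order; `actions` maps names to callables, `sleep` waits N seconds."""
    elapsed = 0.0
    for wait, name in schedule:
        sleep(max(0.0, wait - elapsed))  # only wait for the remaining time
        elapsed = max(elapsed, wait)
        actions[name]()


# Hypothetical times: the predicted non-hotspot period starts one hour from now.
now = datetime(2023, 9, 21, 1, 0, 0)
schedule = build_schedule(datetime(2023, 9, 21, 2, 0, 0), now)
```

In a real deployment, `sleep` would be `time.sleep`, and the callables would issue the pre-set compaction command on the Linux machine and the INSERT statement against MySQL; passing them in as arguments keeps the timing logic testable without either system.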

[0068] A system for Hudi asynchronous compression based on hotspot prediction, including:

[0069] The data ingestion module is configured to ingest both raw data and data after update operations.

[0070] The dataset acquisition module is configured to acquire datasets.

[0071] The model building module is configured to build and train LSTM models.

[0072] The prediction module is configured to use a trained LSTM model to predict hot and non-hot time periods.

[0073] The data compression module is configured to perform asynchronous data compression based on the predicted hot and non-hot time periods output by the trained LSTM model.

[0074] A computer device includes a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps of a method for Hudi asynchronous compression with hotspot prediction.

[0075] A computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the steps of a method for Hudi asynchronous compression with hotspot prediction.

[0076] The beneficial effects of this invention are as follows:

[0077] This invention proposes a prediction method for hot / non-hot time periods, and performs automated asynchronous data compression on Hudi during non-hot times. This ensures that data files are compressed at appropriate times, ultimately enabling Hudi's MOR table to be queried using Read Optimized Query mode. The basic MOR table file is read without needing to merge .log files, thus improving query efficiency and obtaining more up-to-date data. This solves the problem of wasted computing resources caused by default synchronous compression of MOR tables, achieving load balancing and improving resource utilization.

Attached Figure Description

[0078] Figure 1 is a flowchart of a Hudi asynchronous data compression method based on hotspot prediction;

[0079] Figure 2 is a schematic diagram of a Hudi asynchronous data compression system based on hotspot prediction;

[0080] Figure 3 is a schematic diagram of the network architecture of the LSTM model.

Detailed Implementation

[0081] The present invention will be further defined below with reference to the accompanying drawings and embodiments, but is not limited thereto.

[0082] Example 1

[0083] A Hudi asynchronous compression method based on hotspot prediction, as shown in Figure 1, includes:

[0084] Step 1: Input the original data into the lake and the data after the update operation into the lake;

[0085] Step 2: After the data is fed into the lake, the dataset is obtained, including: Hudi's Timeline generates the corresponding Instant, which records the specific type, timestamp, and status of this operation and saves it in the .deltacommit file; the .deltacommit file is obtained through automatic download, and the timestamp is obtained from the .deltacommit file name; these are used to predict hot periods and the number of operations;

[0086] Step 3: Based on the acquired timestamps and the number of data operations, use the trained LSTM model to predict hot and non-hot time periods;

[0087] The unique feature of LSTM lies in its ability to effectively handle long-term dependencies. Through gating mechanisms (forget gate, input gate, and output gate), it precisely controls the flow of information, thus better capturing long-term dependencies. Furthermore, LSTM models introduce the concepts of forget gates and memory units, allowing the model to selectively retain or forget information. This enables LSTM models to handle information changes between inputs at different time steps, thus better adapting to different sequence patterns. Finally, LSTM models can achieve a complete mapping from raw input data to the final output, simplifying the process of many tasks and reducing the need for manual feature engineering.

[0088] Step 4: Based on the predicted output of the trained LSTM model, perform asynchronous data compression multiple times automatically.

[0089] Example 2

[0090] The Hudi asynchronous compression method based on hotspot prediction is as described in Example 1, with the following differences:

[0091] In this method, data ingestion into the lake includes:

[0092] Incremental acquisition is used to import raw data into the lake, including: reading data stored in the source database MySQL through Flink CDC and importing it into the lake. All tables in the source database MySQL that are imported into the lake (specific tables are mentioned in Example 2) have separate directories; the data imported into the lake is university data distributed in 41 tables across 15 business systems.

[0093] We employ stream processing to feed newly generated or updated data from the source database MySQL into the data lake. This includes: reading and feeding the data into the lake using Flink CDC, with each table in the source database MySQL having its own directory for data feeding; and updating records where field values have changed by first deleting the original record and then inserting a new record.

[0094] In this method, dataset acquisition includes:

[0095] First, connect to Hudi's underlying storage HDFS and automatically read and download all .deltacommit files in the .hoodie folder;

[0096] Secondly, the downloaded .deltacommit files are aggregated, and the rows recording the number of data operations, including "numWrites", "numDeletes", "numUpdateWrites", "numInserts", and the metadata row "schema", are extracted. "numWrites" records the number of records written in a single ingestion; "numDeletes" records the number of records deleted in a single ingestion; "numUpdateWrites" records the number of records written in a single ingestion, excluding records written for the first time; "numInserts" records the number of records written for the first time. A "CommitTime" column is added to indicate the time of the data operation, with the value being the timestamp from the corresponding .deltacommit file.

[0097] Finally, convert and save the rows of records showing the number of operations and the CommitTime column of the summarized data into CSV format.

[0098] As shown in Figure 3, the LSTM model consists of three LSTM network layers and one fully connected layer. The three LSTM network layers include the first LSTM network, the second LSTM network, and the third LSTM network, with different numbers of neurons set in the three LSTM network layers.

[0099] The first layer of the LSTM network is used to capture short-term dependencies in the input sequence; LSTM (Long Short-Term Memory) is a recurrent neural network layer with memory units, which can efficiently process time series data. It learns local patterns and features of the input sequence.

[0100] The input to the second-layer LSTM network is the output of the first-layer LSTM network. It is used to further capture moderate long-term dependencies in the input sequence based on the first-layer LSTM network. It can understand longer contextual information, which helps the model better understand the overall structure and semantics of the input sequence.

[0101] The third-layer LSTM network is part of a deep network. Its input is the output of the second-layer LSTM network, and it further enhances the model's complexity and abstraction capabilities based on the first two layers. The third-layer LSTM network is used to learn higher-level sequence features and patterns to better represent the input data.

[0102] The fully connected layer follows the three-layer LSTM network. It transforms the output of the third LSTM layer into the final prediction or output. The fully connected layer maps the abstract representation of the LSTM layer to the appropriate output space, such as classification labels or continuous value predictions. The fully connected layer also includes a ReLU activation function to introduce non-linearity, enabling the LSTM model to capture more complex patterns.

[0103] In summary, by stacking three layers of LSTM networks and adding fully connected layers, the LSTM model can extract different levels of features and semantic information from the input sequence layer by layer, thereby improving its performance and enabling it to better handle complex sequence data tasks.

[0104] The unique feature of LSTM models lies in their ability to effectively handle long-term dependencies. Through gating mechanisms (forget gate, input gate, and output gate), they precisely control the flow of information, thus better capturing long-term dependencies. Furthermore, LSTM models introduce the concepts of forget gates and memory units, allowing them to selectively retain or forget information. This enables LSTM models to handle changes in input information across different time steps, thus better adapting to different sequence patterns. Finally, LSTM models can achieve a complete mapping from raw input data to the final output, simplifying the process for many tasks and reducing the need for manual feature engineering.

[0105] In this method, the LSTM model construction includes:

[0106] First, convert the acquired timestamps to the required format. Since the timestamps in the .deltacommit file names are plain text, the CommitTime column in the dataset is converted to the %Y-%m-%d %H:%M:%S format, replacing the original values. The counts recorded in the "numWrites", "numDeletes", "numUpdateWrites", and "numInserts" columns are then summed per time interval and normalized. Finally, a sliding window operation is performed on the normalized data.

[0107] Secondly, an LSTM model is used as the prediction model. The number of operations performed within each time interval is fed into the trained LSTM model, which outputs the predicted number of operations at several future time points (i=100). The predictions are sorted in ascending order of the number of operations, and time periods with fewer operations than the threshold G(x) are defined as non-hotspot time periods. The formula for calculating G(x) is shown in Equation (I):

[0108]

[0109] In formula (I), $X_i$ represents the number of operations corresponding to the i-th time point, and $X_{min}$ and $X_{max}$ represent the minimum and maximum number of operations, respectively.

[0110] Finally, to facilitate the continuous generation of non-hotspot time periods for data compression operations, the LSTM model is persisted.
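The selection of non-hotspot periods from the model output can be sketched in Python as below. Because Equation (I) is not reproduced in this text, the threshold is passed in as a plain argument instead of being computed as G(x); the sample predictions, the threshold value of 10, and the function name are all hypothetical.

```python
def select_non_hotspots(predicted, threshold):
    """Split predicted time points into non-hotspot and hotspot periods.

    `predicted` is a list of (time_label, predicted_operation_count) pairs.
    `threshold` stands in for G(x), whose formula is not reproduced here.
    """
    ordered = sorted(predicted, key=lambda item: item[1])  # ascending by count
    non_hot = [label for label, count in ordered if count < threshold]
    hot = [label for label, count in ordered if count >= threshold]
    return non_hot, hot


# Hypothetical model output for four future time points.
predicted = [("01:00", 12), ("02:00", 3), ("03:00", 40), ("04:00", 7)]
non_hot, hot = select_non_hotspots(predicted, threshold=10)
```

With these made-up numbers, 02:00 and 04:00 fall below the threshold and would be the candidate times for triggering asynchronous compression.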

[0111] In the LSTM model, MSE is used as the loss function, as shown in equation (II):

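[0112] $$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 \qquad \text{(II)}$$

[0113] In equation (II), $y_i$ represents the actual observed value, $\hat{y}_i$ represents the predicted value, and $m$ represents the number of samples.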

[0114] Asynchronous data compression is performed multiple times across several non-hotspot time periods predicted by the trained LSTM model, including:

[0115] First, modify the relevant configurable parameters of Flink to meet the requirements of data ingestion into the lake and asynchronous compression.

[0116] Secondly, the system automatically acquires multiple non-hotspot time periods from the model output and uses these time periods as a basis to perform automated asynchronous compression operations multiple times.

[0117] Finally, the .parquet file in the MOR table is read using the Read Optimized Query mode, which provides more efficient access to the latest data for table data reading and analysis.

[0118] Modify the relevant configurable parameters of Flink, including:

[0119] Set the compaction.async.enabled configuration parameter to false to disable, for all tables, the compaction that would otherwise run inside the write job;

[0120] With the compaction.async.enabled configuration parameter set to false, compaction for all tables is instead executed asynchronously as a separate task;

[0121] Set the compaction.schedule.enabled configuration parameter to true to enable the compression schedule to be triggered synchronously by write tasks;

[0122] Set the compaction.trigger.strategy configuration parameter to time_elapsed to make the compression plan generation condition based on time;

[0123] Set the compaction.delta_seconds configuration parameter to 60 to make the compression plan generate once every 60 seconds;

[0124] Setting the hoodie.datasource.query.type configuration parameter to read_optimized changes the query strategy from the default snapshot query mode to read-optimized query mode.

[0125] Multiple automated executions of asynchronous compression-related operations, including:

[0126] Automatically obtain the non-hotspot time periods output by the LSTM model;

[0127] Connecting a local Python program to a Linux virtual machine;

[0128] Connecting a local Python program to a MySQL library in a Linux virtual machine;

[0129] The time difference between the non-hotspot period and the current time is calculated as the waiting time. When the waiting time ends, Python automatically triggers the execution of a Linux command. The Linux command is a pre-set asynchronous compression command that executes the first asynchronous compression task according to the first compression plan, generating a .parquet file.

[0130] Five minutes are added to the non-hotspot time, and the time difference between this new time and the current time is calculated as the waiting time. When the waiting time ends, Python automatically triggers the execution of an SQL command. The SQL command is a pre-set INSERT statement that inserts a marker message indicating that the compression task has been performed.

[0131] Because there is a time interval between the automatic generation of the compression plan based on the number of submissions and the asynchronous execution of that plan at the predicted time, new data entering the lake during this interval is not compressed into the .parquet file, which reduces data freshness to some degree. To solve this problem, this invention adds a second compression step. That is, ten minutes are added to the non-hotspot time; at that point the second compression plan is read and the second asynchronous compression task is executed to generate the .parquet file, thus capturing all the latest data within the period that spans the generation of the first compression plan and the execution of the first asynchronous compression task.

[0132] Perform the above steps in each of the multiple non-hotspot time periods to ensure that each query yields more recent data.
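The timing of the steps above can be sketched in Python as follows. This is a minimal sketch rather than the patented implementation: the function and action names are hypothetical, the example times are invented, and a step whose time has already passed is given a waiting time of zero, which is an assumption. The five-minute and ten-minute offsets are taken from the text.

```python
from datetime import datetime, timedelta

# Offsets from the description: marker INSERT at +5 min, second compression at +10 min.
MARKER_OFFSET = timedelta(minutes=5)
SECOND_COMPRESSION_OFFSET = timedelta(minutes=10)


def build_schedule(non_hotspot_time, now):
    """Return (wait_seconds, action) pairs for one predicted non-hotspot time."""
    steps = [
        (non_hotspot_time, "first_compression"),
        (non_hotspot_time + MARKER_OFFSET, "insert_marker"),
        (non_hotspot_time + SECOND_COMPRESSION_OFFSET, "second_compression"),
    ]
    # Assumption: a step whose time has already passed waits zero seconds.
    return [(max((when - now).total_seconds(), 0.0), action) for when, action in steps]


now = datetime(2023, 9, 21, 1, 0, 0)           # hypothetical current time
period_start = datetime(2023, 9, 21, 2, 0, 0)  # hypothetical model output
schedule = build_schedule(period_start, now)
for wait_seconds, action in schedule:
    print(wait_seconds, action)
```

A driver program would wait for each computed interval and then run the pre-set compression command on the Linux machine or the pre-set INSERT statement against MySQL. Those calls are omitted here because the concrete commands are not given in the text.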

[0133] Example 3

[0134] A system for Hudi asynchronous compression based on hotspot prediction, as shown in Figure 2, including:

[0135] The data ingestion module is configured to ingest both raw data and data after update operations.

[0136] The dataset acquisition module is configured to: acquire datasets;

[0137] The model building module is configured to build and train LSTM models.

[0138] The prediction module is configured to use a trained LSTM model to predict hotspot and non-hotspot time periods.

[0139] The data compression module is configured to automatically perform asynchronous data compression multiple times based on multiple non-hotspot time periods output by the trained LSTM model.

[0140] Example 4

[0141] A computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the steps of a Hudi asynchronous compression method for hotspot prediction as described in Embodiment 1 or 2.

[0142] Example 5

[0143] A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of a Hudi asynchronous compression method for hotspot prediction as described in Embodiment 1 or 2.

[0144] Obviously, the examples listed in the specific embodiments are only a part of the examples of this invention, and not all of them. All other examples obtained by those skilled in the art based on the examples of this invention without inventive effort should fall within the protection scope of this invention.

Claims

1. A method for Hudi asynchronous compression based on hotspot prediction, characterized in that it includes:
Step 1: inputting the original data into the lake and inputting the data after the update operation into the lake;
Step 2: after the data is input into the lake, obtaining the dataset, including: Hudi's Timeline generates the corresponding Instant, which records the specific type, timestamp, and status of the operation and saves it in the .deltacommit file; the .deltacommit file is obtained, and the timestamp is obtained from the .deltacommit file name;
Step 3: based on the acquired timestamps and the number of data operations, using the trained LSTM model to predict hotspot and non-hotspot time periods;
Step 4: based on the predicted output of the trained LSTM model, performing asynchronous data compression multiple times in each of the non-hotspot time periods;
wherein using the trained LSTM model to predict hotspot and non-hotspot time periods based on the acquired timestamps and the number of data operations (i.e., the rows recording the number of data operations) includes: the LSTM model consists of three LSTM network layers and one fully connected layer; the three LSTM network layers are the first, second, and third LSTM network layers, with a different number of neurons set in each; the first LSTM network layer is used to capture short-term dependencies in the input sequence and learn local patterns and features of the input sequence; the input to the second LSTM network layer is the output of the first LSTM network layer, and the second layer further captures medium- and long-term dependencies of the input sequence on that basis; the input to the third LSTM network layer is the output of the second LSTM network layer, and the third layer is used to learn higher-level sequence features and patterns; the fully connected layer is used to transform the output of the third LSTM network layer into the final prediction or output, and the fully connected layer also includes the ReLU activation function.

2. The Hudi asynchronous compression method for hotspot prediction according to claim 1, characterized in that, In this method, data ingestion into the lake includes: Incremental acquisition is used to import raw data into the lake, including reading data stored in the source database MySQL and importing it into the lake. All tables in the source database MySQL that are imported into the lake have separate directories. The data is fed into the data lake using a stream processing approach. This includes reading and feeding the data into the lake, with each table in the MySQL source database having its own directory. When a record with a changed field value is re-entered into the lake, the data is updated by first deleting the original record and then inserting a new record.

3. The Hudi asynchronous compression method for hotspot prediction according to claim 1, characterized in that, In this method, dataset acquisition includes: First, connect to Hudi's underlying storage HDFS and automatically read and download all .deltacommit files in the .hoodie folder; Secondly, the downloaded .deltacommit files are compiled, and the rows recording the number of data operations, including "numWrites", "numDeletes", "numUpdateWrites", "numInserts", and the metadata row "schema", are extracted to record the number of data entering the lake for the first time; "numWrites" refers to the number of data entering the lake in one instance; "numDeletes" refers to the number of data being deleted in one instance; "numUpdateWrites" refers to the number of data being added in one instance, but does not include the number of data added in the first instance; "numInserts" refers to the number of data entering the lake for the first time; and a CommitTime column is added to indicate the time of the data operation, with the value of the CommitTime column being the timestamp in the corresponding .deltacommit file; Finally, convert and save the rows of records showing the number of operations and the CommitTime column of the summarized data into CSV format.

4. The Hudi asynchronous compression method for hotspot prediction according to claim 1, characterized in that, in this method, the LSTM model construction includes: First, the acquired timestamps are converted to the required format; the counts recorded in the "numWrites", "numDeletes", "numUpdateWrites", and "numInserts" columns are then summed per time interval and the data are normalized; a sliding window operation is then performed on the normalized data. Secondly, using the LSTM model as the prediction model, the trained LSTM model takes as input the number of operations within the corresponding time interval and outputs the number of operations at several future time points (i). The predicted numbers of operations are sorted in ascending order, and the time intervals with fewer than G(x) operations are defined as non-hotspot time intervals. The formula for calculating G(x) is shown in Equation (I). In formula (I), the symbols denote the number of operations corresponding to the i-th time point and the maximum and minimum numbers of operations, respectively. Finally, the LSTM model is persisted.

5. The Hudi asynchronous compression method for hotspot prediction according to claim 1, characterized in that, in the LSTM model, MSE is used as the loss function, as shown in equation (II): $$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 \qquad \text{(II)}$$ In equation (II), $y_i$ represents the actual observed value, $\hat{y}_i$ represents the predicted value, and $m$ represents the number of samples.

6. A method for Hudi asynchronous compression based on hotspot prediction according to any one of claims 1-5, characterized in that automatically performing asynchronous data compression multiple times based on the multiple non-hotspot time periods output by the trained LSTM model includes:
First, modifying the relevant configurable parameters of Flink to meet the requirements of asynchronous compression; secondly, automatically acquiring multiple non-hotspot time periods from the model output and, on that basis, automatically executing the asynchronous compression operations multiple times; finally, reading the .parquet file in the MOR table using the Read Optimized Query mode to obtain more recent data for table data reading and analysis;
Modifying the relevant configurable parameters of Flink includes: setting the compaction.async.enabled configuration parameter to false to disable, for all tables, the compression that would otherwise run inside the write job (referred to here as synchronous compression), so that compression for all tables is instead executed asynchronously as a separate task; setting the compaction.schedule.enabled configuration parameter to true so that compression plans continue to be generated by the write task; setting the compaction.trigger.strategy configuration parameter to time_elapsed so that compression plans are generated based on time; setting the compaction.delta_seconds configuration parameter to 60 so that a compression plan is generated once every 60 seconds; setting the hoodie.datasource.query.type configuration parameter to read_optimized so that the query strategy changes from the default snapshot query mode to the read-optimized query mode;
Automatically executing the asynchronous compression operations multiple times includes: automatically obtaining the non-hotspot time periods output by the LSTM model; connecting a local Python program to a Linux virtual machine; connecting the local Python program to a MySQL library in the Linux virtual machine; calculating the time difference between the non-hotspot period and the current time as the waiting time, and, when the waiting time ends, having Python automatically trigger the execution of a Linux command, the Linux command being a pre-set asynchronous compression command that executes the first asynchronous compression task according to the first compression plan and generates a .parquet file; adding five minutes to the non-hotspot time, calculating the time difference between this new time and the current time as the waiting time, and, when the waiting time ends, having Python automatically trigger the execution of an SQL command, the SQL command being a pre-set INSERT statement that inserts a marker message indicating that the compression task has been performed; adding ten minutes to the non-hotspot time, reading the second compression plan, and executing the second asynchronous compression task to generate a .parquet file, thus capturing all the latest data within the period that spans the generation of the first compression plan and the execution of the first asynchronous compression task; and performing the above steps in each of the multiple non-hotspot time periods to ensure that each query yields more recent data.

7. A system for Hudi asynchronous compression based on hotspot prediction, characterized in that it includes:
a data ingestion module configured to ingest both raw data and data after update operations;
a dataset acquisition module configured to acquire datasets;
a model building module configured to build and train LSTM models;
a prediction module configured to use a trained LSTM model to predict hotspot and non-hotspot time periods;
a data compression module configured to perform asynchronous data compression based on the predicted hotspot and non-hotspot time periods output by the trained LSTM model;
wherein using the trained LSTM model to predict hotspot and non-hotspot time periods based on the acquired timestamps and the number of data operations (i.e., the rows recording the number of data operations) includes: the LSTM model consists of three LSTM network layers and one fully connected layer; the three LSTM network layers are the first, second, and third LSTM network layers, with a different number of neurons set in each; the first LSTM network layer is used to capture short-term dependencies in the input sequence and learn local patterns and features of the input sequence; the input to the second LSTM network layer is the output of the first LSTM network layer, and the second layer further captures medium- and long-term dependencies of the input sequence on that basis; the input to the third LSTM network layer is the output of the second LSTM network layer, and the third layer is used to learn higher-level sequence features and patterns; the fully connected layer is used to transform the output of the third LSTM network layer into the final prediction or output, and the fully connected layer also includes the ReLU activation function.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the Hudi asynchronous compression method for hotspot prediction as described in any one of claims 1-6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the steps of the Hudi asynchronous compression method for hotspot prediction as described in any one of claims 1-6.