Heterogeneous data source synchronization method and system, electronic device, and storage medium

By acquiring database logs from heterogeneous data sources and using the Flink distributed stream processing engine to package, filter, and merge data change information, the problem of existing tools being unable to balance real-time performance and low invasiveness is solved, achieving efficient and real-time data synchronization.

CN117435670B · Active · Publication Date: 2026-06-16 · 河钢数字技术股份有限公司 +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
河钢数字技术股份有限公司
Filing Date
2023-09-26
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing incremental data synchronization tools have their own problems in terms of real-time performance and intrusiveness, and cannot simultaneously achieve both real-time performance and low intrusiveness.

Method used

By acquiring database logs from heterogeneous data sources, the Flink distributed stream processing engine is used to package data change information into event streams, and then filter, aggregate, and merge them. Combined with the Kafka message middleware, change messages are transmitted to achieve non-intrusive real-time data synchronization.

Benefits of technology

It achieves real-time data synchronization while maintaining low intrusion, improves synchronization efficiency, and ensures real-time and efficient data transmission.

✦ Generated by Eureka AI based on patent content.

Figure CN117435670B_ABST

Abstract

This application relates to the technical field of electronic digital data processing and provides a heterogeneous data source synchronization method and system, an electronic device, and a storage medium. The heterogeneous data source synchronization method comprises the following steps: obtaining data change information from the database logs of heterogeneous data sources, and storing the data change information in a first Topic, respectively; obtaining the data change information in the first Topic, and packaging it into a plurality of event streams; converting the data change information in each event stream into MyRecord objects, partitioning the event streams to form a plurality of keyed streams, and aggregating the MyRecord objects on each keyed stream; merging the aggregated event streams into one event stream, integrating the merged event stream, and storing it in a second Topic; and distributing the data change information in the second Topic to a target database. The embodiments of this application can simultaneously achieve real-time performance, low invasiveness, and high synchronization efficiency.

Description

Technical Field

[0001] This application belongs to the field of electronic digital data processing technology, and in particular relates to a method and system for synchronizing heterogeneous data sources, an electronic device, and a storage medium.

Background Technology

[0002] In the era of big data, enterprises are accelerating and deepening their digital transformation. In this process, an enterprise first needs to use ETL (Extract, Transform, Load) tools to access and integrate heterogeneous data from different data sources. As the types and volumes of accessed data grow, efficiently synchronizing incremental data when data in different sources changes is a key problem that ETL tools urgently need to solve.

[0003] Currently available incremental data synchronization tools each have their own issues regarding real-time performance and invasiveness. From a real-time perspective, common data synchronization tools use offline batch processing methods for capturing change data, which cannot guarantee real-time performance. From an invasiveness perspective, methods that support real-time incremental data synchronization may be intrusive to the source database system.

[0004] Therefore, there is an urgent need for a heterogeneous data source synchronization method that can simultaneously achieve real-time performance and low invasiveness.

Summary of the Invention

[0005] To overcome the problem of not being able to simultaneously achieve real-time performance and low invasiveness, embodiments of this application provide a method, system, electronic device, and storage medium for synchronizing heterogeneous data sources.

[0006] This application is achieved through the following technical solution:

[0007] In a first aspect, embodiments of this application provide a method for synchronizing heterogeneous data sources, including:

[0008] Data change information from database logs in heterogeneous data sources is obtained and stored in a first Topic; the heterogeneous data sources include multiple data sources with different data structures, access methods, and formats.

[0009] Data change information in the first Topic is obtained, and based on the Flink distributed stream processing engine, the data change information in the first Topic is packaged into multiple event streams; wherein, the multiple event streams correspond one-to-one with the multiple data sources, and each event stream represents the data change information of the corresponding data source;

[0010] Convert the data change information in each event stream into a MyRecord object;

[0011] Filter out empty MyRecord objects in each event stream; partition the filtered MyRecord objects in the event stream according to the table name and row ID to form multiple keyed streams; use a tumbling window on each keyed stream to aggregate the MyRecord objects in each keyed stream.

[0012] In chronological order, the aggregated event streams are merged into a single event stream, and the merged event stream is then stored in the second Topic.

[0013] Distribute the data change information in the second Topic to the target database.
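The chronological merge described in the steps above can be pictured with a small sketch. The Java below is not the Flink code of this application; it is a plain-Java illustration, under the assumption that each aggregated stream is already ordered by timestamp, of combining several such streams into one ordered stream. `ChangeEvent` and `ChronologicalMerge` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for an aggregated change event; only the fields used here.
final class ChangeEvent {
    final String source;   // which data source the event came from
    final long timestamp;  // event time, assumed to be epoch milliseconds

    ChangeEvent(String source, long timestamp) {
        this.source = source;
        this.timestamp = timestamp;
    }
}

final class ChronologicalMerge {
    // Merges several lists, each assumed to be sorted by timestamp, into one sorted list.
    static List<ChangeEvent> merge(List<List<ChangeEvent>> streams) {
        // Heap entries are {streamIndex, positionInStream}, ordered by the timestamp they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparingLong((int[] e) -> streams.get(e[0]).get(e[1]).timestamp));
        for (int i = 0; i < streams.size(); i++) {
            if (!streams.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<ChangeEvent> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<ChangeEvent> stream = streams.get(head[0]);
            merged.add(stream.get(head[1]));
            // Advance within the stream the emitted event came from.
            if (head[1] + 1 < stream.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }
}
```

In a running Flink job the streams are unbounded, so a batch-style merge like this only conveys the ordering idea; how the merge is actually carried out in Flink is not specified in this part of the text.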

[0014] In some embodiments, converting the data change information in each event stream into a MyRecord object includes:

[0015] For each event stream, the data change information is converted into a JSON string;

[0016] Based on the Jackson library, parse the JSON string into JsonNode objects;

[0017] Based on the JSON structure, key fields are extracted from the JsonNode object. These key fields include operation type, connector type, data after operation, table name, row ID, and timestamp.

[0018] Construct a MyRecord object based on the extracted key fields and the corresponding data content.

[0019] In some embodiments, an HBase database exists among the plurality of data sources, and the aggregation of MyRecord objects on each keyed stream using a tumbling window includes:

[0020] For the keyed streams obtained by partitioning the event stream corresponding to the HBase database, the MyRecord objects in the time-based tumbling window are aggregated with the reduce function, according to the timestamp and the qualifier field; the qualifier field is the identifier of a column in the HBase database.

[0021] In some embodiments, the aggregation of MyRecord objects in a time-based tumbling window based on a timestamp and a qualifier field includes:

[0022] Parse the MyRecord object in the keyed stream into a structured format;

[0023] Two MyRecord objects are selected sequentially from the keyed stream. The qualifier fields in the two MyRecord objects are compared to determine whether the two MyRecord objects have the same qualifier field.

[0024] If such a shared qualifier field exists, compare the timestamps of the first qualifier field in the two MyRecord objects, where the first qualifier field is the qualifier field that the two MyRecord objects have in common. The larger of the two timestamps is recorded as the first timestamp and used as the timestamp of the first qualifier field in the aggregated MyRecord object. The value of the qualifier field corresponding to the first timestamp is used as the value of the first qualifier field in the aggregated MyRecord object.

[0025] If no shared qualifier field exists, the qualifier field, its value, and its timestamp from the later-selected MyRecord object are appended to the end of the earlier-selected MyRecord object.

[0026] If the number of MyRecord objects in the keyed stream is greater than or equal to 2, then a new MyRecord object is selected from the keyed stream and compared with the MyRecord object after the previous aggregation as two objects to be compared. Then, the process jumps to the step of comparing the qualifier field in the two MyRecord objects.

[0027] If the number of MyRecord objects in the keyed stream is equal to 1 at this time, output the aggregated MyRecord objects.

[0028] In some embodiments, an SQL Server database exists among the plurality of data sources, and the aggregation of MyRecord objects on each keyed stream using a tumbling window includes:

[0029] For the keyed streams obtained by partitioning the event stream corresponding to the SQL Server database, the data in the tumbling count window is aggregated with the reduce function, according to the operation type and timestamp.

[0030] In some embodiments, the aggregation of data in the tumbling count window based on the operation type and timestamp includes:

[0031] Select two MyRecord objects sequentially from the tumbling count window and compare the timestamps of the two MyRecord objects;

[0032] Check whether the operation type of the MyRecord object with the larger timestamp is deletion;

[0033] If so, the operation type of the aggregated MyRecord object is delete;

[0034] If not, check whether the operation type of each MyRecord object in the two MyRecord objects is insertion;

[0035] If the operation type of at least one of the two MyRecord objects is insertion, then the operation type of the aggregated MyRecord object is insertion; based on the operation type of the aggregated MyRecord object, the operation type of the MyRecord object with the larger timestamp is modified, and the modified MyRecord object with the larger timestamp is used as the aggregated MyRecord object;

[0036] If neither of the two MyRecord objects has an operation type of insertion, then the operation type of the aggregated MyRecord object is update; based on the operation type of the aggregated MyRecord object, modify the operation type of the MyRecord object with the larger timestamp, and use the modified MyRecord object with the larger timestamp as the aggregated MyRecord object;

[0037] If the number of MyRecord objects in the tumbling count window is greater than or equal to 2, a new MyRecord object is selected from the tumbling count window and paired with the previously aggregated MyRecord object as the two objects to be compared, and the process returns to the step of comparing the timestamps of the two MyRecord objects.

[0038] If the number of MyRecord objects in the tumbling count window is equal to 1 at this time, the aggregated MyRecord object is output.
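The pairwise rule in the steps above (delete wins if it is the later operation, otherwise insert wins if either record is an insert, otherwise update) can be sketched as a reduce function. This is a minimal plain-Java illustration, not the application's Flink code: `SqlServerReduce`, the trimmed-down `MyRecord`, and the `Op` enum are hypothetical, and the tie-break for equal timestamps is an assumption, since the text does not specify one.

```java
import java.util.List;

final class SqlServerReduce {
    enum Op { INSERT, UPDATE, DELETE }

    // Hypothetical, trimmed-down MyRecord: only the fields the rule needs.
    static final class MyRecord {
        final Op op;
        final String after;    // row data after the operation
        final long timestamp;

        MyRecord(Op op, String after, long timestamp) {
            this.op = op;
            this.after = after;
            this.timestamp = timestamp;
        }
    }

    // Aggregates two records for the same table name and row ID into one.
    static MyRecord reduce(MyRecord first, MyRecord second) {
        // Assumption: on equal timestamps, the record passed second counts as the later one.
        MyRecord later = second.timestamp >= first.timestamp ? second : first;
        MyRecord earlier = (later == second) ? first : second;

        Op op;
        if (later.op == Op.DELETE) {
            op = Op.DELETE;   // the row ends up deleted
        } else if (later.op == Op.INSERT || earlier.op == Op.INSERT) {
            op = Op.INSERT;   // the row is new within this window
        } else {
            op = Op.UPDATE;   // neither record is an insert
        }
        // The later record carries the data; only its operation type is rewritten.
        return new MyRecord(op, later.after, later.timestamp);
    }

    // Folds a non-empty window down to a single record by repeated pairwise reduction.
    static MyRecord reduceAll(List<MyRecord> window) {
        MyRecord acc = window.get(0);
        for (int i = 1; i < window.size(); i++) {
            acc = reduce(acc, window.get(i));
        }
        return acc;
    }
}
```

Under this rule an insert followed by updates collapses into a single insert carrying the latest row data, so the downstream database applies one statement instead of several.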

[0039] In some embodiments, integrating the merged event stream and storing it in the second Topic includes:

[0040] Extract key fields from the merged event stream and convert them into JSON format; the key fields include operation type, connector type, data after operation, table name, row ID, and timestamp;

[0041] The key fields in the JSON format are stored as strings in the second Topic.

[0042] In a second aspect, embodiments of this application provide a heterogeneous data source synchronization system, including:

[0043] The data extraction module is used to obtain data change information from database logs in heterogeneous data sources and store the data change information into a first Topic respectively; the heterogeneous data sources include multiple data sources with different data structures, access methods, and formats.

[0044] The data transformation module is used to obtain data change information in the first Topic and, based on the Flink distributed stream processing engine, package the data change information in the first Topic into multiple event streams; wherein, the multiple event streams correspond one-to-one with the multiple data sources, and each event stream represents the data change information of the corresponding data source;

[0045] The deserializer is used to convert data change information in each event stream into MyRecord objects;

[0046] A single stream processor is used to filter out empty MyRecord objects in each event stream; based on the table name and row ID, the MyRecord objects in the filtered event stream are partitioned to form multiple keyed streams, and a tumbling window is used on each keyed stream to aggregate the MyRecord objects in each keyed stream;

[0047] The merge processor is used to merge multiple aggregated event streams into one event stream in chronological order, integrate the merged event stream, and store it in a second Topic.

[0048] The data distribution module is used to distribute data change information from the second Topic to the target database.

[0049] In a third aspect, embodiments of this application provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the heterogeneous data source synchronization method described in any embodiment of the first aspect.

[0050] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the heterogeneous data source synchronization method described in any embodiment of the first aspect.

[0051] In a fifth aspect, embodiments of this application provide a computer program product that, when run on a terminal device, causes the terminal device to execute the heterogeneous data source synchronization method described in any embodiment of the first aspect.

[0052] It is understood that the beneficial effects of the second to fifth aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.

[0053] Compared with the related art, the embodiments of this application have the following beneficial effects. Data change information is obtained from the database logs of heterogeneous data sources and stored in a first Topic. Database logs are updated in an append-only manner at the same frequency as the data changes, which ensures that the data is real-time. Combining incremental data synchronization tools with the Kafka message middleware for transmitting change messages further guarantees the real-time performance of data extraction. At the same time, using database logs as the basis for monitoring data changes allows the logs to be read non-intrusively. Next, based on the Flink distributed stream processing engine, the data change information in the first Topic is packaged into multiple event streams, and the data change information in each event stream is converted into structured MyRecord objects, which simplifies subsequent processing in Flink. Data change information from the same data source is then filtered and aggregated, and data change information from the heterogeneous data sources is integrated and merged. The processed event stream is stored in a second Topic, which reduces unnecessary data load during subsequent distribution; in massive-data scenarios, this can significantly improve the execution efficiency of the downstream programs that replay the changes. Finally, the data change information in the second Topic is distributed to the target database. The embodiments of this application can therefore simultaneously achieve real-time performance, low invasiveness, and high synchronization efficiency.

[0054] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this specification.

Attached Figure Description

[0055] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0056] Figure 1 is the overall architecture diagram of the heterogeneous data source synchronization method;

[0057] Figure 2 is a flowchart illustrating a heterogeneous data source synchronization method provided in an embodiment of this application;

[0058] Figure 3 is a schematic diagram of the process of aggregating MyRecord objects of an HBase database according to an embodiment of this application;

[0059] Figure 4 is a flowchart illustrating a heterogeneous data source synchronization method provided in another embodiment of this application;

[0060] Figure 5 is a schematic diagram of the heterogeneous data source synchronization system provided in the embodiments of this application;

[0061] Figure 6 is a schematic diagram of the structure of a heterogeneous data source synchronization system provided in another embodiment of this application;

[0062] Figure 7 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0063] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.

[0064] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.

[0065] It should also be understood that the term "and/or" as used in this application specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

[0066] As used in this application specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if detected [the described condition or event]" may be interpreted, depending on the context, as meaning "once determined," "in response to determination," "once detected [the described condition or event]," or "in response to detection [the described condition or event]."

[0067] Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0068] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.

[0069] Currently available incremental data synchronization tools each have their own issues regarding real-time performance and invasiveness. From a real-time perspective, common data synchronization tools use offline batch processing methods for capturing change data, which cannot guarantee real-time performance. From an invasiveness perspective, methods that support real-time incremental data synchronization may be intrusive to the source database system.

[0070] Based on the above problems, this application proposes a heterogeneous data source synchronization method, which is based on a database log change data capture method and combined with an event stream processing engine for real-time processing of change data. This method can perform real-time change synchronization operations while ensuring low intrusion, and the embodiment of this application also has high synchronization efficiency.

[0071] Most databases employ a Write-Ahead Logging (WAL) mechanism to ensure the atomicity and transactionality of data operations. Monitoring database logs guarantees data consistency and prevents data loss. Database logs are updated append-only, with the frequency matching the frequency of data changes, ensuring real-time data transmission. Furthermore, combining incremental data synchronization tools with Kafka message brokers for change message delivery ensures real-time data extraction. Using database logs as the basis for monitoring data changes allows for non-intrusive log reading, preventing significant performance degradation or even database crashes.

[0072] The heterogeneous data source synchronization method provided in this application can be applied to data transmission and exchange between different regions and different database types, as well as the synchronization of data from business systems into data warehouses in big data systems.

[0073] Figure 1 is the overall architecture diagram of the heterogeneous data source synchronization method. Referring to Figure 1, data extraction is performed first: database change logs are extracted from multiple data sources, and the data change information is stored in separate Topics. Second, a Kafka consumer, in conjunction with the Flink distributed stream processing engine, consumes and processes the data change information stored in the Topics and converts it into a uniform format, which facilitates subsequent processing of data change information from heterogeneous databases. Then, the uniformly formatted data change information from the heterogeneous databases is merged and analyzed, and the processed data change information is written back to a Kafka Topic, awaiting consumption by downstream databases. Downstream target databases subscribe to and consume the processed data change information through a Kafka consumer and replay these changes in real time, completing the data synchronization of the heterogeneous data sources.

[0074] Figure 2 is a schematic flowchart of a heterogeneous data source synchronization method provided in an embodiment of this application. Referring to Figure 2, the method is described in detail below:

[0075] In S201, data change information from database logs in heterogeneous data sources is obtained and stored in the first Topic.

[0076] The heterogeneous data sources include multiple data sources with different data structures, access methods, and formats.

[0077] Optionally, real-time incremental data synchronization tools (Debezium, DataX, etc.) or custom-written extraction programs can be used to obtain data change information from database logs of heterogeneous data sources. Subsequently, the captured data change information can be stored in the first Topic for consumption in conjunction with the Kafka message middleware.

[0078] Different databases employ different methods to achieve real-time data extraction and capture based on their characteristics and database log formats. Therefore, in the embodiments of this application, when the heterogeneous data sources are SQL Server and HBase, the Debezium incremental data synchronization tool is selected to extract the SQL Server database change logs. The captured data change information is stored in a designated Topic awaiting consumption. Since there is no existing incremental synchronization tool capable of extracting data change information from HBase database logs, a custom real-time extraction program must be developed to extract the data change information from the HBase database logs and save it to a designated Topic for processing.

[0079] Optionally, Kafka's distributed, high-performance, fault-tolerant mechanism can be relied on to carry the extracted data change information to the subsequent processing stages. Even if the application stops serving or crashes suddenly, it will not miss any data changes after restarting.

[0080] Optionally, before obtaining data change information from database logs in heterogeneous data sources, the environment required to run the Flink program can be configured, including the IP address and port of the data sources to be processed, the Kafka Topic information to be received, the size of the subsequent tumbling window, and the breakpoint recovery strategy.

[0081] Optionally, after configuring the environment required for running the Flink program, the configured Flink program can be compiled and packaged into an executable JAR file, which can then be uploaded to a cloud server and started.

[0082] In S202, the data change information in the first Topic is obtained, and based on the Flink distributed stream processing engine, the data change information in the first Topic is packaged into multiple event streams.

[0083] Each event stream corresponds to a data source, and each event stream represents the data change information of the corresponding data source.

[0084] In S203, the data change information in each event stream is converted into a MyRecord object.

[0085] In some embodiments of this application, when converting the data change information in each event stream into a MyRecord object, for each event stream, the data change information is converted into a JSON string, then the JSON string is parsed into a JsonNode object based on the Jackson library, and then the key fields in the JsonNode object are extracted based on the JSON structure. Finally, the MyRecord object is constructed based on the extracted key fields and the data content corresponding to the extracted key fields.

[0086] Key fields include operation type, connector type, data after operation, table name, row ID, and timestamp.

[0087] In the embodiments of this application, UTF-8 encoding is used to convert the data change information in each event stream from a byte array into a readable JSON string. Then, using the Jackson library, the JSON string is parsed into JsonNode objects for subsequent field extraction and processing. Next, based on the JSON structure, the nested properties of the JsonNode objects are accessed to extract the required key fields. Finally, a MyRecord object is constructed based on the extracted key fields and their corresponding data content. Converting the data change information in each event stream into a MyRecord object with structured data facilitates easy processing and other operations on this data in Flink.
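The decode-and-extract steps just described can be sketched as follows. To keep the example free of external dependencies, the JSON parsing itself is left out and a nested `Map` stands in for the Jackson `JsonNode`; that substitution, the class names, and the field layout (`op`, `source.connector`, `source.table`, `after`, `rowId`, `ts_ms`) are assumptions made for illustration, since the text does not give the actual JSON structure.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

final class RecordBuilder {
    // Hypothetical MyRecord holding the six key fields listed in the text.
    static final class MyRecord {
        final String op;                  // operation type
        final String connector;           // connector type
        final Map<String, Object> after;  // data after the operation
        final String table;               // table name
        final String rowId;               // row ID
        final long timestamp;             // timestamp, assumed to be epoch milliseconds

        MyRecord(String op, String connector, Map<String, Object> after,
                 String table, String rowId, long timestamp) {
            this.op = op;
            this.connector = connector;
            this.after = after;
            this.table = table;
            this.rowId = rowId;
            this.timestamp = timestamp;
        }
    }

    // Step 1: turn the raw message bytes into a readable JSON string.
    static String decode(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8);
    }

    // Steps 3 and 4: read the nested properties and build a MyRecord.
    // The parsed tree (step 2) is passed in as a Map standing in for a JsonNode.
    // Assumes all fields are present; a null tree yields a null (empty) record.
    @SuppressWarnings("unchecked")
    static MyRecord fromParsed(Map<String, Object> root) {
        if (root == null) {
            return null;
        }
        Map<String, Object> source = (Map<String, Object>) root.get("source");
        return new MyRecord(
            (String) root.get("op"),
            (String) source.get("connector"),
            (Map<String, Object>) root.get("after"),
            (String) source.get("table"),
            (String) root.get("rowId"),
            ((Number) root.get("ts_ms")).longValue());
    }
}
```

With Jackson on the classpath, the tree would instead come from parsing the decoded string, and each `get` on the map would become a lookup on the resulting `JsonNode`.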

[0088] In some embodiments of this application, after extracting the key fields from the JsonNode object, the timestamp can be converted into a commonly used and easy-to-understand date and time format, such as January 1, 2020, 6:07:08 AM, 2020-01-01-06-07-08, etc. The specific date and time format to be converted can be determined according to actual needs.

[0089] Optionally, the extracted timestamp can be converted using Java's SimpleDateFormat class.
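A minimal sketch of such a conversion is shown below. It assumes the extracted timestamp is in epoch milliseconds and that the caller chooses the time zone; neither is stated in the text. `TimestampFormat` is a hypothetical helper name.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class TimestampFormat {
    // Formats an epoch-millisecond timestamp with the given pattern and time zone.
    static String format(long epochMillis, String pattern, TimeZone zone) {
        // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.ROOT);
        formatter.setTimeZone(zone);
        return formatter.format(new Date(epochMillis));
    }
}
```

For example, `TimestampFormat.format(1577858828000L, "yyyy-MM-dd-HH-mm-ss", TimeZone.getTimeZone("UTC"))` yields `2020-01-01-06-07-08`, matching the second format mentioned above.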

[0090] In S204, empty MyRecord objects in each event stream are filtered out; the MyRecord objects in the filtered event stream are partitioned according to the table name and row ID to form multiple keyed streams, and a tumbling window is used on each keyed stream to aggregate the MyRecord objects on that keyed stream.

[0091] In the embodiments of this application, empty MyRecord objects are first removed with a filter (FilterFunction), so that only non-empty MyRecord objects are retained. Then, the filtered MyRecord objects are partitioned according to the table name and row ID to form multiple keyed streams; within each partition, the table name and row ID are the same. Finally, a tumbling window is applied to aggregate the MyRecord objects on each keyed stream.
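The filter and partition steps can be illustrated without Flink. The sketch below is a plain-Java stand-in: it treats a `null` record as "empty" (an assumption, since the text does not define emptiness) and groups the remaining records by the pair of table name and row ID, which is what the keyed streams amount to. `KeyedPartition` and its trimmed-down `MyRecord` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class KeyedPartition {
    // Hypothetical, trimmed-down MyRecord: only the fields used for keying.
    static final class MyRecord {
        final String table;
        final String rowId;
        final long timestamp;

        MyRecord(String table, String rowId, long timestamp) {
            this.table = table;
            this.rowId = rowId;
            this.timestamp = timestamp;
        }
    }

    // Drops empty records, then groups the rest by (table name, row ID).
    static Map<List<String>, List<MyRecord>> partition(List<MyRecord> events) {
        Map<List<String>, List<MyRecord>> keyed = new LinkedHashMap<>();
        for (MyRecord record : events) {
            if (record == null) {
                continue;  // filter step: assumed meaning of an "empty" record
            }
            // A two-element list is used as the key so that both parts are compared exactly.
            List<String> key = Arrays.asList(record.table, record.rowId);
            keyed.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        }
        return keyed;
    }
}
```

Each resulting group holds every change to one row of one table, so the later aggregation only ever compares records that describe the same row.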

[0092] In some embodiments of this application, when the multiple data sources include an HBase database, aggregating the MyRecord objects on each keyed stream with a tumbling window proceeds as follows: for the keyed streams obtained by partitioning the event stream corresponding to the HBase database, the MyRecord objects in the time-based tumbling window are aggregated with the reduce function, based on the timestamp and qualifier fields.

[0093] The qualifier field is the identifier of a column in the HBase database; the size of the time-based tumbling window can be set according to the amount of data processed, and when the window size is set to 1, it indicates that synchronization is fully supported.

[0094] Figure 3 is a schematic diagram illustrating the process of aggregating MyRecord objects from an HBase database according to an embodiment of this application. Referring to Figure 3, the aggregation process is described in detail below:

[0095] When aggregating MyRecord objects in a time-based tumbling window based on timestamps and qualifier fields, the MyRecord objects in the keyed stream can be parsed into a structured format. Then, two MyRecord objects are selected sequentially from the keyed stream, and the qualifier fields in the two MyRecord objects are compared to determine whether the two MyRecord objects have the same qualifier field.

[0096] If such a shared qualifier field exists, compare the timestamps of the first qualifier field in the two MyRecord objects, where the first qualifier field is the qualifier field that the two MyRecord objects have in common. The larger of the two timestamps is designated as the first timestamp and used as the timestamp of the first qualifier field in the aggregated MyRecord object. The value of the qualifier field corresponding to the first timestamp is then used as the value of the first qualifier field in the aggregated MyRecord object. If no shared qualifier field exists, the qualifier field, its value, and its timestamp from the later-selected MyRecord object are appended to the end of the first-selected MyRecord object.

[0097] If the number of MyRecord objects in the keyed stream is greater than or equal to 2, then a new MyRecord object is selected from the keyed stream and compared with the MyRecord object from the previous aggregation. Then, the process jumps to the step of comparing the qualifier field of the two MyRecord objects.

[0098] If the number of MyRecord objects in the keyed stream is equal to 1 at this time, output the aggregated MyRecord object.

[0099] In this embodiment, the MyRecord objects in the filtered event stream are partitioned according to the table name and row ID. This ensures that MyRecord objects with the same table name and row ID are assigned to the same keyed stream, enabling more efficient aggregation operations.

[0100] When aggregating MyRecord objects from an HBase database, the MyRecord objects in the keyed stream must first be parsed into a structured format for subsequent aggregation. During aggregation, two MyRecord objects are selected one after another, in arbitrary order, from the multiple MyRecord objects in the keyed stream. The qualifier fields of the two MyRecord objects, i.e., their column identifiers, are then compared. When the two MyRecord objects have the same qualifier field, there is a conflict. In this case, the larger timestamp of the first qualifier field in the two MyRecord objects is designated as the first timestamp and used as the timestamp of the first qualifier field in the aggregated MyRecord object. At the same time, the value of the qualifier field corresponding to the first timestamp is used as the value of the first qualifier field in the aggregated MyRecord object. Overwriting data that has a smaller timestamp with data that has a larger timestamp ensures that the aggregated MyRecord object holds the newest data.

[0101] When two MyRecord objects do not have the same qualifier field, it means that there is no conflict in the qualifier field of the two MyRecord objects. In this case, you only need to add the qualifier field, the value of the qualifier field, and the timestamp of the qualifier field of the later selected MyRecord object to the end of the first selected MyRecord object.

[0102] After aggregating the two selected MyRecord objects, it is necessary to count the number of MyRecord objects in the keyed stream. If the number of MyRecord objects in the keyed stream is greater than or equal to 2, then two MyRecord objects need to be randomly selected from the multiple MyRecord objects in the keyed stream, and the two selected MyRecord objects need to be aggregated. Then the number of MyRecord objects in the keyed stream is counted again. This process is repeated until only one MyRecord object remains in the keyed stream, at which point the aggregated MyRecord object is output.
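The pairwise aggregation described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of this application: each MyRecord object is reduced to a dictionary that maps a qualifier to a `(value, timestamp)` pair, `functools.reduce` stands in for the Flink reduce function, and on equal timestamps the earlier-selected value is kept (the text does not specify tie handling).

```python
from functools import reduce


def merge_cells(first, later):
    """Merge two records given as {qualifier: (value, timestamp)}."""
    merged = dict(first)  # dicts keep insertion order, so new qualifiers land at the end
    for qualifier, (value, ts) in later.items():
        if qualifier in merged:
            # Conflict: keep the value that carries the larger timestamp.
            if ts > merged[qualifier][1]:
                merged[qualifier] = (value, ts)
        else:
            # No conflict: append qualifier, value, and timestamp.
            merged[qualifier] = (value, ts)
    return merged


def aggregate(window):
    """Fold all records of one keyed window into a single record."""
    return reduce(merge_cells, window)


# Hypothetical records for one row; qualifiers and values are illustrative.
window = [
    {"name": ("a", 10), "age": ("30", 10)},
    {"name": ("b", 12)},
    {"age": ("31", 11), "city": ("x", 11)},
]
result = aggregate(window)
```

Each step folds one more record into the running result, which corresponds to repeating the comparison until a single MyRecord object remains.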

[0103] Optionally, you can use ObjectMapper to convert the MyRecord object in the keyed stream from a JSON structure to a structured format.

[0104] In some embodiments of this application, when multiple data sources contain SQL Server databases, and a scrolling window is used on each keyed stream to aggregate the MyRecord objects on each keyed stream, for the keyed streams divided by the event streams corresponding to the SQL Server databases, the data in the scrolling counting window is aggregated based on the reduce function, according to the operation type and timestamp.

[0105] The size of the count window on each keyed stream can be set to n, meaning that n records are processed each time. This count window is used to aggregate multiple records into one.
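As an illustration of a count window of size n, the following Python sketch splits a sequence of records into consecutive, non-overlapping groups of n records. The function name `count_windows` is hypothetical, and the handling of a trailing partial group (held back in this sketch) is an assumption rather than a statement about the embodiment.

```python
def count_windows(records, n):
    """Yield consecutive, non-overlapping windows of exactly n records."""
    window = []
    for rec in records:
        window.append(rec)
        if len(window) == n:
            yield window
            window = []
    # A trailing partial window is held back in this sketch (assumption).


windows = list(count_windows(range(7), 3))
```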

[0106] When aggregating data in a scrolling count window based on operation type and timestamp, you can select two MyRecord objects sequentially from the scrolling count window and compare their timestamps:

[0107] Check whether the operation type of the MyRecord object with the larger timestamp is deletion.

[0108] If so, the operation type of the aggregated MyRecord object is deletion.

[0109] If not, then check whether the operation type of each MyRecord object in the two MyRecord objects is insertion.

[0110] If at least one of the two MyRecord objects has an operation type of insertion, then the operation type of the aggregated MyRecord object is insertion. Based on the operation type of the aggregated MyRecord object, modify the operation type of the MyRecord object with the larger timestamp, and use the modified MyRecord object with the larger timestamp as the aggregated MyRecord object.

[0111] If neither of the two MyRecord objects has an operation type of insertion, then the operation type of the aggregated MyRecord object is update. Based on the operation type of the aggregated MyRecord object, modify the operation type of the MyRecord object with the larger timestamp, and use the modified MyRecord object with the larger timestamp as the aggregated MyRecord object.

[0112] If the number of MyRecord objects in the scrolling count window is greater than or equal to 2, a new MyRecord object is selected from the scrolling count window and, together with the MyRecord object produced by the previous aggregation, forms the two objects to be compared. The process then jumps back to the step of comparing the timestamps of the two MyRecord objects.

[0113] If the number of MyRecord objects in the scrolling count window is equal to 1 at this time, then the aggregated MyRecord objects will be output.

[0114] Because SQL Server and HBase databases store data change information in different formats, the aggregation process for MyRecord objects from SQL Server differs from that for HBase. When aggregating MyRecord objects from SQL Server, two MyRecord objects are selected one after another, in arbitrary order, from the scrolling count window. Their timestamps are compared and their operation types are checked at the same time. If the MyRecord object with the larger timestamp has a delete operation type, the newer data change information is a deletion. In that case, regardless of whether the data change information before the deletion was an insertion or an update, it is overwritten by the newer deletion, so the aggregated MyRecord object has a delete operation type.

[0115] When the operation type of the MyRecord object with the larger timestamp is not deletion, it is necessary to further determine whether there is a MyRecord object with the operation type of insertion between the two MyRecord objects. If so, the operation type of the aggregated MyRecord object is insertion, and the operation type of the MyRecord object with the larger timestamp is modified to insertion. Then, the modified MyRecord object with the larger timestamp is used as the aggregated MyRecord object. For example, if the operation type of the MyRecord object with the smaller timestamp is insertion, and the operation type of the MyRecord object with the larger timestamp is update, it can be understood as an insertion operation, but the inserted content is the content of the MyRecord object with the larger timestamp. Therefore, the MyRecord object with the larger timestamp whose operation type has been modified is used as the aggregated MyRecord object.

[0116] When the operation type of the MyRecord object with the larger timestamp is not deletion, and the operation type of neither of the two MyRecord objects is insertion, the operation type of the aggregated MyRecord object is update. The operation type of the MyRecord object with the larger timestamp is then modified to update, and the modified MyRecord object with the larger timestamp is used as the aggregated MyRecord object. For example, if both selected MyRecord objects have update operations, it can be understood as an update operation, but the updated content is the content of the MyRecord object with the larger timestamp. The MyRecord object with the larger timestamp can overwrite the MyRecord object with the smaller timestamp, so the MyRecord object with the modified operation type is used as the aggregated MyRecord object.

[0117] After aggregating the two selected MyRecord objects, the number of MyRecord objects in the scrolling count window needs to be counted. If the count is greater than or equal to 2, it means there are still MyRecord objects in the scrolling count window that need to be aggregated. In this case, two MyRecord objects should be randomly selected from the scrolling count window and aggregated. Then, the number of MyRecord objects in the scrolling count window is counted again. If the number of MyRecord objects in the scrolling count window is still greater than or equal to 2, aggregation needs to be performed again until the number of MyRecord objects in the scrolling count window equals 1. When the number of MyRecord objects in the scrolling count window is 1, there are no more MyRecord objects in the scrolling count window that need to be aggregated. At this point, the aggregated MyRecord object is output.
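The operation-type rules for SQL Server can be sketched in Python as follows. This is a minimal illustration, not the implementation of this application: the `MyRecord` dataclass and its three fields are a reduced, hypothetical stand-in, `functools.reduce` stands in for the Flink reduce function, and when two timestamps are equal the later-selected record is treated as the newer one (an assumption).

```python
from dataclasses import dataclass, replace
from functools import reduce


@dataclass
class MyRecord:
    op: str         # "insert", "update", or "delete"
    data: dict      # row content after the operation
    timestamp: int


def merge_ops(a, b):
    """Aggregate two change records for the same row into one."""
    older, newer = (a, b) if a.timestamp <= b.timestamp else (b, a)
    if newer.op == "delete":
        return newer                        # a later delete overrides everything
    if "insert" in (older.op, newer.op):
        return replace(newer, op="insert")  # insert carrying the newest content
    return replace(newer, op="update")      # otherwise an update with newest content


# Hypothetical window contents for one row.
window = [
    MyRecord("insert", {"v": 1}, 1),
    MyRecord("update", {"v": 2}, 2),
    MyRecord("update", {"v": 3}, 3),
]
merged = reduce(merge_ops, window)
deleted = reduce(merge_ops, window + [MyRecord("delete", {}, 4)])
```

In this example an insertion followed by two updates collapses into a single insertion that carries the newest content, while a trailing deletion collapses the whole window into a deletion.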

[0118] In S205, multiple aggregated event streams are merged into one event stream according to time sequence, and the merged event stream is integrated and stored in the second Topic.

[0119] In some embodiments of this application, when integrating and merging the event stream and storing it in a preset Topic, key fields in the merged event stream can be extracted and converted into JSON format; the key fields in JSON format are then stored as strings in a second Topic.

[0120] In the embodiments of this application, multiple event streams are merged into one event stream. Then, the `flatMap` function is applied to integrate the merged event stream. During this process, key information is extracted from the records, and based on the extracted information, JSON-formatted data is constructed and stored as a string in a second Topic. The merged and integrated event stream is presented in JSON format, containing key information about data changes, and can be used for subsequent data distribution.
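A minimal Python sketch of the merge-and-integrate step is given below. It is only illustrative: `heapq.merge` stands in for merging the event streams in time order, a list comprehension stands in for the `flatMap` function, and the field names (`op`, `connector`, `after`, `table`, `row_id`, `timestamp`) are hypothetical labels for the key information. Each input stream is assumed to be already sorted by timestamp.

```python
import heapq
import json

# Hypothetical names for the key fields kept in the output.
KEY_FIELDS = ("op", "connector", "after", "table", "row_id", "timestamp")


def merge_and_serialize(*streams):
    """Merge time-sorted streams and emit one JSON string per record."""
    merged = heapq.merge(*streams, key=lambda rec: rec["timestamp"])
    return [json.dumps({f: rec[f] for f in KEY_FIELDS}) for rec in merged]


def change(connector, ts):
    # Helper that builds an illustrative record with one extra, non-key field.
    return {"op": "update", "connector": connector, "after": {"c": "1"},
            "table": "t", "row_id": "r1", "timestamp": ts, "extra": "dropped"}


hbase_stream = [change("hbase", 1), change("hbase", 5)]
sqlserver_stream = [change("sqlserver", 3)]
messages = merge_and_serialize(hbase_stream, sqlserver_stream)
```

Each resulting string is what would be written to the second Topic in this sketch; fields outside the key information are dropped.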

[0121] Before synchronization, the Flink distributed stream processing engine is used to process the acquired data change information. Data change information from the same data source is filtered and aggregated, and data change information from heterogeneous data sources is merged. This reduces unnecessary data load in the subsequent distribution process and can significantly improve the execution efficiency of the downstream synchronization and reproduction program in massive data scenarios.

[0122] In S206, the data change information in the second topic is distributed to the target database.

[0123] The target database includes multiple databases.

[0124] In this embodiment of the application, after the constructed JSON format data is stored as a string in the second topic, the JSON format data in the second topic can be distributed to the downstream target database.

[0125] Downstream target databases start Kafka consumers to subscribe to and consume the integrated data change information stored in the second topic. Different target databases will choose to subscribe to and reproduce different data change information according to their needs. By default, after checking the correctness of the change information, it will directly reproduce it in the local database, thereby completing the synchronization of the changed data.
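The selective consumption on the target side might look like the following Python sketch. It deliberately leaves out the Kafka consumer itself and works on a list of message strings. The required field set, the function name `select_changes`, and the notion of "correctness" used here (valid JSON containing the required fields) are all assumptions made for illustration.

```python
import json

# Hypothetical minimum set of fields a change message must carry.
REQUIRED = {"op", "table", "row_id", "timestamp"}


def select_changes(messages, wanted_tables):
    """Keep well-formed change messages for the tables this target subscribes to."""
    selected = []
    for raw in messages:
        try:
            change = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON: skip
        if not isinstance(change, dict) or not REQUIRED.issubset(change):
            continue  # missing required fields: skip
        if change["table"] in wanted_tables:
            selected.append(change)
    return selected


messages = [
    '{"op": "insert", "table": "orders", "row_id": "1", "timestamp": 1}',
    "not json",
    '{"op": "insert", "table": "users", "row_id": "2", "timestamp": 2}',
    '{"table": "orders"}',
]
to_apply = select_changes(messages, {"orders"})
```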

[0126] Figure 4 is a schematic flowchart of a heterogeneous data source synchronization method provided in another embodiment of this application. Referring to Figure 4, first configure the environment required for the Flink program to run. This includes configuring the IP addresses and ports of the data sources to be processed, the Kafka topics to be received, the size of the subsequent scrolling window, and the breakpoint recovery strategy. After configuring the environment, compile and package the configured Flink program into an executable JAR file, upload the JAR file to the cloud server, and start it. Then start the incremental data synchronization tool to extract data change information from the database logs of the heterogeneous data sources and store the extracted data change information in the first topic. Next, using Kafka consumers and the Flink distributed stream processing engine, the data change information from each data source is consumed and converted into a uniform format. The uniformly formatted data change information is then filtered and aggregated separately for each data source. The aggregated data change information from the multiple data sources is merged and integrated into new data change information, which is stored in a second topic to await consumption by downstream databases. Finally, the downstream target databases use Kafka consumers to subscribe to and consume the processed data change information of the heterogeneous data sources and reproduce these changes in real time, completing the data synchronization of the heterogeneous data sources.

[0127] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0128] Corresponding to the heterogeneous data source synchronization method described in the above embodiments, Figure 5 shows a structural block diagram of a heterogeneous data source synchronization system provided in an embodiment of this application. For ease of explanation, only the parts related to the embodiments of this application are shown.

[0129] Referring to Figure 5, the heterogeneous data source synchronization system in this embodiment may include a data extraction module 501, a data conversion module 502, a data processing module 503, and a data distribution module 504. The data processing module 503 is used to process the data change information extracted from the heterogeneous data sources. This module works based on the Flink distributed stream processing engine and the Kafka message middleware. The data processing module 503 includes a deserializer 505 (a deserialization schema), a single stream processor 506, and a merge processor 507.

[0130] The data extraction module 501 is used to obtain data change information from database logs in heterogeneous data sources and store the data change information into the first Topic respectively; the heterogeneous data source includes multiple heterogeneous data sources.

[0131] The data conversion module 502 is used to obtain the data change information in the first topic and, based on the Flink distributed stream processing engine, package the data change information in the first topic into multiple event streams; wherein the multiple event streams correspond one-to-one with the multiple data sources, and each event stream represents the data change information of the corresponding data source.

[0132] Deserializer 505 is used to convert data change information in each event stream into MyRecord objects.

[0133] A single stream processor 506 is used to filter empty MyRecord objects in each event stream; based on the table name and row ID, the MyRecord objects in the filtered event stream are partitioned to form multiple keyed streams, and a scrolling window is used on each keyed stream to aggregate the MyRecord objects in each keyed stream.

[0134] The merge processor 507 is used to merge multiple aggregated event streams into one event stream in chronological order, integrate the merged event stream, and store it in a second Topic.

[0135] The data distribution module 504 is used to distribute data change information in the second topic to the target database.

[0136] Optionally, deserializer 505 is specifically used for: converting data change information into JSON strings for each event stream; parsing the JSON strings into JsonNode objects based on the Jackson library; extracting key fields from the JsonNode objects based on the JSON structure, including operation type, connector type, data after operation, table name, row ID, and timestamp; and constructing a MyRecord object based on the extracted key fields and the corresponding data content.
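The deserialization steps can be sketched in Python as follows. The embodiment itself parses with the Jackson library into JsonNode objects; here Python's standard `json` module stands in for it. The flat JSON layout and the field names are hypothetical, since the actual JSON structure of the change messages is not given in the text.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical names for the six key fields.
FIELDS = ("op", "connector", "after", "table", "row_id", "timestamp")


@dataclass
class MyRecord:
    op: str          # operation type
    connector: str   # connector type
    after: dict      # data after the operation
    table: str       # table name
    row_id: str      # row ID
    timestamp: int


def to_my_record(raw: str) -> Optional[MyRecord]:
    """Parse a JSON string and build a MyRecord, or return None if it is unusable."""
    try:
        node = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(node, dict) or any(f not in node for f in FIELDS):
        return None
    return MyRecord(**{f: node[f] for f in FIELDS})


raw = ('{"op": "update", "connector": "sqlserver", "after": {"id": 7}, '
       '"table": "orders", "row_id": "7", "timestamp": 100}')
record = to_my_record(raw)
```

Returning `None` for unusable input is a design choice of this sketch; it lets a later step filter out empty records.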

[0137] Optionally, the single stream processor 506 may include an HBase single stream processor and a SQL Server single stream processor.

[0138] Specifically, the HBase single-stream processor is used to aggregate MyRecord objects in a time-based scrolling window based on the timestamp and qualifier field, according to the keyed stream divided by the event stream corresponding to the HBase database, using the reduce function; the qualifier field is the identifier of the column in the HBase database.
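How records fall into a time-based scrolling window can be illustrated with the short Python sketch below. It assumes fixed-size, non-overlapping windows aligned to multiples of the window size and integer millisecond timestamps; the function name and the record layout are hypothetical.

```python
from collections import defaultdict


def assign_tumbling_windows(records, size_ms):
    """Map each record to the start time of the fixed-size window containing it."""
    windows = defaultdict(list)
    for rec in records:
        start = rec["timestamp"] - rec["timestamp"] % size_ms
        windows[start].append(rec)
    return dict(sorted(windows.items()))


# Hypothetical records carrying only a timestamp in milliseconds.
records = [{"timestamp": t} for t in (0, 999, 1000, 2500)]
windows = assign_tumbling_windows(records, 1000)
```

The records of each window would then be aggregated independently, one aggregated MyRecord object per window and key.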

[0139] Optionally, the HBase single-stream processor is specifically used for: parsing the MyRecord objects in the keyed stream into a structured format; selecting two MyRecord objects sequentially from the keyed stream, and comparing the qualifier fields in the two MyRecord objects to determine whether they have the same qualifier field; if so, comparing the timestamps of the first qualifier field in the two MyRecord objects, where the first qualifier field is the qualifier field common to both MyRecord objects, recording the larger timestamp of the first qualifier field as the first timestamp and using it as the timestamp of the first qualifier field in the aggregated MyRecord object, and using the value of the qualifier field corresponding to the first timestamp as the value of the first qualifier field in the aggregated MyRecord object; if not, appending the qualifier field, its value, and its timestamp from the later-selected MyRecord object to the end of the first-selected MyRecord object; if the number of MyRecord objects in the keyed stream is greater than or equal to 2, selecting a new MyRecord object from the keyed stream, comparing it with the previously aggregated MyRecord object, and jumping back to the step of comparing the qualifier fields of the two MyRecord objects; and if the number of MyRecord objects in the keyed stream is equal to 1, outputting the aggregated MyRecord object.

[0140] Optionally, the SQL Server single stream processor is specifically used to: aggregate data in the scrolling count window based on the reduce function, according to the operation type and timestamp, for the keyed streams divided by the event stream corresponding to the SQL Server database.

[0141] Optionally, the SQL Server single stream processor is specifically used to: select two MyRecord objects sequentially from the scrolling count window, compare the timestamps of the two MyRecord objects, and detect whether the operation type of the MyRecord object with the larger timestamp is deletion.

[0142] If so, the operation type of the aggregated MyRecord object is deletion.

[0143] If not, then check whether the operation type of each MyRecord object in the two MyRecord objects is insertion.

[0144] If at least one of the two MyRecord objects has an operation type of insertion, then the operation type of the aggregated MyRecord object is insertion. Based on the operation type of the aggregated MyRecord object, modify the operation type of the MyRecord object with the larger timestamp, and use the modified MyRecord object with the larger timestamp as the aggregated MyRecord object.

[0145] If neither of the two MyRecord objects has an operation type of insertion, then the operation type of the aggregated MyRecord object is update. Based on the operation type of the aggregated MyRecord object, modify the operation type of the MyRecord object with the larger timestamp, and use the modified MyRecord object with the larger timestamp as the aggregated MyRecord object.

[0146] If the number of MyRecord objects in the scrolling count window is greater than or equal to 2, a new MyRecord object is selected from the scrolling count window and, together with the MyRecord object produced by the previous aggregation, forms the two objects to be compared. The process then jumps back to the step of comparing the timestamps of the two MyRecord objects.

[0147] If the number of MyRecord objects in the scrolling count window is equal to 1 at this time, then the aggregated MyRecord objects will be output.

[0148] Optionally, the merge processor 507 is specifically used to: extract key fields from the merged event stream and convert them into JSON format; key fields include operation type, connector type, data after operation, table name, row ID, and timestamp; and store the key fields in JSON format as strings in a second Topic.

[0149] Figure 6 is a schematic diagram of a heterogeneous data source synchronization system provided in another embodiment of this application. When the upstream databases are HBase and SQL Server, the data extraction module uses the Debezium incremental data synchronization tool to extract the change logs of the SQL Server database, and the captured data change information is stored in a designated Topic for consumption. A self-developed real-time extraction program is used to extract the change logs of the HBase database, and the captured data change information is likewise stored in a designated Topic for consumption. The deserializer then obtains the data change information of the HBase and SQL Server databases from the Topic and converts it into a unified format. The uniformly formatted data change information from the HBase and SQL Server databases is input into the respective single stream processors, which filter and aggregate it and pass the aggregated information to the merge processor. The merge processor combines the aggregated data change information from the HBase and SQL Server databases into new data change information, which is then stored in a second topic, awaiting consumption by the MySQL and Oracle databases.

[0150] It should be noted that the information interaction and execution process between the above-mentioned devices / units are based on the same concept as the method embodiments of this application. For details on their specific functions and technical effects, please refer to the method embodiments section, and they will not be repeated here.

[0151] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the system can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0152] This application also provides an electronic device. Referring to Figure 7, the electronic device 700 may include: at least one processor 710, a memory 720, and a computer program stored in the memory 720 and executable on the at least one processor 710. When the processor 710 executes the computer program, it implements the steps in any of the above-described method embodiments, for example, S201 to S206 in the embodiment shown in Figure 2. Alternatively, when the processor 710 executes the computer program, it implements the functions of each module / unit in the above system embodiments, for example, the functions of the modules shown in Figure 5.

[0153] For example, a computer program may be divided into one or more modules / units, one or more of which are stored in memory 720 and executed by processor 710 to complete this application. The one or more modules / units may be a series of computer program segments capable of performing a specific function, which describe the execution process of the computer program in electronic device 700.

[0154] Those skilled in the art will understand that Figure 7 is merely an example of an electronic device and does not constitute a limitation on the electronic device. The electronic device may include more or fewer components than shown, or combinations of certain components, or different components, such as input / output devices, network access devices, buses, etc.

[0155] The processor 710 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor.

[0156] The memory 720 can be an internal storage unit of the electronic device or an external storage device, such as a plug-in hard drive, a smart media card (SMC), a secure digital (SD) card, or a flash card. The memory 720 is used to store the computer program and other programs and data required by the electronic device. The memory 720 can also be used to temporarily store data that has been output or will be output.

[0157] The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses shown in the accompanying drawings are not limited to a single bus or a single type of bus.

[0158] The heterogeneous data source synchronization method provided in this application can be applied to terminal devices such as computers, wearable devices, in-vehicle devices, tablet computers, laptops, netbooks, personal digital assistants (PDAs), augmented reality (AR) / virtual reality (VR) devices, and mobile phones. This application does not impose any restrictions on the specific type of terminal device.

[0159] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps described in the various embodiments of the heterogeneous data source synchronization method.

[0160] This application provides a computer program product that, when run on a mobile terminal, enables the mobile terminal to implement the steps described in the various embodiments of the heterogeneous data source synchronization method.

[0161] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include at least: any entity or device capable of carrying the computer program code to a photographing device / terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks.

[0162] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0163] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0164] In the embodiments provided in this application, it should be understood that the disclosed apparatus / network devices and methods can be implemented in other ways. For example, the apparatus / network device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.

[0165] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0166] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims

1. A method for synchronizing heterogeneous data sources, characterized in that it comprises:
obtaining data change information from database logs in heterogeneous data sources, and storing the data change information into a first Topic respectively, wherein the heterogeneous data sources comprise multiple data sources;
obtaining the data change information in the first Topic and, based on the Flink distributed stream processing engine, packaging the data change information in the first Topic into multiple event streams, wherein the multiple event streams correspond one-to-one with the multiple data sources, and each event stream represents the data change information of the corresponding data source;
converting the data change information in each event stream into MyRecord objects;
filtering out empty MyRecord objects in each event stream; partitioning the MyRecord objects in the filtered event stream according to table name and row ID to form multiple keyed streams; and using a scrolling window on each keyed stream to aggregate the MyRecord objects on that keyed stream;
merging the aggregated event streams into a single event stream in chronological order, and storing the merged event stream in a second Topic;
distributing the data change information in the second Topic to a target database;
wherein, if an HBase database exists among the multiple data sources, using a scrolling window on each keyed stream to aggregate the MyRecord objects includes: for the keyed streams partitioned from the event stream corresponding to the HBase database, aggregating the MyRecord objects in a time-based scrolling window based on a reduce function, according to the timestamp and the qualifier field, the qualifier field being the identifier of a column in the HBase database;
and wherein aggregating the MyRecord objects in the time-based scrolling window according to the timestamp and the qualifier field includes:
parsing the MyRecord objects in the keyed stream into a structured format;
selecting two MyRecord objects sequentially from the keyed stream;
comparing the qualifier fields in the two MyRecord objects to determine whether the two MyRecord objects have a qualifier field in common;
if so, comparing the timestamps of a first qualifier field in the two MyRecord objects, the first qualifier field being the qualifier field shared by the two MyRecord objects; recording the larger of the two timestamps as a first timestamp and using it as the timestamp of the first qualifier field in the aggregated MyRecord object; and using the value of the qualifier field corresponding to the first timestamp as the value of the first qualifier field in the aggregated MyRecord object;
if not, appending the qualifier field, its value, and its timestamp from the later-selected MyRecord object to the end of the earlier-selected MyRecord object;
if the number of MyRecord objects in the keyed stream is greater than or equal to 2, selecting a new MyRecord object from the keyed stream, taking it together with the previously aggregated MyRecord object as the two objects to be compared, and jumping to the step of comparing the qualifier fields in the two MyRecord objects;
if the number of MyRecord objects in the keyed stream is equal to 1 at this time, outputting the aggregated MyRecord object.
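The qualifier-level aggregation recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it uses plain Python and `functools.reduce` in place of a Flink reduce function, and it assumes that each parsed MyRecord object can be modelled as a dictionary mapping a qualifier to a `(value, timestamp)` pair. The name `merge_qualifiers`, the sample window contents, and the rule that the earlier value is kept when two timestamps are equal are all assumptions made for illustration; the claim does not address ties.

```python
from functools import reduce

def merge_qualifiers(earlier, later):
    """Pairwise step: merge two parsed records, qualifier by qualifier.

    Each record is modelled (assumption) as {qualifier: (value, timestamp)}.
    """
    merged = dict(earlier)  # dicts preserve insertion order
    for qualifier, (value, ts) in later.items():
        if qualifier in merged:
            # Shared qualifier: the entry with the larger timestamp wins.
            # On a tie the earlier entry is kept (assumption).
            if ts > merged[qualifier][1]:
                merged[qualifier] = (value, ts)
        else:
            # Qualifier only in the later record: append it at the end.
            merged[qualifier] = (value, ts)
    return merged

# Hypothetical contents of one time-based window on one keyed stream.
window = [
    {"name": ("alice", 100), "age": ("30", 100)},
    {"age": ("31", 120)},
    {"name": ("al", 90), "city": ("x", 110)},
]

# reduce() repeats the pairwise step until a single record remains.
aggregated = reduce(merge_qualifiers, window)
```

Because the pairwise step is applied to the running result and the next record, the loop "select a new MyRecord object and compare it with the previously aggregated one" in the claim corresponds to the fold performed by `reduce`.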

2. The method as described in claim 1, characterized in that converting the data change information in each event stream into MyRecord objects includes:
for each event stream, converting the data change information into a JSON string;
parsing the JSON string into a JsonNode object based on the Jackson library;
extracting key fields from the JsonNode object based on the JSON structure, the key fields including operation type, connector type, data after the operation, table name, row ID, and timestamp;
constructing a MyRecord object based on the extracted key fields and the corresponding data content.
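The conversion in claim 2 can be sketched as follows. The claim names the Jackson library and JsonNode objects; this sketch swaps in Python's standard `json` module so that it stays self-contained. The JSON layout (`op`, `source.connector`, `source.table`, `after`, `row_id`, `ts_ms`), the `MyRecord` field names, and the function name `to_my_record` are hypothetical, since the claim lists the key fields but not how they are laid out in the message.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class MyRecord:
    op: str                 # operation type
    connector: str          # connector type
    after: Optional[dict]   # data after the operation
    table: str              # table name
    row_id: str             # row ID
    ts: int                 # timestamp

def to_my_record(json_string: str) -> MyRecord:
    """Parse one change message and pick out the six key fields."""
    node = json.loads(json_string)  # stands in for Jackson's JsonNode
    return MyRecord(
        op=node["op"],
        connector=node["source"]["connector"],
        after=node.get("after"),
        table=node["source"]["table"],
        row_id=str(node["row_id"]),
        ts=node["ts_ms"],
    )

# Hypothetical change message for one row of an "orders" table.
message = (
    '{"op": "update", "source": {"connector": "sqlserver", "table": "orders"},'
    ' "after": {"id": 7, "qty": 3}, "row_id": 7, "ts_ms": 1700000000000}'
)
record = to_my_record(message)
```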

3. The method as described in claim 1, characterized in that, if an SQL Server database exists among the multiple data sources, using a scrolling window on each keyed stream to aggregate the MyRecord objects includes: for the keyed streams partitioned from the event stream corresponding to the SQL Server database, aggregating the data in a scrolling count window based on a reduce function, according to operation type and timestamp.

4. The method as described in claim 3, characterized in that aggregating the data in the scrolling count window according to operation type and timestamp includes:
selecting two MyRecord objects sequentially from the scrolling count window and comparing the timestamps of the two MyRecord objects;
checking whether the operation type of the MyRecord object with the larger timestamp is deletion;
if so, setting the operation type of the aggregated MyRecord object to deletion;
if not, checking whether the operation type of each of the two MyRecord objects is insertion;
if the operation type of at least one of the two MyRecord objects is insertion, setting the operation type of the aggregated MyRecord object to insertion, modifying the operation type of the MyRecord object with the larger timestamp accordingly, and using the modified MyRecord object with the larger timestamp as the aggregated MyRecord object;
if neither of the two MyRecord objects has the operation type insertion, setting the operation type of the aggregated MyRecord object to update, modifying the operation type of the MyRecord object with the larger timestamp accordingly, and using the modified MyRecord object with the larger timestamp as the aggregated MyRecord object;
if the number of MyRecord objects in the scrolling count window is greater than or equal to 2, selecting a new MyRecord object from the scrolling count window, taking it together with the previously aggregated MyRecord object as the two objects to be compared, and jumping to the step of comparing the timestamps of the two MyRecord objects;
if the number of MyRecord objects in the scrolling count window is equal to 1 at this time, outputting the aggregated MyRecord object.
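The decision rule of claim 4 can be written as a pairwise function and folded over a window. This is a minimal sketch in plain Python rather than a Flink reduce function. The `MyRecord` fields, the operation-type strings `"insert"`, `"update"`, and `"delete"`, and the function name `merge_ops` are assumptions. Two further points the claim leaves open are settled here by assumption: when the newer record is a deletion, that record itself is returned as the aggregate, and when two timestamps are equal, the later-selected record is treated as the newer one.

```python
from dataclasses import dataclass, replace
from functools import reduce

@dataclass(frozen=True)
class MyRecord:
    op: str      # "insert", "update" or "delete" (assumed encoding)
    ts: int      # timestamp
    after: dict  # data after the operation

def merge_ops(a: MyRecord, b: MyRecord) -> MyRecord:
    """Pairwise step: collapse two changes to the same row into one."""
    # On equal timestamps the later-selected record counts as newer (assumption).
    older, newer = (a, b) if b.ts >= a.ts else (b, a)
    if newer.op == "delete":
        return newer                        # newest change is a deletion
    if "insert" in (older.op, newer.op):
        return replace(newer, op="insert")  # row was created inside the window
    return replace(newer, op="update")      # otherwise the net effect is an update

# Hypothetical count window holding three changes to one row.
window = [
    MyRecord("insert", 1, {"qty": 1}),
    MyRecord("update", 2, {"qty": 2}),
    MyRecord("update", 3, {"qty": 5}),
]
aggregated = reduce(merge_ops, window)
```

The aggregated record keeps the newest data but carries the operation type insertion, so the target database receives a single insert of the final row state instead of one insert followed by two updates.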

5. The method as described in claim 1, characterized in that integrating the merged event stream and storing it in a preset Topic includes:
extracting key fields from the merged event stream and converting them into JSON format, the key fields including operation type, connector type, data after the operation, table name, row ID, and timestamp;
storing the key fields in JSON format as strings in the second Topic.
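The serialization step of claim 5 can be sketched as below. The sketch only builds the JSON string; writing it to the second Topic would require a Kafka producer, which is omitted. The `MyRecord` field names and the function name `to_topic_message` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class MyRecord:
    op: str                 # operation type
    connector: str          # connector type
    after: Optional[dict]   # data after the operation
    table: str              # table name
    row_id: str             # row ID
    ts: int                 # timestamp

def to_topic_message(record: MyRecord) -> str:
    """Convert the key fields of a merged record into a JSON string."""
    # ensure_ascii=False keeps non-ASCII column values readable in the message.
    return json.dumps(asdict(record), ensure_ascii=False)

# Hypothetical aggregated record taken from the merged event stream.
merged_record = MyRecord("insert", "hbase", {"name": "alice"}, "users", "row-1", 120)
topic_message = to_topic_message(merged_record)
```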

6. A heterogeneous data source synchronization system, characterized in that it comprises:
a data extraction module, used to obtain data change information from database logs in heterogeneous data sources and store the data change information into a first Topic respectively, wherein the heterogeneous data sources comprise multiple data sources;
a data transformation module, used to obtain the data change information in the first Topic and, based on the Flink distributed stream processing engine, package the data change information in the first Topic into multiple event streams, wherein the multiple event streams correspond one-to-one with the multiple data sources, and each event stream represents the data change information of the corresponding data source;
a deserializer, used to convert the data change information in each event stream into MyRecord objects;
a single stream processor, used to filter out empty MyRecord objects in each event stream, partition the MyRecord objects in the filtered event stream according to table name and row ID to form multiple keyed streams, and use a scrolling window on each keyed stream to aggregate the MyRecord objects on that keyed stream;
a merge processor, used to merge the multiple aggregated event streams into one event stream in chronological order, integrate the merged event stream, and store it in a second Topic;
a data distribution module, used to distribute the data change information in the second Topic to a target database;
wherein, if an HBase database exists among the multiple data sources, the single stream processor is specifically used to: for the keyed streams partitioned from the event stream corresponding to the HBase database, aggregate the MyRecord objects in a time-based scrolling window based on a reduce function, according to the timestamp and the qualifier field, the qualifier field being the identifier of a column in the HBase database;
and wherein, if an HBase database exists among the multiple data sources, the single stream processor is further specifically used to:
parse the MyRecord objects in the keyed stream into a structured format;
select two MyRecord objects sequentially from the keyed stream;
compare the qualifier fields in the two MyRecord objects to determine whether the two MyRecord objects have a qualifier field in common;
if so, compare the timestamps of a first qualifier field in the two MyRecord objects, the first qualifier field being the qualifier field shared by the two MyRecord objects; record the larger of the two timestamps as a first timestamp and use it as the timestamp of the first qualifier field in the aggregated MyRecord object; and use the value of the qualifier field corresponding to the first timestamp as the value of the first qualifier field in the aggregated MyRecord object;
if not, append the qualifier field, its value, and its timestamp from the later-selected MyRecord object to the end of the earlier-selected MyRecord object;
if the number of MyRecord objects in the keyed stream is greater than or equal to 2, select a new MyRecord object from the keyed stream, take it together with the previously aggregated MyRecord object as the two objects to be compared, and jump to the step of comparing the qualifier fields in the two MyRecord objects;
if the number of MyRecord objects in the keyed stream is equal to 1 at this time, output the aggregated MyRecord object.

7. An electronic device comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, characterized in that, when the processor executes the computer program, it implements the method as described in any one of claims 1 to 5.

8. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the method as described in any one of claims 1 to 5.