Real-time compression and consistency checking platform for multi-source data based on stream computing

By monitoring data traffic and verifying the adaptability of compression algorithms, dynamically adjusting data blocks, and combining multi-source data consistency verification, the problem of insufficient adaptability between multi-source data processing strategies and real-time traffic characteristics in high-concurrency financial business scenarios has been solved, achieving efficient data entry and verification.

CN121770532B · Active · Publication Date: 2026-06-19 · BEIJING RONGJIA HECHUANG TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING RONGJIA HECHUANG TECHNOLOGY CO LTD
Filing Date
2026-03-02
Publication Date
2026-06-19


Abstract

This invention discloses a real-time compression and consistency verification platform for multi-source data based on streaming computing, belonging to the field of streaming data processing. The platform monitors data traffic and, based on the monitoring results, evaluates whether to trigger compression algorithm adaptability verification. Once triggered, it performs the adaptability verification and, based on the verification results, evaluates whether to trigger a dynamic selection mechanism for compression algorithms and a dynamic adjustment mechanism for data chunking. After that evaluation and triggering are complete, it performs multi-source data consistency verification and, based on the consistency verification results, evaluates and initiates a multi-threaded batch data entry mechanism and a parallel verification task scheduling mechanism. This solves the technical problem that, in existing technologies, the multi-source data processing strategy is insufficiently adapted to real-time traffic characteristics and core data features, so that, as multi-channel data is accessed in parallel in streaming computing scenarios, processing efficiency shortcomings become increasingly prominent, data entry latency increases, and the adaptability and efficiency of data entry processing are limited.

Description

Technical Field

[0001] This invention relates to the field of streaming data processing technology, and in particular to a platform for real-time compression and consistency verification of multi-source data based on streaming computing.

Background Technology

[0002] Against the backdrop of rapid development of the digital economy and the in-depth advancement of the national information security strategy, the scale of business data in key areas such as treasury, financial clearing, power dispatching, and transportation continues to expand. The data sources cover multiple channels, including business systems, voucher repositories, government networks, and banking networks, exhibiting significant characteristics of multi-source heterogeneity, real-time generation, and high security requirements. Traditional data processing models face prominent problems such as rising storage costs, insufficient transmission efficiency, and difficulty in ensuring data consistency. These issues not only affect the timeliness of business processing but also pose challenges to data security and independent controllability. Therefore, conducting real-time compression and consistency verification of multi-source data based on streaming computing has become an inevitable requirement for improving data processing efficiency, reducing resource consumption, and ensuring data credibility.

[0003] The existing related technologies mainly implement the following process: First, through data access and integration components (such as message middleware), the access and standardization conversion of multi-source heterogeneous data (such as business system transaction data, electronic voucher data, banking business transaction data, etc.) are completed, integrating data from different channels and formats into unified streaming data, and uploading the streaming data to the streaming data processing center; Next, based on general compression algorithms, such as LZ4 (Lempel-Ziv 4) and ZSTD (Zstandard), the standardized streaming data is segmented and compressed; Subsequently, based on lightweight hashing or checksums, the consistency between the original data and the compressed data is verified; Finally, the compressed and verified data is transmitted to downstream business (such as treasury revenue and expenditure accounting, fund clearing, reconciliation business, report statistical analysis, etc.) or storage nodes to support subsequent business processing and data analysis.
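
The conventional pipeline described above (fixed-size segmentation, general-purpose compression, then a lightweight hash check) can be illustrated with a minimal sketch. It is not taken from any cited system: Python's standard `zlib` stands in for LZ4 or ZSTD, which are not part of the standard library, and the function names, the 1024-byte block size, and the sample record are assumptions made for illustration.

```python
import hashlib
import zlib


def chunk(data: bytes, block_size: int) -> list:
    """Split a byte stream into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_block(block: bytes) -> tuple:
    """Compress one block and record a SHA-256 digest of the original bytes."""
    # zlib is a stand-in here for a general compressor such as LZ4 or ZSTD.
    return zlib.compress(block), hashlib.sha256(block).hexdigest()


def verify_block(compressed: bytes, digest: str) -> bool:
    """Decompress the block and compare its digest with the recorded one."""
    return hashlib.sha256(zlib.decompress(compressed)).hexdigest() == digest


# Hypothetical standardized stream: 200 repetitions of one 27-byte record.
stream = b"voucher_id=1;amount=100.00;" * 200
records = [compress_block(b) for b in chunk(stream, 1024)]
all_ok = all(verify_block(c, d) for c, d in records)
```

Because the block size and the algorithm are fixed in this sketch, it shows the rigidity that the following paragraphs identify as the shortcoming of existing approaches.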

[0004] Existing technologies related to real-time processing of multi-source data have formed an implementation system with a distributed streaming computing framework at its core and multiple components working together. For example, the Chinese invention patent application with publication number CN120653680A describes a high-performance AIoT real-time data processing method and apparatus based on Flink, Kafka, and Doris. This method includes: constructing an end-to-end layered architecture; capturing changes in multi-source databases in real time through Flink CDC; parsing semi-structured data and generating standardized data streams using Apache NiFi; implementing message routing based on business domain-based KafkaTopics; ensuring data structure consistency using Schema Registry; optimizing transmission efficiency through dynamic compression algorithms; and implementing window aggregation, asynchronous dimension table association, and incremental state persistence strategies using the Flink real-time computing engine.

[0005] Existing multi-source data real-time processing technologies have leveraged distributed streaming computing frameworks such as Flink, combining multiple components such as data capture, parsing, routing, compression, and computation to construct an end-to-end layered processing system. This system enables standardized transformation, transmission optimization, and state persistence of multi-source data, providing fundamental support for real-time data processing. However, it lacks sufficient adaptation to the sudden fluctuations and channel differences in multi-source data during high-concurrency financial business scenarios (such as treasury non-tax revenue scenarios), and no targeted optimization solutions have yet been developed. This leads to the following problems:

[0006] In real-time data entry scenarios for financial transactions (such as treasury non-tax revenue), multi-source revenue data originates from various channels, including specialized business systems, government transmission networks, and fund transfer records from partner institutions. The data volume surges suddenly during peak payment periods, and redundancy and field density differ significantly across channels. Existing technologies typically employ fixed block sizes and a single compression algorithm, which can produce excessively large data blocks during peak payment periods. This results in increased compression time, low compression ratios for low-redundancy data, data transmission bandwidth congestion, and a surge in storage resource consumption, and ultimately leads to excessive verification latency and untimely data statistics during revenue data consistency checks and data entry. In short, the multi-source data processing strategy is insufficiently adapted to real-time traffic characteristics and core data features. Furthermore, as multi-channel data is accessed in parallel within streaming computing scenarios, the processing efficiency bottleneck becomes increasingly apparent and amplified, increasing data entry latency and limiting the adaptability and efficiency of data entry processing.

Summary of the Invention

[0007] To address the low processing efficiency of existing technologies, this invention provides a multi-source data real-time compression and consistency verification platform based on streaming computing, comprising: a data traffic monitoring module, a compression algorithm adaptability verification module, and a multi-source data consistency verification module. The data traffic monitoring module monitors data traffic and obtains monitoring results. Based on the monitoring results, it evaluates and triggers a compression algorithm adaptability verification to check the matching degree between the current compression algorithm and real-time traffic characteristics. The compression algorithm adaptability verification module, after being triggered based on the monitoring results, performs compression algorithm adaptability verification, obtains the verification results, and evaluates and triggers a dynamic selection of compression algorithms and a dynamic adjustment mechanism for data chunking based on the verification results. The multi-source data consistency verification module, after evaluation based on the verification results and after triggering, performs multi-source data consistency verification, obtains the verification results, and initiates a multi-threaded batch data entry mechanism and a parallel verification task scheduling mechanism based on the verification results.

[0008] The beneficial effects of the technical solutions provided in the embodiments of the present invention include at least the following:

[0009] 1. The multi-source data real-time compression and consistency verification platform based on streaming computing provided by this invention monitors data traffic, obtains the monitoring results, and, based on those results, evaluates whether to trigger compression algorithm adaptability verification, which checks how well the current compression algorithm matches real-time traffic characteristics. This captures the dynamic changes in data traffic from each channel, promptly identifies sudden traffic increases during peak business periods, and provides the traffic-characteristic basis for the subsequent adaptability judgment. Once triggered, the platform performs compression algorithm adaptability verification, obtains the verification results, and, based on them, evaluates whether to trigger a dynamic selection mechanism for compression algorithms and a dynamic adjustment mechanism for data blocks. This matches the compression algorithm to real-time traffic characteristics and core data characteristics. At the same time, dynamically adjusting the data block size accommodates the sudden data volume of peak business periods, reducing the resource consumption and time consumption of compression and improving the real-time compression efficiency of multi-source data. After that evaluation and triggering are complete, multi-source data consistency verification is performed and its results are obtained. Based on these results, a multi-threaded batch data entry mechanism and a parallel verification task scheduling mechanism are initiated, which helps ensure the integrity, accuracy, and consistency of the compressed multi-source data. In addition, differentiated batch division and priority scheduling improve the parallelism with which qualified data blocks are entered, reduce verification latency and data entry scheduling overhead, and improve data entry processing efficiency. This effectively solves the technical problem that, in existing technologies, the multi-source data processing strategy is insufficiently adapted to real-time traffic characteristics and core data characteristics, so that, as multi-channel data is accessed in parallel in streaming computing scenarios, processing efficiency shortcomings become increasingly prominent, data entry latency increases, and the adaptability and efficiency of data entry processing are limited.

[0010] 2. Existing technologies monitor traffic with a single indicator and collect only one-dimensional traffic data, which delays the identification of traffic surge trends. This invention addresses that problem by acquiring three core traffic indicators for each channel: the average access rate, the peak access rate, and the cumulative volume of multi-source data. Together these indicators capture how data traffic changes and provide multi-dimensional, reliable quantitative evidence for determining traffic surges, improving the timeliness and accuracy of surge identification. For each channel, if its core traffic indicators meet at least one of the surge judgment conditions, the corresponding surge state counter is incremented and the accumulated surge state value is obtained, which is then used to determine whether the surge is valid. If the core traffic indicators meet none of the surge judgment conditions, processing proceeds to the multi-source data consistency verification stage. This solves the problem that existing technologies determine surges solely by comparing a single indicator against a threshold, without considering the sustained nature of a surge, and therefore easily misjudge instantaneous traffic fluctuations as valid surges. By accumulating the surge state counter and determining surge validity, this solution verifies that the surge state persists. It effectively filters out invalid judgments caused by instantaneous traffic fluctuations and responds only to sustained surges, reducing the waste of system resources, improving the overall efficiency of multi-source data processing, and ensuring the stability and timeliness of data processing in both peak and normal business scenarios.

[0011] 3. When multi-source data is in peak or super-peak business conditions and the original data backup of a single data block is lost or has timed out, routine verification operations that rely on the backup, such as core field comparison and missing field filling, cannot be completed. Continuing to execute the repair process of Embodiment 1 in that case may significantly increase verification latency and distort verification results, so an alternative solution for data block integrity repair is needed. For a data block that fails the multi-source data integrity indicators, it is determined whether the number of missing or abnormally truncated key fields in the data block exceeds the preset field number threshold. If so, an invalid-repair prompt indicating that the number of missing key fields exceeds the limit is sent; otherwise, field completion values are generated for repairing the data block. This differentiated processing directly terminates invalid repairs and issues a notification for severely damaged data, reducing resource waste and unnecessary time consumption. For repairable data, it generates completion values that provide a compliant and reliable basis for the subsequent completion operations, which suits the resource-constrained conditions of peak and super-peak business periods. Missing key fields are filled in based on the field completion values, and abnormally truncated key fields have their truncated content completed in a compliant manner. This helps ensure the integrity and compliance of core data block information when no original backup data is available, reduces business interruptions and downstream processing errors caused by data loss, lowers the risk of distorted completion results, and ensures that the completed data meets the specifications for subsequent consistency verification, warehousing, and business processing.
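
The threshold decision in this third effect can be sketched as follows. This is only an illustration under stated assumptions: the passage does not say how truncation is detected or how completion values are generated, so the caller supplies the set of truncated fields and the sketch stops at the repairable or invalid-repair decision. The function name, the field names, and the return labels are hypothetical.

```python
from typing import Iterable, List, Mapping, Optional, Set, Tuple


def classify_block_for_repair(
    block: Mapping[str, Optional[str]],
    key_fields: Iterable[str],
    truncated_fields: Set[str],
    field_number_threshold: int,
) -> Tuple[str, List[str]]:
    """Decide whether a data block that failed the integrity check is worth repairing.

    A key field counts as damaged if it is missing (absent, None, or empty)
    or if the caller has flagged it as abnormally truncated.
    """
    key_fields = list(key_fields)
    missing = [f for f in key_fields if block.get(f) in (None, "")]
    truncated = [f for f in key_fields if f in truncated_fields and f not in missing]
    damaged = missing + truncated
    if len(damaged) > field_number_threshold:
        # Too many damaged key fields: terminate and report an invalid repair.
        return "invalid_repair", damaged
    # Few enough damaged fields: completion values would be generated next.
    return "repairable", damaged


# Hypothetical block with one missing and one truncated key field.
block = {"voucher_id": "V001", "amount": None, "payer": "ACM", "channel": "bank"}
keys = ["voucher_id", "amount", "payer", "channel"]
lenient = classify_block_for_repair(block, keys, {"payer"}, field_number_threshold=2)
strict = classify_block_for_repair(block, keys, {"payer"}, field_number_threshold=1)
```

With a threshold of 2 the block is treated as repairable; with a threshold of 1 the same block is rejected, which is the early termination that the paragraph credits with saving resources.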

Attached Figure Description

[0012] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0013] Figure 1 is a schematic diagram of the structure of the multi-source data real-time compression and consistency verification platform based on streaming computing provided in an embodiment of this application;

[0014] Figure 2 is a flowchart outlining the overall process of the multi-source data real-time compression and consistency verification platform based on streaming computing provided in an embodiment of this application;

[0015] Figure 3 is a logic diagram of multi-source data consistency verification for the multi-source data real-time compression and consistency verification platform based on streaming computing provided in an embodiment of this application;

[0016] Figure 4 is a training diagram of the gradient boosting decision tree model for the multi-source data real-time compression and consistency verification platform based on streaming computing provided in an embodiment of this application;

[0017] Figure 5 is a comparison chart of the fitting performance of the algorithm adaptation evaluation model for the multi-source data real-time compression and consistency verification platform based on streaming computing provided in an embodiment of this application.

Detailed Implementation

[0018] The following provides explanations for some of the terms used in this application. It should be noted that these explanations are for the convenience of those skilled in the art and do not constitute a limitation on the scope of protection claimed in this application.

[0019] In the embodiments of this application, "at least one" means one or more, and "multiple" means two or more. Furthermore, it should be understood that in the description of this specification, terms such as "AA1", "AA2", and "AA3" are used only for descriptive purposes and should not be construed as indicating or implying relative importance, nor as indicating or implying order.

[0020] It should be explained that, before deployment, the multi-source data real-time compression and consistency verification platform based on streaming computing provided in this application requires a set of indicators and data resource libraries to be established and continuously maintained to support the efficient operation of the entire platform process. The resource library draws on two kinds of information. The first is the basic thresholds and configuration parameters defined by technical personnel based on the streaming data processing mechanism, business scenario requirements, and platform hardware specifications, such as the preset peak trigger threshold, preset growth threshold, and preset traffic limit. The second is effective reference data accumulated during the platform's historical operation, such as compression algorithm adaptation samples under the different traffic characteristics of each channel, and effective case records of unqualified data block repair and verification. For the data storage and management architecture, a collaborative storage strategy across multiple database types can be adopted. Relational databases can provide standardized storage and efficient correlation queries for structured threshold parameters, field rules, and channel relationships. Non-relational databases and time-series databases can provide rapid storage and accurate retrieval of the unstructured and time-series operational data produced in massive streaming processes, such as real-time processing performance data of each node, dynamic traffic monitoring sequences, and verification result feedback data. Technical personnel can continuously iterate and optimize the parameters and reference data in the resource library based on newly added multi-source data processing operation data and business application feedback, to adapt to changing business data characteristics and platform performance optimization needs.

[0021] It should also be explained that the preset compression algorithm matching mapping table and the preset data block mapping table, on which the dynamic selection of compression algorithms and the dynamic adjustment of data blocks in this application rely, are the core support for parameter matching and dynamic strategy adaptation. Their construction logic, multi-dimensional data support system, and rules for matching parameters must be clearly defined. The two mapping tables provide the basis for algorithm selection and block adjustment under different traffic characteristics, resource loads, and business scenarios. The correspondence between input parameters and output results stored in the tables is calibrated by technical personnel based on the processing characteristics of streaming computing, the business characteristics of multi-source data, and actual test data from the platform's operation, and is then incorporated into the platform's streaming data processing support system. In terms of data composition, the preset compression algorithm matching mapping table combines multi-dimensional data on three key parameters (channel core traffic metrics, core data characteristics, and CPU core count) with the matching results of the corresponding optimal compression algorithms. Its data sources include core quantitative indicators such as the number of data entries per second and the redundancy of multi-source data, performance test data for various compression algorithms under different CPU core counts, and verification data on algorithm adaptation effects under various parameter combinations. The preset data block mapping table records the correspondence between parameter combinations (core resource load status, business priority configuration, and channel traffic share) and dynamic block adjustment coefficients, along with business priority rating data and traffic share statistics for each channel. By analyzing the impact of different parameter combinations on compression processing efficiency and block parallelism, the system ensures that appropriate algorithms and block coefficients are matched accurately across operating scenarios.

[0022] During the construction of parameter correlation relationships, technical personnel will set differentiated matching weights based on the degree of influence of each input parameter on algorithm adaptation and block adjustment. For example, when a channel is in a high-traffic burst state and the data redundancy is high, a high compression ratio algorithm with a higher weight will be matched for it. When the core resource load is high and the traffic ratio of high-priority channels exceeds the threshold, a larger block adjustment coefficient will be matched to improve parallel processing efficiency. At the same time, based on the monitoring data of the platform's long-term operation, continuous statistical verification and iterative optimization will be carried out to eliminate abnormal calibration data caused by non-business factors such as temporary equipment failures and occasional abnormalities in data collection, reduce the interference of invalid data on the mapping relationship, and ensure that the correlation between parameter combinations and output results in the two types of mapping tables has practical engineering application value and statistical credibility.

[0023] It should also be explained that the resource consumption value, which quantifies the runtime cost of an algorithm during dynamic compression algorithm selection in this application, is based on a targeted weighting system consisting of a CPU weight, a memory weight, and a cluster resource load correction value. This system is pre-calibrated by streaming computing technicians according to platform hardware performance and multi-source data processing business requirements, and is stored synchronously in the platform's resource scheduling support system. It provides the core basis for the weighted calculation of the CPU usage increment and memory usage increment and for the accurate quantification of compression time. Specifically, the technicians first collect massive historical data from the platform's end-to-end operation, covering the CPU usage increment, memory usage increment, and cluster resource load correction value under different traffic characteristics, data types, and algorithm operation scenarios, as well as the actual resource consumption impact of different combinations of these indicators. At the same time, they collect resource adaptability verification data for compression algorithms running in various scenarios, assign the CPU usage increment and the memory usage increment weights that match their respective impact on overall resource consumption, and assign a matching quantitative value to the cluster resource load correction value.

[0024] For example, when the platform is in a high-parallel compression processing stage and CPU computing power becomes the core resource bottleneck, a higher CPU weight is assigned to highlight its proportion in the calculation of resource consumption value. If the platform processes mostly large-byte data blocks and memory usage has a more significant impact on algorithm running efficiency, the quantification value of memory weight is increased. Valid assignment samples of CPU weight and memory weight under various actual business scenarios are recorded simultaneously. Then, through multi-dimensional correlation analysis, abnormal data caused by non-algorithm running factors such as temporary hardware failures, abnormal data collection, and instantaneous system fluctuations are eliminated. Parameter correlation samples with statistical validity and scenario adaptability are selected. Finally, all valid data are integrated to construct the weight and correction value assignment system.
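
To make the role of the weights concrete, the sketch below computes a resource consumption value from a CPU usage increment and a memory usage increment. The exact formula is not given in this passage, so the additive form (weighted sum plus the cluster resource load correction value) is an assumption, as are the class name, the function name, and all numeric values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceWeights:
    """Calibrated weighting system; the values used below are illustrative only."""
    cpu_weight: float
    memory_weight: float
    load_correction: float


def resource_consumption(cpu_increment: float, memory_increment: float,
                         weights: ResourceWeights) -> float:
    """Assumed form: weighted CPU and memory increments plus a load correction."""
    return (weights.cpu_weight * cpu_increment
            + weights.memory_weight * memory_increment
            + weights.load_correction)


# Hypothetical calibrations for a CPU-bound and a memory-bound operating scenario.
cpu_bound = ResourceWeights(cpu_weight=0.7, memory_weight=0.3, load_correction=1.5)
memory_bound = ResourceWeights(cpu_weight=0.3, memory_weight=0.7, load_correction=1.5)

# The same measured increments (in percentage points) scored under each calibration.
cost_cpu_bound = resource_consumption(20.0, 5.0, cpu_bound)
cost_memory_bound = resource_consumption(20.0, 5.0, memory_bound)
```

Under the CPU-bound calibration the same CPU-heavy measurement yields a higher consumption value, which mirrors the idea of raising the CPU weight when CPU computing power is the bottleneck.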

[0025] To make the technical problems, technical solutions and advantages of the present invention clearer, a detailed description will be given below in conjunction with the accompanying drawings and specific embodiments.

[0026] Embodiment 1. Figure 1 is a schematic diagram of the structure of the multi-source data real-time compression and consistency verification platform based on streaming computing provided in this application embodiment. As shown in Figure 1, the platform includes: a data traffic monitoring module, a compression algorithm adaptability verification module, and a multi-source data consistency verification module. The data traffic monitoring module monitors data traffic in order to capture traffic changes in multi-source revenue data from each channel and to identify sudden increases in data traffic in real-time data entry scenarios for financial transactions. It obtains the data traffic monitoring results and, based on these results, evaluates whether to trigger compression algorithm adaptability verification, which checks how well the current compression algorithm matches real-time traffic characteristics. Monitoring data traffic makes it possible to perceive the real-time traffic dynamics of multi-source data from each channel, promptly identify traffic burst signals during peak business periods, and reduce at the source the resource waste caused by indiscriminate verification.

[0027] The compression algorithm compatibility verification module is used to perform compression algorithm compatibility verification after evaluation based on monitoring results and triggering the process. It then obtains the verification results, evaluates and triggers dynamic selection of the optimal compression algorithm to match real-time traffic characteristics and core data characteristics, and a dynamic adjustment mechanism for data blocks to improve parallel compression processing efficiency by dynamically adapting data block sizes when data volume suddenly increases during peak business periods. Compression algorithm compatibility verification helps achieve accurate adaptation between compression algorithms and business scenarios. Simultaneously, by dynamically dividing data blocks to release parallel compression computing power, it effectively reduces compression processing latency and resource consumption during peak business periods, improving the efficiency of real-time compression processing of multi-source data.

[0028] The multi-source data consistency verification module is used to perform multi-source data consistency verification to check the integrity, accuracy, and consistency of compressed multi-source revenue data with the original data after evaluation based on the verification results and triggering the process. It obtains the multi-source data consistency verification results, evaluates and initiates a multi-threaded batch entry mechanism to improve the parallelism of qualified data block entry into the database through differentiated batch partitioning and priority scheduling, and a parallel verification task scheduling mechanism to achieve efficient parallel verification of unqualified data blocks, reducing verification latency caused by excessive single-threaded load. Through multi-source data consistency verification, the validity of compressed data is verified from three dimensions: integrity, accuracy, and consistency. This reduces the impact of distorted or missing data entering the database on subsequent business processing, achieving efficient parallel entry of qualified data and low-latency verification of unqualified data, thus reducing resource scheduling consumption throughout the data processing process.

[0029] In this embodiment, the data traffic monitoring module, compression algorithm compatibility verification module, and multi-source data consistency verification module help to achieve integrated processing from traffic monitoring and compression algorithm adaptation to consistency verification, improving the overall efficiency, stability, and accuracy of multi-source data compression and warehousing processing. Specifically, the business processing logic of the three modules forms an orderly and progressive linkage relationship. The data traffic monitoring module provides accurate scenario judgment basis for the triggering and execution of subsequent modules, avoiding invalid execution. The compression algorithm compatibility verification module lays an efficient and compliant data source foundation for the multi-source data consistency verification module, reducing invalid processing in the verification process. The three modules work together to form a closed-loop streaming data processing mode.

[0030] Figure 2 is an overview flowchart of the multi-source data real-time compression and consistency verification platform based on streaming computing provided in this application embodiment. As shown in Figure 2, the overall process is as follows. Data traffic monitoring is performed, and the core traffic metrics of each channel are obtained. It is determined whether the core traffic metrics of each channel fail to meet every traffic burst judgment condition. If so, multi-source data consistency verification is performed; otherwise, burst validity determination is performed and the accumulated burst state value is obtained. It is then determined whether the accumulated burst state value exceeds the preset burst number threshold. If not, multi-source data consistency verification is performed; otherwise, compression algorithm adaptability verification is performed and the adaptability score of the current compression algorithm for each channel is obtained. It is then determined whether the adaptability score exceeds the preset adaptability score threshold. If so, the dynamic data block adjustment mechanism is triggered; otherwise, dynamic compression algorithm selection is triggered: the target compression algorithm is selected, its feasibility is verified, and its compression time and resource consumption are determined. If the compression time is less than the preset time threshold and the resource consumption is less than the preset consumption threshold, an algorithm switching instruction is generated; otherwise, the dynamic data block adjustment mechanism is triggered. After dynamic data block adjustment is completed, it is determined whether the compressed data processing throughput is greater than the preset throughput threshold and whether the average compression time is less than the preset time threshold. If not, a block adjustment failure prompt is sent; otherwise, multi-source data consistency verification is performed.
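
The branching in the Figure 2 overview can be written down as a few small decision functions. This is a sketch for orientation only: every name, the `Step` labels, and the split into four functions are assumptions, and the thresholds are plain parameters rather than calibrated values. The first branch follows the wording of the traffic monitoring steps, in which a channel that meets none of the burst conditions proceeds directly to consistency verification. What follows an algorithm switch is not stated in the overview, so the sketch stops there.

```python
from enum import Enum


class Step(Enum):
    CONSISTENCY_VERIFICATION = "multi-source data consistency verification"
    ADAPTABILITY_VERIFICATION = "compression algorithm adaptability verification"
    ALGORITHM_SELECTION = "dynamic compression algorithm selection"
    SWITCH_ALGORITHM = "generate algorithm switching instruction"
    BLOCK_ADJUSTMENT = "dynamic data block adjustment"
    BLOCK_ADJUSTMENT_FAILED = "send block adjustment failure prompt"


def after_monitoring(meets_burst_condition: bool, burst_count: int,
                     burst_number_threshold: int) -> Step:
    """Route a channel after traffic monitoring and burst validity determination."""
    if not meets_burst_condition or burst_count <= burst_number_threshold:
        return Step.CONSISTENCY_VERIFICATION
    return Step.ADAPTABILITY_VERIFICATION


def after_adaptability(score: float, score_threshold: float) -> Step:
    """A sufficiently adapted algorithm is kept; only the blocks are adjusted."""
    if score > score_threshold:
        return Step.BLOCK_ADJUSTMENT
    return Step.ALGORITHM_SELECTION


def after_selection(compress_time: float, time_threshold: float,
                    resource_cost: float, cost_threshold: float) -> Step:
    """Switch only if the target algorithm is both fast enough and cheap enough."""
    if compress_time < time_threshold and resource_cost < cost_threshold:
        return Step.SWITCH_ALGORITHM
    return Step.BLOCK_ADJUSTMENT


def after_block_adjustment(throughput: float, throughput_threshold: float,
                           avg_compress_time: float, time_threshold: float) -> Step:
    """Accept the adjustment only if throughput and average time both pass."""
    if throughput > throughput_threshold and avg_compress_time < time_threshold:
        return Step.CONSISTENCY_VERIFICATION
    return Step.BLOCK_ADJUSTMENT_FAILED
```

Keeping each decision as a separate function reflects that real work (verification, selection, adjustment) happens between the decisions, so the flow cannot be collapsed into a single call.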

[0031] Preferably, the specific process of data traffic monitoring is as follows: AA1, based on a multi-threaded parallel acquisition mechanism, synchronously collect core traffic data from each channel (for example, the centralized treasury payment channel, the horizontal network channel linking finance, taxation, and banks, and the treasury revenue and expenditure accounting channel in the treasury scenario; or the road network monitoring data channel, rail transit operation data channel, and logistics freight tracking channel in the transportation scenario). The core traffic data include the number of data entries per second, the total number of bytes, the average data packet size, and the data arrival timestamp; collecting them keeps traffic data fine-grained and real-time and provides a basis for subsequently judging whether multi-source data is surging. AA2, obtain the average multi-source data access rate, the peak multi-source data access rate, and the cumulative multi-source data volume of each channel within the acquisition window; together these form the core traffic indicators of each channel. The average multi-source data access rate is the total number of data bytes monitored by the network traffic analyzer for the channel within the acquisition window divided by the duration of the acquisition window, and reflects the overall stability of traffic. The peak multi-source data access rate is the maximum number of data bytes accessed per second for the channel, as monitored by the network traffic analyzer within the acquisition window, and reflects the degree of sudden traffic fluctuation. The cumulative multi-source data volume is the sum of the bytes of all data packets for the channel, as monitored by the network traffic analyzer within the acquisition window, and reflects the overall load scale of traffic.
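The three core traffic indicators defined in step AA2 can be illustrated with a minimal Python sketch. It assumes the acquisition window has already been reduced to one byte count per second; the names `CoreTrafficMetrics` and `core_traffic_metrics` are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass


@dataclass
class CoreTrafficMetrics:
    average_rate: float    # bytes per second, averaged over the window
    peak_rate: int         # largest byte count seen in any single second
    cumulative_bytes: int  # total bytes received in the window


def core_traffic_metrics(bytes_per_second: list[int]) -> CoreTrafficMetrics:
    """Compute the AA2 indicators for one channel.

    Assumption: bytes_per_second holds one sample per second of the
    acquisition window, so its length equals the window duration in seconds.
    """
    if not bytes_per_second:
        raise ValueError("acquisition window must contain at least one sample")
    total = sum(bytes_per_second)
    return CoreTrafficMetrics(
        average_rate=total / len(bytes_per_second),
        peak_rate=max(bytes_per_second),
        cumulative_bytes=total,
    )
```

For a three-second window with samples 100, 300, and 200 bytes, the sketch yields an average rate of 200 bytes per second, a peak rate of 300, and a cumulative volume of 600 bytes.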

[0032] AA3, obtain the traffic burst judgment conditions. The traffic burst judgment conditions are: the average multi-source data access rate is greater than a preset peak trigger threshold, the peak multi-source data access rate is greater than a preset growth threshold, or the cumulative multi-source data volume is greater than a preset traffic limit. The preset peak trigger threshold is the mean of the average multi-source data access rate over a historical period, the preset growth threshold is the mean of the peak multi-source data access rate over a historical period, and the preset traffic limit is the mean of the cumulative multi-source data volume over a historical period. If the core traffic indicators of each channel meet at least one of the above traffic burst judgment conditions, then, to avoid misjudgment caused by temporary interference such as network jitter or data mistransmission, the corresponding burst status counter is incremented, the burst status accumulation value is monitored by the burst status monitoring probe, and burst validity is determined based on the burst status accumulation value. If the core traffic indicators of each channel do not meet the burst judgment conditions, the current data traffic is determined to be in a flat state, and multi-source data consistency verification is performed. Burst validity judgment means judging whether the burst status accumulation value is greater than the preset burst number threshold. If it is, the current data volume is determined to show burst growth associated with a business peak (such as a payment peak), and compression algorithm adaptability verification is performed. Otherwise, the current data traffic is determined to be in a flat state, and multi-source data consistency verification is performed. The preset burst number threshold is set in advance by designated personnel.
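The burst judgment and burst validity logic of step AA3 can be sketched as follows. This is an illustrative sketch only: the dictionary keys, the class name `BurstStateCounter`, and the returned route labels are hypothetical, and because the embodiment does not state when the counter is reset, the sketch never resets it.

```python
CONSISTENCY = "multi_source_consistency_verification"
COMPATIBILITY = "compression_algorithm_compatibility_verification"


def meets_burst_condition(avg_rate: float, peak_rate: float,
                          cumulative: float, thresholds: dict[str, float]) -> bool:
    """True if at least one of the three burst judgment conditions holds."""
    return (avg_rate > thresholds["peak_trigger"]
            or peak_rate > thresholds["growth"]
            or cumulative > thresholds["traffic_limit"])


class BurstStateCounter:
    """Accumulates burst observations so a single spike is not acted on."""

    def __init__(self, burst_count_threshold: int) -> None:
        self.burst_count_threshold = burst_count_threshold
        self.accumulated = 0  # burst status accumulation value

    def observe(self, burst_condition_met: bool) -> str:
        """Return the next processing step for the current window."""
        if not burst_condition_met:
            return CONSISTENCY  # flat traffic
        self.accumulated += 1
        if self.accumulated > self.burst_count_threshold:
            return COMPATIBILITY  # burst confirmed as valid
        return CONSISTENCY  # treated as flat until the threshold is exceeded
```

With a burst number threshold of 2, the first two bursting windows are still routed to consistency verification, and only the third triggers compression algorithm compatibility verification.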

[0033] In this embodiment, data traffic monitoring helps to capture the traffic characteristics of multi-source revenue data from various channels during the collection period, understand the overall stability of traffic, the degree of sudden fluctuations and the overall load scale, and provide real and accurate traffic quantification basis for subsequent compression algorithm adaptability verification, algorithm dynamic selection and data block dynamic adjustment. This enables early prediction and accurate response to sudden traffic surges during business peaks, and ensures dynamic matching between data processing strategies and real-time traffic status in streaming computing scenarios.

[0034] Preferably, the specific process for compression algorithm compatibility verification is as follows: BB1, extract the core data features of each channel and the identifier of the currently effective compression algorithm (such as the LZ4 algorithm, the ZSTD algorithm, or the Snappy algorithm). The core data features of each channel include multi-source data redundancy, multi-source data field density, and multi-source data type. Multi-source data redundancy is the ratio of the number of bytes of repeated fields to the total number of bytes in the data block. A field is the smallest data unit in the data block that has a clear business meaning and fixed format boundaries (such as the payer code and amount in the treasury scenario, or the license plate number and road segment number in the traffic scenario). Repeated fields are fields whose content is completely identical within the same collection window and which belong to the same business dimension (such as the payer information dimension, the payee information dimension, or the transaction scenario dimension). For example, in the treasury scenario, among q payment data entries in the same collection window, if the payer code 1001000023 appears n times and the payee treasury code 100001 appears w times, and both n and w are greater than the preset duplication threshold, then the payer code and payee treasury code fields are considered repeated fields. The number of bytes of repeated fields is represented by the ratio of the sum of the bytes of all fields whose content appears more often than the preset duplication threshold to the total number of bytes in the collection window. The preset duplication threshold is set in advance by designated personnel.
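As an illustration of the redundancy feature in step BB1, the following sketch computes the share of bytes that belong to repeated fields. It rests on two assumptions not stated in the embodiment: the field name stands in for the business dimension, and field size is measured as the UTF-8 length of the value. The function name `redundancy` is hypothetical.

```python
from collections import Counter


def redundancy(records: list[dict[str, str]], duplication_threshold: int) -> float:
    """Ratio of bytes in repeated fields to total bytes in the window.

    A (field name, value) pair counts as repeated when it occurs more often
    than duplication_threshold across the records of the collection window.
    """
    counts = Counter(
        (name, value) for record in records for name, value in record.items()
    )
    total_bytes = 0
    repeated_bytes = 0
    for record in records:
        for name, value in record.items():
            size = len(value.encode("utf-8"))
            total_bytes += size
            if counts[(name, value)] > duplication_threshold:
                repeated_bytes += size
    return repeated_bytes / total_bytes if total_bytes else 0.0
```

For three payment records that share the payer code 1001000023 and carry different two-character amounts, a duplication threshold of 2 marks the payer code as repeated, giving a redundancy of 30/36.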

[0035] Multi-source data field density is the ratio of the number of bytes actually stored in structured fields (including fixed-format numerical and character fields) in the data block, as monitored by the network traffic analyzer, to the total number of bytes stored in the data block. Multi-source data type denotes the structural attribute of the channel's data and falls into three categories: structured data (data with fixed fields and standardized formats, such as standard business transaction logs), semi-structured data (data with variable fields and non-fixed-format content, such as business records with notes), and mixed-structured data (data containing both structured and semi-structured content). The core data features of each channel, the identifier of the currently effective compression algorithm, and the identifier of the current business scenario type (such as the real-time revenue data entry scenario or the fund settlement scenario) are input into a preset algorithm adaptation evaluation model (such as a gradient boosting decision tree), which outputs the adaptability score of the current compression algorithm for each channel. The adaptability score covers compression efficiency adaptability and resource consumption adaptability. Compression efficiency adaptability is the degree to which the data compression rate and compression ratio achieved by the current compression algorithm, given the core data features of the corresponding channel and the current business scenario type, match the efficiency benchmark values of the optimal compression algorithm for that scenario. It quantifies whether the algorithm's compression efficiency on multi-source data in the actual business scenario matches the channel's data features and the real-time requirements of the scenario.
Resource consumption adaptability is the degree to which the CPU, memory, and other hardware resource usage generated by the current compression algorithm at run time fits the platform's resource allocation benchmark value and resource load threshold for the corresponding business scenario. It measures whether the resource consumption of the running algorithm stays within the range the business scenario allows, so that core business processing is not starved of resources by excessive algorithm resource usage. BB2, determine the adaptability of the compression algorithm based on the adaptability score, as follows: if the adaptability score of the channel is greater than the preset adaptability score threshold, the current compression algorithm of the channel is determined to meet the processing needs of the data burst growth scenario, the algorithm does not need to be changed, and the dynamic data block adjustment mechanism is triggered. The preset adaptability score threshold is the average adaptability score of the channel over a historical period. If the current adaptability score of the channel is not greater than the preset adaptability score threshold, the current compression algorithm of the channel is determined to be insufficiently adaptable, and the dynamic compression algorithm selection process is initiated for the channel.

[0036] The specific training process of the algorithm adaptation evaluation model is as follows: First, collect historical adaptation data (including data feature parameters, algorithm type, compression efficiency, resource consumption, etc.) of different data features and compression algorithms under multiple business scenarios; then, clean and perform feature engineering on the historical data to construct a feature sample set and corresponding adaptation labels (adapted / unsuitable); subsequently, train the algorithm adaptation evaluation model (such as the gradient boosting decision tree model) based on the feature sample set, and iteratively optimize the model parameters (such as decision tree depth, learning rate, leaf node sample number threshold, etc.) to minimize the adaptation judgment error; finally, verify the model's adaptation judgment accuracy (such as precision, recall, F1 score, etc.) through the test set, and deploy it to the system for real-time evaluation after the accuracy reaches the target.

[0037] In this embodiment, compression algorithm compatibility verification helps to quantitatively evaluate the matching degree between the current compression algorithm and real-time traffic characteristics, core data characteristics, and business scenarios. This provides a basis for dynamic selection of compression algorithms and dynamic adjustment of data blocks, reducing problems such as excessive compression processing latency and unbalanced system resource usage caused by improper algorithm adaptation. It also improves the processing efficiency and resource utilization rationality of real-time compression of multi-source data, realizes dynamic adaptation of compression algorithms to different scenarios such as peak business periods and regular traffic, and ensures the compression performance stability of the streaming computing cluster under parallel processing of multi-channel data.

[0038] Preferably, the specific process for dynamically selecting the compression algorithm is as follows: The core traffic metrics of the current channel, the core data features of the current channel, and the number of CPU cores monitored by the server performance monitor are input into a preset compression algorithm matching mapping table, and the corresponding target compression algorithm is obtained through mapping and matching. The feasibility of the target compression algorithm is then verified, as follows: resource load simulation is performed to quantify the compression time and resource consumption of the target compression algorithm under the current cluster resource load (CPU utilization, memory usage). Resource load simulation means inputting the computational complexity parameters of the target compression algorithm, the cumulative multi-source data volume, and the core data features of the channel into a resource load simulation model (such as a resource scheduling simulation model or a resource consumption evaluation model), which outputs the simulated unit data compression time, CPU usage increment value, and memory usage increment value. The computational complexity parameters of the target compression algorithm are set in advance by designated personnel (they are the inherent parameters of each compression algorithm entered during system initialization, such as the number of instructions in a single compression operation and the baseline memory usage). The compression time and resource consumption value are calculated from the unit data compression time, CPU usage increment value, and memory usage increment value obtained from the resource load simulation. The compression time is the product of the unit data compression time, the cumulative multi-source data volume, and the cluster resource load correction value. The resource consumption value is the weighted sum of the CPU usage increment value and the memory usage increment value, weighted by the CPU weight and the memory weight respectively. The CPU weight reflects how strongly the CPU usage increment value influences the resource consumption value, and the memory weight reflects how strongly the memory usage increment value influences the resource consumption value.

[0039] If the compression time is less than a preset time threshold and the resource consumption is less than a preset consumption threshold, the current target compression algorithm is deemed feasible; otherwise, it is deemed infeasible. The preset time threshold is represented by the average compression time over a historical period, and the preset consumption threshold is represented by the average resource consumption over a historical period. If the target compression algorithm is found to be feasible, it is uploaded to the streaming data processing center to generate an algorithm switching instruction. Based on this instruction, the channel compression algorithm is updated and configured. The algorithm switching instruction specifies the target compression algorithm type, effective channel range, effective time window (effective after the current collection window ends), and algorithm configuration parameters. It can be directly sent to the compression processing operator of the streaming computing framework to complete the algorithm switching. If the target compression algorithm is found to be infeasible, the current compression algorithm is maintained, and a dynamic data block adjustment mechanism is initiated based on it. After the dynamic selection of the compression algorithm is completed, the dynamic data block adjustment mechanism is triggered.
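The feasibility check described in paragraphs [0038] and [0039] reduces to two formulas and a pair of strict comparisons, sketched below. The function and parameter names are hypothetical, and the embodiment does not fix units, so the sketch treats all inputs as plain numbers supplied by the resource load simulation.

```python
def compression_time(unit_time: float, cumulative_volume: float,
                     load_correction: float) -> float:
    """Unit data compression time x cumulative data volume x load correction."""
    return unit_time * cumulative_volume * load_correction


def resource_consumption(cpu_increment: float, memory_increment: float,
                         cpu_weight: float, memory_weight: float) -> float:
    """Weighted sum of the CPU and memory usage increment values."""
    return cpu_weight * cpu_increment + memory_weight * memory_increment


def is_feasible(time_value: float, consumption_value: float,
                time_threshold: float, consumption_threshold: float) -> bool:
    """Feasible only if both values are strictly below their thresholds."""
    return time_value < time_threshold and consumption_value < consumption_threshold
```

A target algorithm that passes `is_feasible` would lead to an algorithm switching instruction; one that fails keeps the current algorithm, and in both cases the dynamic data block adjustment mechanism follows.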

[0040] In this embodiment, dynamic selection of compression algorithms helps to reduce problems such as low compression efficiency and excessive consumption of system resources caused by insufficient adaptability of a single compression algorithm. It improves the compression processing efficiency and resource utilization refinement of multi-source data in different business scenarios, realizes intelligent matching and rapid switching of the optimal compression algorithm under different traffic characteristics and data types, and ensures the efficiency and stability of parallel compression of multi-channel data in streaming computing scenarios.

[0041] Preferably, the specific process of the data block dynamic adjustment mechanism is as follows: CC1 collects the core resource load status of the current streaming computing cluster, including the CPU utilization of each processing node monitored by the server performance monitor, the memory occupancy rate monitored by the memory resource monitoring probe, the network transmission rate monitored by the network traffic analyzer, and the number of remaining computing threads monitored by the cluster thread scheduling monitor; at the same time, it extracts the business priority configuration of each channel (such as the data priority of the cooperative institution payment channel is higher than that of ordinary government affairs transmission data) and traffic proportion (such as channel A accounting for 60% of the total traffic, channel B accounting for 30%, etc.), and generates the channel priority ranking result based on the business priority configuration and traffic proportion to ensure that the processing resources of high-priority data are given priority; the channel priority ranking result represents the channel processing priority sequence formed after combining the business priority configuration of each channel, quantifying and classifying it by the system, and sorting it in an orderly manner. The queue is first divided into core levels from high to low according to the preset priority level (such as level 1, level 2, level 3), and then sorted from high to low according to the channel traffic proportion under the same level. This sequence takes the importance of the business scenario and the timeliness requirements of data processing as the core classification basis, and clarifies the order of each channel in resource allocation and task scheduling.
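The channel priority ranking in step CC1 is a two-key sort: priority level first, then traffic share within the same level. A minimal sketch follows, assuming that level 1 is the highest priority; the `Channel` type and `rank_channels` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    name: str
    priority_level: int   # assumption: 1 is the highest priority level
    traffic_share: float  # fraction of total traffic, e.g. 0.6 for 60%


def rank_channels(channels: list[Channel]) -> list[Channel]:
    """Order by priority level, then by descending traffic share."""
    return sorted(channels, key=lambda c: (c.priority_level, -c.traffic_share))
```

Given a level-2 channel B with a 30% share and two level-1 channels A (20%) and C (50%), the resulting order is C, A, B.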

[0042] CC2, input the current core resource load status, business priority configuration, and traffic share into the preset data block mapping table to obtain the dynamic block adjustment coefficient of each channel. CC3, generate block adjustment instructions based on the dynamic block adjustment coefficients; a block adjustment instruction indicates that the block size of the corresponding channel's data is adjusted step by step in the direction of decreasing block size, with the dynamic block adjustment coefficient as the step size. CC4, send the block adjustment instructions to the block processing operator of the streaming computing framework. The operator updates the data block configuration of each channel according to the block adjustment instructions, continuously monitors the compressed data processing throughput and average compression time of each block, and evaluates the effectiveness of the block adjustment based on these two values. The specific evaluation process is as follows: if the compressed data processing throughput is greater than the previous compressed data processing throughput, and the average compression time is less than the previous average compression time, the block size adjustment continues under the current block adjustment instruction until the compressed data processing throughput is greater than the preset throughput threshold and the average compression time is less than the preset time threshold, at which point multi-source data consistency verification is initiated. The preset throughput threshold is the mean of the compressed data processing throughput over a historical time period, and the preset time threshold is the mean of the average compression time over a historical time period.
If the above conditions are not met, or the data block size is less than the preset block threshold, the adjustment record (including adjustment time, original data block size, compressed data processing throughput, and average compression time) is written to the log system, and a block adjustment failure prompt is sent. The preset block threshold is set in advance by designated personnel. The compressed data processing throughput is the ratio of the total number of bytes of data successfully compressed from each channel within the collection window to the collection window duration. The average compression time is the sum of the compression times of all compressed data blocks from each channel within the collection window divided by the total number of data blocks.
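One possible reading of the CC3 and CC4 adjustment loop is sketched below. Several points are assumptions rather than statements of the embodiment: the dynamic block adjustment coefficient is treated as an absolute step in bytes, the first comparison is made against a baseline measurement at the initial block size, and `measure` is a hypothetical callback that returns the observed (throughput, average compression time) for a given block size.

```python
from typing import Callable

Measure = Callable[[int], tuple[float, float]]


def adjust_block_size(initial_size: int, step: int, min_block_size: int,
                      measure: Measure, throughput_threshold: float,
                      time_threshold: float) -> tuple[str, int]:
    """Shrink the block size step by step and report the outcome."""
    if step <= 0:
        raise ValueError("step must be positive")
    size = initial_size
    prev_throughput, prev_time = measure(size)  # baseline before adjusting
    while True:
        if size - step < min_block_size:
            return ("block_adjustment_failed", size)  # would go below the floor
        size -= step
        throughput, avg_time = measure(size)
        improved = throughput > prev_throughput and avg_time < prev_time
        if not improved:
            return ("block_adjustment_failed", size)
        if throughput > throughput_threshold and avg_time < time_threshold:
            return ("consistency_verification", size)
        prev_throughput, prev_time = throughput, avg_time
```

The loop always terminates because the block size strictly decreases on every iteration and is bounded below by the preset block threshold. A production operator would also write the adjustment record to the log system on failure, which the sketch omits.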

[0043] In this embodiment, the dynamic adjustment mechanism for data block partitioning helps to reduce problems such as wasted computing power in parallel processing and excessive latency in processing high-priority channel data caused by fixed partitioning modes. It improves the efficiency of parallel compression processing of multi-source data blocks and the fine-grained scheduling level of cluster resources, reduces processing congestion caused by unreasonable data partitioning during peak business periods, and realizes dynamic and differentiated partitioning of channel data with different traffic scales and different business priorities. This ensures the smoothness of parallel processing of multi-channel data and the timeliness of processing core business data in streaming computing scenarios.

[0044] Figure 3 illustrates the multi-source data consistency verification logic of the multi-source data real-time compression and consistency verification platform based on streaming computing provided in this embodiment of the application. As shown in Figure 3: multi-source data consistency verification is performed, and the indicators for evaluating the integrity, accuracy, and temporal consistency of multi-source data are obtained. It is then determined whether the quality of all data blocks from the same channel meets the data quality qualification criteria. If so, the verification and storage operation for that channel's data blocks is completed according to the original verification and storage process. Otherwise, quality monitoring is performed on all data blocks in that channel, marking data blocks with substandard data quality as unqualified and data blocks with satisfactory data quality as qualified. For the set of qualified data blocks, the multi-threaded batch data entry mechanism is initiated, while for the set of unqualified data blocks, the parallel verification task scheduling mechanism is initiated. In that mechanism, data block integrity repair is performed for data blocks that fail the indicator for evaluating the integrity of multi-source data; multi-source data accuracy repair is performed for data blocks that fail the indicator for evaluating the accuracy of multi-source data; and data block timing repair is performed for data blocks that fail the indicator for evaluating the temporal consistency of multi-source data. After the parallel verification task scheduling mechanism completes, it is determined whether the multi-source data consistency verification is successful. If it is, the multi-threaded batch data entry mechanism is initiated; otherwise, a parallel verification failure notification is sent.

[0045] Preferably, the specific process for multi-source data consistency verification is as follows: For each data block of each channel, the indicator for evaluating the integrity of multi-source data, the indicator for evaluating the accuracy of multi-source data, and the indicator for evaluating the temporal consistency of multi-source data are obtained respectively. The indicator for evaluating the integrity of multi-source data is obtained as follows: First, extract the preset key field list corresponding to the channel data (mandatory fields such as business serial number, amount, and organization code), and count the total number of key fields in the list using a statistical algorithm (such as the field enumeration statistics method or the list matching counting method). The preset key field list is set in advance by designated personnel. Then, verify the key fields in the current data block one by one, and count the valid key fields, that is, those that are not empty and show no abnormal truncation (the field length meets the preset field length standard, for example the amount field is fixed at 16 bytes and the organization code is fixed at 8 bytes). Finally, the indicator for evaluating the integrity of multi-source data is the ratio of the number of valid key fields in the data block, as monitored by the data quality monitoring instrument, to the total number of preset key fields. The total number of preset key fields, that is, the number of key fields used for evaluating the integrity of multi-source data, is set in advance by designated personnel. The indicator for evaluating the accuracy of multi-source data is obtained as follows: Select core fields (such as amount, business code, and timestamp) from the data block, calculate the hash value of each core field using a hash algorithm (such as an MD5 hash value), and generate the current core field hash set. At the same time, retrieve the core field hash values recorded for the original data block at its initial collection, generating the original core field hash set. The retrieval process is as follows: using the unique identifier information of the data block (such as the combination of business serial number and channel code), search the original data hash value backup library of the streaming computing cluster; after matching the original hash value storage entry that corresponds one-to-one with the current data block, extract the original core field hash values retained in that entry. Then compare the current core field hash set with the original core field hash set field by field, and count the number of matching core fields. Finally, the indicator for evaluating the accuracy of multi-source data is the ratio of the number of matching core fields to the total number of core fields. The core fields are set in advance by designated personnel.
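The integrity and accuracy indicators of paragraph [0045] can be sketched as two ratios. The sketch assumes a data block is represented as a mapping from field name to raw bytes and that the preset field length standard is an exact length per key field; MD5 is used because the text names it as an example. All function names are hypothetical.

```python
import hashlib


def integrity_indicator(block: dict[str, bytes],
                        key_field_lengths: dict[str, int]) -> float:
    """Share of key fields that are non-empty and have the expected length."""
    valid = sum(
        1 for name, length in key_field_lengths.items()
        if block.get(name) and len(block[name]) == length
    )
    return valid / len(key_field_lengths)


def field_hashes(block: dict[str, bytes], core_fields: list[str]) -> dict[str, str]:
    """Per-field MD5 digests; a missing field hashes as empty bytes."""
    return {name: hashlib.md5(block.get(name, b"")).hexdigest()
            for name in core_fields}


def accuracy_indicator(block: dict[str, bytes], original_hashes: dict[str, str],
                       core_fields: list[str]) -> float:
    """Share of core fields whose current hash equals the stored original."""
    current = field_hashes(block, core_fields)
    matches = sum(1 for name in core_fields
                  if current[name] == original_hashes.get(name))
    return matches / len(core_fields)
```

In practice `original_hashes` would come from the original data hash value backup library, looked up by the block's unique identifier; the sketch takes it as an argument to stay self-contained.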

[0046] The indicator for evaluating the temporal consistency of multi-source data is obtained as follows: Extract the timestamps of all data records within the current data block in the order of data acquisition, constructing an actual time-series sequence. For example, the sequence could be [(R1, T1), (R2, T2), (R3, T3), ..., (Rn, Tn)], where R1~Rn represent the 1st to nth data records in the current data block, ordered by acquisition sequence, and T1~Tn represent the acquisition timestamps corresponding to each record (e.g., in the format YYYY-MM-DD HH:MM:SS.fff). Count the number of records in the actual time-series sequence that meet the preset time-series rules (timestamp differences within a preset reasonable range, no reverse order). For example, if R1 corresponds to T1 = 2026-01-16 10:00:00.000 and R2 corresponds to T2 = 2026-01-16 10:00:00.090, then T2 - T1 = 90 milliseconds (within the reasonable range) and T2 > T1 (no reverse order), so R2 is determined to meet the rules; if R3 corresponds to a timestamp T3 on 2026-01-16 such that T3 < T2 (reverse order exists), then R3 is determined not to meet the rules. Finally, the indicator for evaluating the temporal consistency of multi-source data is the ratio of the number of records conforming to the preset time-series rules to the total number of records in the data block. Based on the data quality qualification standard, each data block is then judged to determine whether its indicators for evaluating the integrity, accuracy, and temporal consistency of multi-source data are qualified.
The data quality qualification standard is that the indicator for evaluating the integrity of multi-source data is greater than the preset integrity threshold, the indicator for evaluating the accuracy of multi-source data is greater than the preset accuracy threshold, and the indicator for evaluating the time sequence consistency of multi-source data is greater than the preset consistency threshold. The preset integrity threshold is represented by the average value of the indicators for evaluating the integrity of multi-source data over a historical time period; the preset accuracy threshold is represented by the average value of the indicators for evaluating the accuracy of multi-source data over a historical time period; and the preset consistency threshold is represented by the average value of the indicators for evaluating the time sequence consistency of multi-source data over a historical time period.
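The temporal consistency indicator of paragraph [0046] can be sketched with the standard `datetime` module. Two assumptions are made that the embodiment leaves open: the "preset reasonable range" is modelled as a single upper bound `max_gap` on the difference between consecutive timestamps, and the first record, which has no predecessor, is counted as compliant. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def temporal_consistency(timestamps: list[datetime], max_gap: timedelta) -> float:
    """Share of records whose timestamp follows its predecessor within max_gap."""
    if not timestamps:
        raise ValueError("data block must contain at least one record")
    compliant = 1  # assumption: the first record has nothing to violate
    for previous, current in zip(timestamps, timestamps[1:]):
        gap = current - previous
        # Strictly later than the predecessor (no reverse order) and not too far.
        if timedelta(0) < gap <= max_gap:
            compliant += 1
    return compliant / len(timestamps)
```

Using the example from the text with a hypothetical third timestamp 50 milliseconds after T1 (so T3 < T2), R1 and R2 comply and R3 does not, giving an indicator of 2/3.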

[0047] If all data blocks from the same channel meet the data quality acceptance criteria, the data quality of that channel is deemed acceptable. The original verification and data entry process (such as single-threaded verification and batch data entry) is then followed to complete the verification and data entry operation for that channel. If any data block from any channel fails to meet the data quality acceptance criteria, to avoid exceeding verification latency limits and causing statistical chaos in data entry, and to prevent acceptable data from being blocked, quality monitoring is performed on all data blocks in that channel. Data blocks with unacceptable data quality are marked as unacceptable, and data blocks with acceptable data quality are marked as acceptable. For the acceptable data block set, a multi-threaded batch data entry mechanism is initiated, while for the unacceptable data block set, a parallel verification task scheduling mechanism is initiated. An acceptable data set represents a collection of data blocks consisting of acceptable data blocks; an unacceptable data block set represents a collection of data blocks consisting of unacceptable data blocks.

[0048] In this embodiment, multi-source data consistency verification helps reduce processing errors caused by missing, distorted, or inconsistent data entering subsequent business processes from the source, reduces business operation failures caused by data quality issues, enables accurate screening of qualified data and timely identification of unqualified data, and provides accurate execution basis for subsequent multi-threaded batch data entry and parallel verification task scheduling.

[0049] Preferably, the specific process of the multi-threaded batch data entry mechanism is as follows: For the set of qualified data blocks, each qualified data block is divided into several data batches based on the traffic share of each channel. Qualified data blocks from high-traffic channels (such as the horizontal network channel between the treasury and tax bureaus in the treasury scenario, and the main channel for road network monitoring in the transportation scenario) are divided into preset small batches to improve the parallelism of data entry. Qualified data blocks from low-traffic channels (such as the sporadic non-tax revenue channel in the treasury scenario, and the remote road monitoring channel in the transportation scenario) are divided into preset large batches to reduce the number of data entry scheduling operations. Both the preset small batch division and the preset large batch division are set in advance by designated personnel. High-traffic channels are channels whose traffic share is greater than a preset traffic threshold, where the preset traffic threshold is the average traffic share of each channel over a historical time period. Low-traffic channels are channels whose traffic share is not greater than the preset traffic threshold. At the same time, an inbound priority queue is generated based on the channel priority ranking result. Qualified data batches from each channel are distributed to the corresponding storage nodes according to the inbound priority queue, and the storage nodes write the data in order of inbound priority and batch order, avoiding the delays to core data entry that mixed storage of data with different priorities would cause. During the inbound process, each data batch carries an integrity check code (generated from the hash values of all data blocks within the batch). After the storage node finishes writing, it synchronously verifies the integrity check code.
The specific verification process is as follows: The storage node recalculates the overall hash value of the written batch data based on the hash algorithm and compares the recalculated hash value with the integrity check code carried by the batch bit by bit. If the two are completely consistent, it indicates that the data has not been lost or tampered with, and the verification is considered successful. If the two are not completely consistent, the verification is considered to have failed. If the verification is successful, the inbound status is updated to successful, and the inbound time and storage location are recorded. If the verification fails, an inbound exception is recorded and an inbound failure prompt is sent.
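As a non-limiting illustration, the integrity check code verification described in [0049] can be sketched as follows. The choice of SHA-256, the way block digests are folded into one batch digest, and the function names `batch_check_code` and `verify_written_batch` are assumptions of this sketch; the text only states that the code is generated from the hash values of all data blocks in the batch.

```python
import hashlib


def batch_check_code(blocks: list[bytes]) -> str:
    """Derive a batch-level check code from the hash of every block in the batch."""
    outer = hashlib.sha256()
    for block in blocks:
        # Hash each block first, then fold the block digests into one batch digest.
        outer.update(hashlib.sha256(block).digest())
    return outer.hexdigest()


def verify_written_batch(written_blocks: list[bytes], carried_code: str) -> dict:
    """Recompute the check code after writing and compare it with the carried one."""
    recalculated = batch_check_code(written_blocks)
    if recalculated == carried_code:
        # A real storage node would also record inbound time and storage location.
        return {"status": "success"}
    return {"status": "failed", "reason": "integrity check code mismatch"}


# Hypothetical batch of three data blocks.
batch = [b"record-1", b"record-2", b"record-3"]
code = batch_check_code(batch)
ok = verify_written_batch(batch, code)
bad = verify_written_batch([b"record-1", b"record-X", b"record-3"], code)
```

In this sketch a single altered block changes the recomputed code, so the second call reports a failed inbound status.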

[0050] In this embodiment, the multi-threaded batch data entry mechanism helps to reduce problems such as high-traffic channel data entry congestion and core business data processing delays caused by unified batch data entry, improves the overall data entry efficiency of qualified data from multiple sources and the data entry resource utilization of the streaming computing cluster, and realizes parallel and orderly data entry of channel data with different business priorities and different traffic scales.
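As a non-limiting illustration, the batch division and inbound priority queue of the multi-threaded batch data entry mechanism might be sketched as follows. The batch sizes, the dictionary layout of a channel, and the use of the current mean traffic share as a stand-in for the historical average are all assumptions of this sketch.

```python
def split_into_batches(blocks, traffic_share, threshold, small_size, large_size):
    """Split one channel's qualified blocks into batches sized by traffic share."""
    # High-traffic channels (share above the threshold) get small batches for
    # parallelism; the others get large batches to reduce scheduling operations.
    size = small_size if traffic_share > threshold else large_size
    return [blocks[i:i + size] for i in range(0, len(blocks), size)]


def build_inbound_queue(channels, small_size=2, large_size=5):
    """Return (priority, channel, batch_index, batch) tuples in write order."""
    # Assumption: the traffic threshold is approximated by the mean share.
    threshold = sum(c["share"] for c in channels) / len(channels)
    queue = []
    for c in channels:
        batches = split_into_batches(
            c["blocks"], c["share"], threshold, small_size, large_size
        )
        for idx, batch in enumerate(batches):
            queue.append((c["priority"], c["name"], idx, batch))
    # Assumption: a lower priority number is written first; batch order is
    # preserved within each channel.
    queue.sort(key=lambda item: (item[0], item[1], item[2]))
    return queue


# Hypothetical channels; names and numbers are illustrative only.
channels = [
    {"name": "treasury_tax", "share": 0.7, "priority": 1, "blocks": list(range(6))},
    {"name": "non_tax", "share": 0.3, "priority": 2, "blocks": list(range(6))},
]
queue = build_inbound_queue(channels)
```

Here the high-traffic channel yields three small batches and the low-traffic channel yields two large ones, and all batches of the higher-priority channel are queued ahead of the other channel.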

[0051] Preferably, the specific process of the parallel verification task scheduling mechanism is as follows: DD1: based on the channel priority ranking result, allocate core verification threads to unqualified data blocks of high-priority channels and allocate shared verification threads to unqualified data blocks of low-priority channels, ensuring that abnormality verification of core business data is prioritized. The specific allocation process is as follows: for high-priority channels, core verification threads are allocated from the preset core verification thread pool in descending order of channel priority; for low-priority channels, the unqualified data blocks are collected together, and shared verification threads are then allocated from the preset standby verification thread pool according to batch processing needs (e.g., one shared verification thread for every 10 GB of data blocks, or one shared verification thread for every 5000 data blocks). The thread allocation and usage priority of the core verification thread pool is always higher than that of the standby verification thread pool, and the standby verification thread pool does not occupy any resources of the core verification thread pool. High-priority channels are all channels whose priority level in the channel priority ranking result is higher than the preset priority level. These channels correspond to core business scenarios with stringent requirements for data processing timeliness and security (such as the centralized treasury payment channel and the payment channels for government agencies and partner institutions). The preset priority level is set in advance by designated personnel. Low-priority channels are all channels whose priority level in the channel priority ranking result is not higher than the preset priority level.
These channels correspond to routine, non-core business scenarios (such as ordinary government data transmission channels and sporadic non-tax revenue channels). The preset core verification thread pool is a dedicated management pool of core verification threads, set in advance by designated personnel and given the highest priority in system resource scheduling. This pool contains several independent core verification threads, which are not subject to resource preemption by low-priority tasks during operation and are configured with a minimum computing power guarantee threshold to ensure low-latency, highly stable execution of core verification tasks. The preset standby verification thread pool is a unified management pool of shared verification threads reserved by the system, with a resource scheduling priority lower than that of the core verification threads. This pool contains several dynamically schedulable shared verification threads, shared by all low-priority channels, whose computing power allocation is dynamically adjusted according to the overall system load. When core verification thread resources are scarce, the shared verification threads proactively yield to ensure the execution of core tasks. When system resources are sufficient, they undertake verification tasks for unqualified data blocks from low-priority channels, improving the overall verification throughput of the platform.
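As a non-limiting illustration, the thread allocation of step DD1 might be sketched as follows. The numeric priority levels, the one-core-thread-per-channel rule, and the function name `allocate_threads` are assumptions of this sketch; the figure of one shared thread per 5000 data blocks is the example given in the text.

```python
import math


def allocate_threads(channels, preset_level, core_pool_size,
                     blocks_per_shared_thread=5000):
    """Assign core threads to high-priority channels and size the shared pool use."""
    # Assumption: a larger "level" value means a higher priority.
    high = sorted(
        (c for c in channels if c["level"] > preset_level),
        key=lambda c: c["level"],
        reverse=True,
    )
    low = [c for c in channels if c["level"] <= preset_level]

    core_assignment = {}
    # Highest priority first; channels beyond the pool size stay unassigned here.
    for thread_id, channel in zip(range(core_pool_size), high):
        core_assignment[channel["name"]] = thread_id

    # Unqualified blocks of all low-priority channels are pooled before sizing.
    pooled_blocks = sum(c["unqualified_blocks"] for c in low)
    shared_threads = math.ceil(pooled_blocks / blocks_per_shared_thread)
    return core_assignment, shared_threads


# Hypothetical channels; names, levels and counts are illustrative only.
channels = [
    {"name": "central_payment", "level": 9, "unqualified_blocks": 120},
    {"name": "agency_payment", "level": 7, "unqualified_blocks": 80},
    {"name": "gov_transfer", "level": 3, "unqualified_blocks": 7000},
    {"name": "non_tax", "level": 2, "unqualified_blocks": 4000},
]
core, shared = allocate_threads(channels, preset_level=5, core_pool_size=4)
```

With 11000 pooled blocks from the two low-priority channels, the sketch requests three shared verification threads, while each high-priority channel receives its own core thread.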

[0052] DD2: initialize the parallel verification task queue and evenly distribute the unqualified data blocks from each channel to different verification threads based on data volume (determined by counting the number of bytes in each unqualified data block), avoiding excessive load on a single thread. The parallel verification task queue is a task scheduling queue for unqualified data blocks; it stores task metadata (including data block identifier and channel information) for each channel's unqualified data blocks and updates its status synchronously with the verification tasks. DD3: synchronously start the fine-grained data verification process on each verification thread. The specific process is as follows: for data blocks that fail the multi-source data integrity evaluation index, the verification thread extracts the key fields that are missing or abnormally truncated, records the field name, abnormal location, and data block identifier, and performs data block integrity repair; data blocks that fail the multi-source data integrity evaluation index are data blocks whose index does not exceed the preset integrity threshold. For data blocks that fail the multi-source data accuracy evaluation index, multi-source data accuracy repair is performed; these are data blocks whose multi-source data accuracy evaluation index does not exceed the preset accuracy threshold. For data blocks that fail the multi-source data time sequence consistency evaluation index, data block time sequence repair is performed; these are data blocks whose multi-source data time sequence consistency evaluation index does not exceed the preset consistency threshold.
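As a non-limiting illustration, the even distribution by data volume in step DD2 might be approximated with a greedy rule: blocks are taken largest first and each is given to the thread with the fewest bytes assigned so far. The greedy strategy and the function name `distribute_by_bytes` are assumptions of this sketch; the text only requires that the load be spread evenly by byte count.

```python
import heapq


def distribute_by_bytes(blocks, thread_count):
    """Greedy balancing: give each block (largest first) to the least-loaded thread."""
    heap = [(0, t) for t in range(thread_count)]  # (assigned bytes, thread index)
    heapq.heapify(heap)
    assignment = {t: [] for t in range(thread_count)}
    for block_id, size in sorted(blocks.items(), key=lambda kv: kv[1], reverse=True):
        load, thread = heapq.heappop(heap)
        assignment[thread].append(block_id)
        heapq.heappush(heap, (load + size, thread))
    # Total bytes per thread, useful for checking the balance.
    loads = {t: sum(blocks[b] for b in ids) for t, ids in assignment.items()}
    return assignment, loads


# Hypothetical unqualified blocks: identifier -> size in bytes.
blocks = {"b1": 900, "b2": 500, "b3": 400, "b4": 300, "b5": 100}
assignment, loads = distribute_by_bytes(blocks, 2)
```

For these five blocks the two threads end up with 1200 and 1000 bytes, so neither thread carries a disproportionate share of the work.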

[0053] Specifically, the data block integrity repair process is as follows: based on the preset key field list of each channel, determine the type and standard format of the missing or abnormally truncated key fields (including field length, data type, and default value rules); retrieve the original collection backup data of the data block, and preferentially extract the original values of the missing or abnormally truncated key fields from the backup to fill the corresponding positions in the current data block; after the data block integrity repair is completed, recalculate the multi-source data integrity evaluation index. If the index is greater than the preset integrity threshold, continue to determine whether the multi-source data accuracy evaluation index is qualified; otherwise, send a prompt that the multi-source data integrity repair has failed. The specific process for multi-source data accuracy repair is as follows: based on reverse hash value comparison, locate the mismatched core fields, and synchronously retrieve the core field hash values and the corresponding original field values recorded when the data block was originally collected. The localization process is as follows: the hash set of the core fields of the current data block is matched against the hash set of the original core fields, with key-value pairs compared one by one according to field identifiers. The core fields whose key-value pairs fail to match are the mismatched fields, which achieves precise localization. The mismatched core field values of the current data block are then compared with the original field values to determine the anomaly type.
Specifically: if the anomaly type is a field value deviation introduced during transmission (such as numerical misalignment or missing characters), the original field value directly overwrites the current abnormal value; if the anomaly type is field value tampering (the original field value does not conform to the channel data transmission rules), a multi-source data accuracy evaluation index error prompt is sent. After the multi-source data accuracy repair is completed, the multi-source data accuracy evaluation index is recalculated. If it is greater than the preset accuracy threshold, the multi-source data time sequence consistency evaluation index is then checked to determine whether it is qualified; otherwise, a multi-source data accuracy repair failure prompt is sent.
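As a non-limiting illustration, the localization of mismatched core fields and the accuracy repair in [0053] might be sketched as follows. SHA-256, the dictionary representation of a data block, and the caller-supplied predicate `is_rule_compliant` (standing in for the channel data transmission rules) are assumptions of this sketch.

```python
import hashlib


def field_hashes(record: dict) -> dict:
    """Hash every core field value, keyed by field identifier."""
    return {
        k: hashlib.sha256(str(v).encode("utf-8")).hexdigest()
        for k, v in record.items()
    }


def locate_mismatches(current: dict, original_hashes: dict) -> list:
    """Return the field identifiers whose current hash differs from the original."""
    current_hashes = field_hashes(current)
    return [k for k in original_hashes if current_hashes.get(k) != original_hashes[k]]


def repair_accuracy(current: dict, original: dict, is_rule_compliant) -> tuple:
    """Overwrite deviated fields with original values; flag the rest for reporting."""
    repaired, flagged = dict(current), []
    for field in locate_mismatches(current, field_hashes(original)):
        if is_rule_compliant(field, original[field]):
            # Treated as a transmission deviation: restore the original value.
            repaired[field] = original[field]
        else:
            # Treated as tampering: report the field instead of repairing it.
            flagged.append(field)
    return repaired, flagged


# Hypothetical records; the amount lost a digit in transmission.
original = {"amount": "1500.00", "payer": "A001"}
current = {"amount": "150.00", "payer": "A001"}
mismatched = locate_mismatches(current, field_hashes(original))
repaired, flagged = repair_accuracy(current, original, lambda f, v: True)
kept, reported = repair_accuracy(current, original, lambda f, v: False)
```

Only the field whose hash differs is touched, which mirrors the field-by-field matching by identifier described above.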

[0054] Specifically, the data block time sequence repair process is as follows: based on the historical average transmission interval of the channel and the preset standard time sequence rules of the current acquisition window (for example, the difference between adjacent record timestamps lies within the preset difference range, no timestamps are in reverse order, and timestamps fall within the acquisition window time range), a compliant timestamp range for abnormal records is generated. The specific generation process is as follows: taking the timestamp of the first normal record in the current data block as the benchmark, the upper and lower fluctuation range around the theoretical timestamp of each record is determined from the time difference interval of the preset standard time sequence rules. This range is the compliant timestamp range for abnormal records. The preset standard time sequence rules are set in advance by designated personnel. The data time sequence is extracted in acquisition order, and the timestamp relationship between each record and the previous record is checked in turn. If the timestamp of a record is earlier than the timestamp of the previous record, the record is marked as a reverse-order anomaly. If the difference between the timestamp of a record and the timestamp of the previous record exceeds the time difference interval of the preset standard time sequence rules, the record is marked as a time-difference-out-of-range anomaly. For reverse-order anomalies, the affected records are reordered according to the acquisition order, the record sequence numbers are adjusted, and the order of the corresponding timestamps is updated synchronously to ensure that the time sequence is not reversed.
For time-difference-out-of-range anomalies, the abnormal timestamp is corrected into the compliant timestamp range based on the timestamps of adjacent normal records and the standard transmission interval. The specific correction process is as follows: take the timestamp of the normal record preceding the abnormal record as the starting value, and add the preset transmission interval to the starting value to obtain the compliant timestamp of the abnormal record. If the compliant timestamp is within the compliant timestamp range, replace the abnormal record's timestamp with it directly. If the compliant timestamp falls outside the compliant timestamp range, send a timestamp correction failure prompt. The preset transmission interval is set in advance by designated personnel. After the data block time sequence repair is completed, recalculate the multi-source data time sequence consistency evaluation index. If it is greater than the preset consistency threshold, mark the corresponding data block as a qualified data block, merge it into the qualified data block set, and start the multi-threaded batch data entry mechanism. Otherwise, send a parallel verification failure prompt, push it to the operation and maintenance management platform, and archive the unrepairable data block to the backup storage cluster.
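As a non-limiting illustration, the time sequence repair in [0054] might be sketched as follows. Several simplifications are assumed: timestamps are plain numbers, reverse-order anomalies are resolved by sorting the timestamps into ascending order, the first record after sorting is taken as the normal benchmark, and the compliant range of record i is its theoretical timestamp (benchmark plus i times the interval) plus or minus a tolerance. The parameter names are hypothetical.

```python
def repair_timestamps(timestamps, interval, tolerance, max_gap):
    """Reorder reversed timestamps, then correct gaps that exceed max_gap."""
    # Reverse-order anomalies: restore ascending order (simplifying assumption).
    ordered = sorted(timestamps)
    base = ordered[0]  # assumed to be the first normal record
    repaired, failed = [base], []
    for i in range(1, len(ordered)):
        ts, prev = ordered[i], repaired[-1]
        if ts - prev > max_gap:
            # Candidate: preceding normal timestamp plus the preset interval.
            candidate = prev + interval
            theoretical = base + i * interval
            if abs(candidate - theoretical) <= tolerance:
                ts = candidate
            else:
                # Outside the compliant range: leave as-is and report failure.
                failed.append(i)
        repaired.append(ts)
    return repaired, failed


# One reversed pair (110, 105) and one oversized gap before 190.
fixed, failures = repair_timestamps(
    [100, 110, 105, 130, 190], interval=10, tolerance=5, max_gap=20
)
# Drifted sequence: the candidate falls outside the compliant range.
kept, unrepaired = repair_timestamps(
    [100, 103, 106, 200], interval=10, tolerance=5, max_gap=20
)
```

In the first call the oversized gap is corrected to 140; in the second the candidate 116 lies too far from the theoretical value 130, so index 3 is reported as a correction failure.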

[0055] In this embodiment, the parallel verification task scheduling mechanism helps to achieve refined management of resources for verifying unqualified data, reduce problems such as excessive verification latency caused by single-threaded verification or disordered thread allocation, and the squeezing of core business data verification resources, thereby improving the overall verification efficiency of unqualified data and the rational utilization of verification resources in the streaming computing cluster.

[0056] As shown in Figure 4, the training diagram of the gradient boosting decision tree model for the multi-source data real-time compression and consistency verification platform based on streaming computing, the input to the gradient boosting decision tree model includes the core features of data from each channel (covering multi-source data redundancy, field density, and data type), the identifier of the currently effective compression algorithm (such as LZ4, ZSTD, or Snappy), and the business scenario type identifier (such as real-time revenue data entry or fund settlement), corresponding to the input module on the left side of the diagram. After the input data enters the training process, a weighted sample subset is first generated through weighted sampling. The sampling weights of each round are dynamically adjusted according to the learning error of the previous base decision tree; for example, samples with low compression efficiency and high resource consumption are given higher weights, so that subsequent base trees focus on fitting these error samples. Each round of weighted samples then trains a base decision tree, and the model is gradually optimized through additive model updates: starting from the initial model $F_0(x)$, the model of each round is obtained from the previous model plus the weighted output of the current base tree, $F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m)$. After $M$ rounds of iteration, the complete evaluation model is generated. The final output of the model is the adaptability score of the current compression algorithm for each channel (covering compression efficiency adaptability and resource consumption adaptability), which can be directly used to determine whether the current compression algorithm matches the business scenario requirements and provides a data basis for algorithm optimization.
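For reference, the additive update described in [0056] can be written out in the usual forward stagewise form. The loss function $L$, the sample weights $w_i$, the training pairs $(x_i, y_i)$, the sample count $N$, and the total number of rounds $M$ are notational assumptions of this sketch; the text itself names only the initial model, the per-round model, the base tree, and its weight and parameters.

```latex
\begin{aligned}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{N} w_i \, L(y_i, c), \\
(\beta_m, a_m) &= \arg\min_{\beta,\, a} \sum_{i=1}^{N} w_i \,
  L\bigl(y_i,\; F_{m-1}(x_i) + \beta \, h(x_i; a)\bigr), \\
F_m(x) &= F_{m-1}(x) + \beta_m \, h(x; a_m), \qquad m = 1, \dots, M.
\end{aligned}
```

Under these assumptions, $F_M(x)$ is the complete evaluation model whose output is read as the adaptability score.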

[0057] As shown in Figure 5, the performance comparison chart of the algorithm adaptation evaluation models for the multi-source data real-time compression and consistency verification platform based on streaming computing, the fitting performance of four ensemble learning models, namely Random Forest (RF), Extremely Randomized Trees (ET), Extreme Gradient Boosting (XGB), and Gradient Boosting Decision Tree (GBDT), is compared in the compression algorithm adaptability score prediction task. The horizontal axis represents the true adaptability score of the compression algorithm for each business channel (values range from 0 to 1, comprehensively quantifying the compatibility between compression efficiency and resource consumption), and the vertical axis represents the adaptability score predicted by the model. The blue and orange scatter points represent training and test set samples, respectively, and the black dashed line is the ideal fitting line (y = x); the closer the scatter points are to this line, the better the fit. The results show that the gradient boosting decision tree model has the most compact distribution of predicted sample points and is closest to the ideal fitting line, with the smallest prediction error (deviations of 0.03, 0.02, and 0.01 for channels A, B, and C, respectively), demonstrating the best performance. The extreme gradient boosting model has the second-best fitting performance. The random forest and extremely randomized trees models have high prediction dispersion, with a deviation of 0.08 for channel A, and cannot accurately capture the low compression efficiency of the LZ4 algorithm on the payment order flow. These results verify the effectiveness of the gradient boosting decision tree in the adaptability evaluation of compression algorithms and can provide a quantitative basis for the selection and optimization of algorithms such as LZ4, ZSTD, and Snappy.

[0058] Example 2: When multi-source data is in a peak or super-peak business phase, the original data backup of a single data block may be lost or its retrieval may time out, making it impossible to complete routine verification operations such as core field comparison and missing field filling based on the original data backup. If the repair process of Example 1 is still executed, it may lead to a significant increase in verification latency and distortion of verification results. Therefore, an alternative solution for data block integrity repair is needed. The specific process is as follows: for data blocks that fail the multi-source data integrity evaluation index, determine whether the number of missing or abnormally truncated key fields in the data block exceeds the preset field number threshold. If so, send a prompt that the number of missing key fields exceeds the limit and the repair is invalid. Otherwise, generate field completion values based on the key field characteristics (such as field value range, field format specifications, and field association mapping relationships) of qualified data blocks from the same channel and the same collection window. If there are no qualified data blocks in the same channel and collection window, the key field characteristics of qualified data blocks from the corresponding historical collection window of that channel are retrieved as a reference to generate field completion values. The specific generation process is as follows: extract the value range, standard format, and high-frequency valid values of the key fields of each qualified data block; the high-frequency valid values are the values of the corresponding key field in the qualified data blocks whose frequency of occurrence exceeds a preset frequency threshold.
If no key field value has a frequency of occurrence exceeding the preset frequency threshold, the preset number of key field values with the highest frequency of occurrence are selected as the high-frequency valid values. The preset field number threshold, preset frequency threshold, and preset number are all set in advance by designated personnel.
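As a non-limiting illustration, the selection of high-frequency valid values in [0058] might be sketched as follows. Reading "frequency of occurrence" as a relative frequency between 0 and 1, and the function name `high_frequency_values`, are assumptions of this sketch.

```python
from collections import Counter


def high_frequency_values(values, freq_threshold, top_n):
    """Values whose relative frequency exceeds the threshold, else the top_n most common."""
    counts = Counter(values)
    total = len(values)
    frequent = [v for v, c in counts.most_common() if c / total > freq_threshold]
    if frequent:
        return frequent
    # Fallback: no value exceeds the threshold, so take the preset number of
    # most frequent values instead.
    return [v for v, _ in counts.most_common(top_n)]


# Hypothetical values of one key field taken from qualified data blocks.
currency_values = ["CNY"] * 6 + ["USD"] * 3 + ["EUR"]
above_threshold = high_frequency_values(currency_values, freq_threshold=0.5, top_n=2)
fallback = high_frequency_values(currency_values, freq_threshold=0.9, top_n=2)
```

With a threshold of 0.5 only the dominant value qualifies; with a threshold of 0.9 nothing qualifies, so the two most frequent values are returned as candidates for field completion.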

[0059] For missing key fields, obtain their corresponding high-frequency valid values and, combined with the channel's preset field business rules (such as positive values for amount fields and fixed-length character codes), select the field completion values and fill in the missing key fields with them. For abnormally truncated key fields, perform compliant completion based on the field's preset standard format and length requirements, ensuring that the length of the completed field meets the preset standard. Compliant completion means that, while retaining the existing valid content of the truncated field, the truncated part is supplemented from the field completion value in accordance with the preset standard format (such as data type and character encoding) and length requirements of the corresponding key field, so that the completed field as a whole meets the business verification rules. The multi-source data integrity evaluation index is then recalculated. If it is greater than the preset integrity threshold, continue to determine whether the multi-source data accuracy evaluation index is qualified; otherwise, send a prompt that the multi-source data integrity repair has failed.

[0060] In this embodiment, a data block integrity repair scheme designed for scenarios where original backup data is lost or retrieval times out during peak business periods helps to overcome the dependence of conventional verification on original backup data. It solves problems such as verification failure and repair failure caused by abnormal backup data during peak periods, reduces the risk of increased verification latency and distorted verification results caused by improper execution of conventional repair processes, improves the effectiveness and timeliness of multi-source data integrity repair in extreme scenarios during peak business periods, and ensures uninterrupted processing of core financial business data.

Claims

1. A multi-source data real-time compression and consistency checking platform based on stream computing, characterized in that it comprises a data traffic monitoring module, a compression algorithm adaptability verification module, and a multi-source data consistency verification module: The data traffic monitoring module is used to monitor data traffic, obtain data traffic monitoring results, and evaluate and trigger compression algorithm adaptability verification based on the monitoring results, so as to verify the degree of matching between the current compression algorithm and real-time traffic characteristics. The compression algorithm adaptability verification module is used to perform compression algorithm adaptability verification after being triggered based on the monitoring results, obtain the compression algorithm adaptability verification results, and evaluate and trigger the dynamic selection of the compression algorithm and the dynamic adjustment of data blocks based on the compression algorithm adaptability verification results; The multi-source data consistency verification module is used to perform multi-source data consistency verification after the evaluation and triggering based on the verification results are completed, and to obtain the multi-source data consistency verification results. Based on the verification results, it starts a multi-threaded batch data entry mechanism and a parallel verification task scheduling mechanism. The specific process of data traffic monitoring is as follows: Core traffic data from each channel are collected synchronously using a multi-threaded parallel acquisition mechanism, including the number of data entries per second, the total number of bytes of data, the average size of data packets, and the data arrival timestamp. The average access rate, peak access rate, and cumulative amount of multi-source data for each channel within the collection window are calculated to form the core traffic indicators of each channel.
The traffic burst judgment conditions are obtained; the traffic burst judgment conditions are that the average access rate of multi-source data is greater than the preset peak trigger threshold, the peak access rate of multi-source data is greater than the preset growth threshold, or the cumulative amount of multi-source data is greater than the preset traffic limit. If the core traffic indicators of a channel meet at least one of the traffic burst judgment conditions, the value of the corresponding burst state counter is incremented to obtain the burst state accumulated value, and burst validity is judged based on the burst state accumulated value. If the core traffic indicators of each channel meet none of the traffic burst judgment conditions, the current data traffic is determined to be in a flat state, and multi-source data consistency verification is initiated. The burst validity judgment means judging whether the burst state accumulated value is greater than the preset burst count threshold. If it is, compression algorithm adaptability verification is performed; otherwise, the current data traffic is determined to be in a flat state, and multi-source data consistency verification is performed.
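As a non-limiting illustration, the traffic burst judgment recited in claim 1 might be sketched for a single channel as follows. Accumulating the burst state counter over successive collection windows, the shape of the sample data, and all names (`BurstThresholds`, `window_metrics`, `next_step`) are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class BurstThresholds:
    peak_trigger: float   # compared with the average access rate
    growth: float         # compared with the peak access rate
    traffic_limit: float  # compared with the cumulative amount
    burst_count: int      # the counter must exceed this value


def window_metrics(samples):
    """samples: (entries_per_second, bytes) pairs collected in one window."""
    rates = [rate for rate, _ in samples]
    return {
        "avg_rate": sum(rates) / len(rates),
        "peak_rate": max(rates),
        "cumulative": sum(size for _, size in samples),
    }


def next_step(windows, th):
    """Count windows meeting any burst condition, then pick the next verification."""
    counter = 0
    for samples in windows:
        m = window_metrics(samples)
        if (m["avg_rate"] > th.peak_trigger
                or m["peak_rate"] > th.growth
                or m["cumulative"] > th.traffic_limit):
            counter += 1
    if counter > th.burst_count:
        return "compression_adaptability_check"
    return "consistency_check"  # traffic treated as flat


# Hypothetical thresholds and windows.
th = BurstThresholds(peak_trigger=100, growth=150, traffic_limit=10_000, burst_count=1)
calm = [(50, 1000), (60, 1200)]
busy = [(120, 3000), (200, 5000)]
sustained = next_step([calm, busy, busy], th)
isolated = next_step([calm, busy], th)
```

A single busy window does not exceed the burst count threshold and is treated as flat traffic, while two busy windows trigger the compression algorithm adaptability check.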

2. The multi-source data real-time compression and consistency verification platform based on streaming computing as described in claim 1, characterized in that, The specific process for verifying the compatibility of the compression algorithm is as follows: Extract the core features of data from each channel and the identifier of the currently effective compression algorithm; Input the core characteristics of data from each channel, the identifier of the currently effective compression algorithm, and the identifier of the current business scenario into the preset algorithm adaptation evaluation model, and output the adaptation score of the current compression algorithm for each channel. The adaptability score covers compression efficiency adaptability and resource consumption adaptability; The process of determining the suitability of compression algorithms based on suitability scores is as follows: If the channel’s compatibility score is greater than the preset compatibility score threshold, it is determined that there is no need to change the algorithm, and the data block dynamic adjustment mechanism is triggered. If the current channel's compatibility score is not greater than the preset compatibility score threshold, the dynamic selection process of the compression algorithm will be initiated for that channel.

3. The multi-source data real-time compression and consistency verification platform based on streaming computing as described in claim 2, characterized in that, The specific process for dynamically selecting the compression algorithm is as follows: Input the core traffic metrics of the current channel, the core data features of the current channel, and the number of CPU cores into the preset compression algorithm matching mapping table, and obtain the corresponding target compression algorithm through mapping matching. The feasibility of the target compression algorithm is verified, and the specific process is as follows: Perform resource load simulation; The compression time and resource consumption are calculated based on the unit data compression time, CPU usage increment, and memory usage increment obtained after the resource load simulation. If the compression time is less than the preset time threshold and the resource consumption is less than the preset consumption threshold, then the current target compression algorithm is deemed feasible; otherwise, the current target compression algorithm is deemed infeasible. If the target compression algorithm is found to be feasible, the target compression algorithm is uploaded to the streaming data processing center to generate an algorithm switching instruction, and the channel compression algorithm is updated and configured based on the algorithm switching instruction. If the target compression algorithm is found to be infeasible, a dynamic data block adjustment mechanism will be initiated based on the current compression algorithm. After the compression algorithm is dynamically selected, a dynamic adjustment mechanism for data block partitioning is triggered.

4. The multi-source data real-time compression and consistency verification platform based on streaming computing as described in claim 3, characterized in that, The specific process of the data block dynamic adjustment mechanism is as follows: Collect the current core resource load status of the streaming computing cluster, including CPU utilization, memory usage, network transmission rate and remaining computing threads of each processing node; At the same time, the business priority configuration and traffic share of each channel are extracted, and the channel priority ranking results are generated based on the business priority configuration and traffic share to ensure that the processing resources of high priority data are given priority. Input the current core resource load status, business priority configuration and traffic percentage into the preset data block mapping table to obtain the dynamic block adjustment coefficient for each channel; Block adjustment instructions are generated based on dynamic block adjustment coefficients; The block adjustment instruction indicates an instruction to adjust the block size of the corresponding channel data step by step in the direction of decreasing data block size, with a dynamic block adjustment coefficient as the step size; The block adjustment command is sent to the block processing operator of the streaming computing framework. The operator updates the data block configuration of each channel according to the block adjustment command requirements, while continuously monitoring the compressed data processing throughput and average compression time of each block. The effectiveness of the block adjustment is evaluated based on the compressed data processing throughput and average compression time. 
The specific evaluation process is as follows: If the compressed data processing throughput is greater than the previous compressed data processing throughput and the average compression time is less than the previous average compression time, then continue to adjust the block size according to the current block adjustment instruction until the compressed data processing throughput is greater than the preset throughput threshold and the average compression time is less than the preset time threshold, then enter the multi-source data consistency check. If the above conditions are not met or the data block size is less than the preset block threshold, the adjustment record will be written to the log system and a block adjustment failure prompt will be sent.

5. The multi-source data real-time compression and consistency verification platform based on streaming computing as described in claim 4, characterized in that, The specific process of multi-source data consistency verification is as follows: For each data block in each channel, obtain indicators to evaluate the integrity of multi-source data, the accuracy of multi-source data, and the temporal consistency of multi-source data. The specific process for obtaining the indicators used to assess the integrity of multi-source data is as follows: Extract the list of preset key fields corresponding to the data of this channel, and count the total number of key fields in the list; Validate each key field in the current data block and filter the number of valid key fields; An indicator for evaluating the integrity of multi-source data is obtained by calculating the ratio of the number of valid key fields in a data block to the total number of preset key fields. The specific process for obtaining the indicators used to evaluate the accuracy of multi-source data is as follows: Select the core fields in the data block, calculate the hash value of each core field based on the hash algorithm, and generate the current core field hash set; Retrieve the core field hash value corresponding to the original data block during the original acquisition, generate the original core field hash set, compare the current core field hash set with the original core field hash set field by field, and count the number of core fields that match the comparison. An indicator for evaluating the accuracy of multi-source data is obtained by calculating the ratio of the number of core fields that match the comparison to the total number of core fields. 
The specific process for obtaining the metric used to evaluate the temporal consistency of multi-source data is as follows: Extract the timestamps of all data records in the current data block according to the order of data collection, and construct the actual time series sequence; An index for evaluating the temporal consistency of multi-source data is obtained by calculating the ratio of the number of records that conform to the preset timing rules to the total number of records in the data block; Based on the data quality qualification criteria, we first judge whether the indicators for evaluating the integrity of multi-source data, the indicators for evaluating the accuracy of multi-source data, and the indicators for evaluating the temporal consistency of multi-source data are qualified for each data block. The data quality qualification criteria are as follows: the indicator for evaluating the integrity of multi-source data is greater than a preset integrity threshold, the indicator for evaluating the accuracy of multi-source data is greater than a preset accuracy threshold, and the indicator for evaluating the temporal consistency of multi-source data is greater than a preset consistency threshold. If the quality of all data blocks in the same channel meets the data quality qualification criteria, the verification and storage operation of the data blocks in that channel shall be completed according to the original verification and storage process. If any data block in any channel fails to meet the data quality qualification criteria, quality monitoring is performed on all data blocks in that channel. Data blocks with substandard data quality are marked as unqualified data blocks, while data blocks with qualified data quality are marked as qualified data blocks. 
For the set of qualified data blocks, a multi-threaded batch data entry mechanism is initiated, while for the set of unqualified data blocks, a parallel verification task scheduling mechanism is initiated.
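
The three ratios and the threshold test in claim 5 can be illustrated with a minimal Python sketch. It is not the patented implementation. The block layout (`fields`, `original_hashes`, `timestamps`), the use of SHA-256, the reading of "valid" as "present and non-empty", the single `max_gap` timing rule, and the threshold values are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical preset thresholds; the claim gives no concrete values.
    integrity: float = 0.9
    accuracy: float = 0.9
    consistency: float = 0.9


def field_hash(value) -> str:
    # SHA-256 is an assumed choice; the claim only says "a hash algorithm".
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def integrity_indicator(fields: dict, key_fields: list) -> float:
    # Valid key fields / total preset key fields ("valid" = present and non-empty).
    valid = sum(1 for name in key_fields if fields.get(name) not in (None, ""))
    return valid / len(key_fields)


def accuracy_indicator(fields: dict, core_fields: list, original_hashes: dict) -> float:
    # Core fields whose current hash equals the hash recorded at acquisition.
    matched = sum(
        1 for name in core_fields
        if field_hash(fields.get(name)) == original_hashes.get(name)
    )
    return matched / len(core_fields)


def temporal_consistency_indicator(timestamps: list, max_gap: float) -> float:
    # Assumed timing rule: a record conforms if it does not precede the
    # previous record and follows it by at most max_gap. The first record
    # is treated as conforming.
    if not timestamps:
        return 0.0
    ok = 1
    for prev, cur in zip(timestamps, timestamps[1:]):
        if 0 <= cur - prev <= max_gap:
            ok += 1
    return ok / len(timestamps)


def route_blocks(blocks, key_fields, core_fields, max_gap, t=Thresholds()):
    """Split blocks into (qualified, unqualified) using the three thresholds."""
    qualified, unqualified = [], []
    for block in blocks:
        passed = (
            integrity_indicator(block["fields"], key_fields) > t.integrity
            and accuracy_indicator(block["fields"], core_fields,
                                   block["original_hashes"]) > t.accuracy
            and temporal_consistency_indicator(block["timestamps"], max_gap) > t.consistency
        )
        (qualified if passed else unqualified).append(block)
    return qualified, unqualified
```

In this sketch the qualified list would feed the multi-threaded batch data entry mechanism and the unqualified list would feed the parallel verification task scheduling mechanism.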

6. The multi-source data real-time compression and consistency verification platform based on streaming computing as described in claim 5, characterized in that the specific process of the multi-threaded batch data entry mechanism is as follows: for the set of qualified data blocks, each qualified data block is divided into several data batches based on the traffic share of its channel. Qualified data blocks from high-traffic channels are divided into preset small batches, and qualified data blocks from low-traffic channels are divided into preset large batches. Based on the channel priority ranking results, all channels are divided into a high-priority channel layer and a low-priority channel layer, and an inbound priority queue is generated. The qualified data batches from each channel are distributed to the corresponding storage nodes according to the inbound priority queue, and the storage nodes write the data according to the inbound priority and batch order.
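
A minimal Python sketch of the batching and queueing described in claim 6 follows. The batch sizes, the traffic-share cut-off, and the function names are hypothetical. The two-layer priority is modeled with a heap keyed on (layer, arrival order), which is one possible reading of the inbound priority queue.

```python
import heapq
from itertools import count

# Hypothetical presets; the claim does not specify sizes or a cut-off.
SMALL_BATCH = 2           # records per batch for high-traffic channels
LARGE_BATCH = 4           # records per batch for low-traffic channels
HIGH_TRAFFIC_SHARE = 0.5  # traffic share at or above which a channel is "high-traffic"


def split_into_batches(records: list, traffic_share: float) -> list:
    # High-traffic channels get small batches, low-traffic channels large ones.
    size = SMALL_BATCH if traffic_share >= HIGH_TRAFFIC_SHARE else LARGE_BATCH
    return [records[i:i + size] for i in range(0, len(records), size)]


def build_inbound_queue(blocks: dict, traffic_share: dict, high_priority: set) -> list:
    """blocks maps channel -> records of a qualified block; returns a heap."""
    heap = []
    seq = count()  # preserves batch order inside a priority layer
    for channel, records in blocks.items():
        layer = 0 if channel in high_priority else 1
        for batch in split_into_batches(records, traffic_share[channel]):
            heapq.heappush(heap, (layer, next(seq), channel, batch))
    return heap


def drain(heap: list) -> list:
    # Order in which a storage node would write the batches.
    order = []
    while heap:
        _, _, channel, batch = heapq.heappop(heap)
        order.append((channel, batch))
    return order
```

With channel `b` both high-traffic and high-priority, its two small batches are written before the single large batch of channel `a`, even though `a` was enqueued first.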

7. The multi-source data real-time compression and consistency verification platform based on streaming computing as described in claim 5, characterized in that the specific process of the parallel verification task scheduling mechanism is as follows: based on the channel priority ranking, core verification threads are allocated to unqualified data blocks from high-priority channels, while shared verification threads are allocated to unqualified data blocks from low-priority channels. The specific allocation process is as follows: for high-priority channels, core verification threads are allocated from the preset core verification thread pool in descending order of channel priority ranking. For low-priority channels, the unqualified data blocks are collected, and shared verification threads are then allocated from a preset standby verification thread pool according to batch processing requirements. The parallel verification task queue is initialized, and the unqualified data blocks from each channel are distributed evenly across the verification threads according to data volume. The data fine-grained verification process is then initiated, the specific steps of which are as follows: for data blocks that fail to meet the indicator for evaluating the integrity of multi-source data, the verification thread extracts the key fields that are missing or abnormally truncated, records the field names, abnormal locations and data block identifiers, and performs data block integrity repair. For data blocks that fail to meet the indicator for evaluating the accuracy of multi-source data, multi-source data accuracy repair is performed. For data blocks that fail to meet the indicator for evaluating the temporal consistency of multi-source data, data block timing repair is performed.
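
The two-pool scheduling in claim 7 can be sketched with Python's standard `concurrent.futures.ThreadPoolExecutor`. The pool sizes, the block layout (`id`, `channel`, `failed`), and the rule that a lower rank number means higher priority are assumptions. The repair steps are represented only by their names, and the executor's internal work queue stands in for the claim's volume-based even distribution.

```python
from concurrent.futures import ThreadPoolExecutor


def fine_grained_verify(block: dict) -> list:
    """Choose repair steps from the indicators the block failed."""
    steps = []
    if "integrity" in block["failed"]:
        steps.append("integrity_repair")
    if "accuracy" in block["failed"]:
        steps.append("accuracy_repair")
    if "consistency" in block["failed"]:
        steps.append("timing_repair")
    return steps


def schedule(blocks: list, channel_rank: dict, high_layer: set) -> dict:
    # Assumption: lower rank number = higher channel priority.
    high = sorted(
        (b for b in blocks if b["channel"] in high_layer),
        key=lambda b: channel_rank[b["channel"]],
    )
    low = [b for b in blocks if b["channel"] not in high_layer]

    results = {}
    # Hypothetical pool sizes: 4 core threads, 2 shared (standby) threads.
    with ThreadPoolExecutor(max_workers=4, thread_name_prefix="core") as core_pool, \
            ThreadPoolExecutor(max_workers=2, thread_name_prefix="shared") as shared_pool:
        futures = {}
        for b in high:  # submitted in descending channel priority
            futures[b["id"]] = core_pool.submit(fine_grained_verify, b)
        for b in low:   # collected and handed to the shared pool
            futures[b["id"]] = shared_pool.submit(fine_grained_verify, b)
        for block_id, fut in futures.items():
            results[block_id] = fut.result()
    return results
```

Keeping the pools separate means a backlog of low-priority blocks cannot occupy the threads reserved for high-priority channels, which appears to be the purpose of the core/shared split.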

8. The multi-source data real-time compression and consistency verification platform based on streaming computing as described in claim 7, characterized in that the specific process of multi-source data accuracy repair is as follows: the mismatched core fields are located by reverse hash comparison, and the core field hash values recorded at the original acquisition of the data block are retrieved together with their corresponding original field values. The specific location process is as follows: the core field hash set of the current data block is reverse-matched against the original core field hash set, with key-value pairs matched one by one according to field identifier. The core field corresponding to a key-value pair that fails to match is a mismatched field. The system compares the current field values in the data block with the original core field values to determine the anomaly type. The specific process is as follows: if the anomaly type is a field value deviation generated during transmission, the original field value is used to overwrite the current abnormal value. If the anomaly type is field value tampering, an error message is sent indicating that the indicator for evaluating the accuracy of multi-source data is incorrect. After the multi-source data accuracy repair is completed, the indicator for evaluating the accuracy of multi-source data is recalculated. If the indicator for evaluating the accuracy of multi-source data is greater than the preset accuracy threshold, the indicator for evaluating the temporal consistency of multi-source data is then checked to determine whether it is qualified. Otherwise, a multi-source data accuracy repair failure prompt is sent.
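
The accuracy repair steps above can be sketched in Python as follows. The claim does not say how a transmission deviation is distinguished from tampering, so the sketch takes a caller-supplied `classify` function, which is purely hypothetical. SHA-256 and all function names are likewise assumptions.

```python
import hashlib


def field_hash(value) -> str:
    # Assumed hash choice; the claim does not name an algorithm.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def locate_mismatches(current_fields: dict, original_hashes: dict) -> list:
    # Match key-value pairs by field identifier; a failed match is a mismatch.
    return [
        name for name, original in original_hashes.items()
        if field_hash(current_fields.get(name)) != original
    ]


def repair_accuracy(current_fields, original_values, original_hashes, classify):
    """Return (repaired_fields, recalculated_accuracy, tampered_field_names).

    classify(name, current, original) is a hypothetical callback returning
    "transmission_deviation" or "tampering".
    """
    repaired = dict(current_fields)
    tampered = []
    for name in locate_mismatches(current_fields, original_hashes):
        kind = classify(name, current_fields.get(name), original_values[name])
        if kind == "transmission_deviation":
            repaired[name] = original_values[name]  # overwrite with original value
        else:
            tampered.append(name)                   # would trigger the error message
    still_wrong = locate_mismatches(repaired, original_hashes)
    accuracy = (len(original_hashes) - len(still_wrong)) / len(original_hashes)
    return repaired, accuracy, tampered
```

The recalculated accuracy would then be compared with the preset accuracy threshold to decide between continuing to the temporal consistency check and sending the repair failure prompt.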
The specific process of data block timing repair is as follows: based on the historical average transmission interval of the channel and the preset standard timing rules of the current collection window, a compliant timestamp range for abnormal records is generated. The specific generation process is as follows: based on the timestamp of the first normal record in the current data block and the time difference range of the preset standard timing rules, the upper and lower fluctuation bounds of the theoretical timestamp of each record are determined; this range is the compliant timestamp range for abnormal records. The time series of the data is extracted in the order of collection, and the timestamp of each record is compared in turn with that of the previous record. If the timestamp of a record is earlier than that of the previous record, the record is marked as a reverse order anomaly. If the difference between the timestamp of a record and that of the previous record exceeds the time difference range of the preset standard timing rules, the record is marked as a time difference out-of-range anomaly. For reverse order anomalies, the affected records are reordered according to the order of collection, the record sequence numbers are adjusted, and the order of the corresponding timestamps is updated synchronously so that the time series contains no reversals. For time difference out-of-range anomalies, the abnormal timestamp is corrected into the compliant timestamp range based on the timestamps of adjacent normal records and the standard transmission interval. After the data block timing repair is completed, the indicator for evaluating the temporal consistency of multi-source data is recalculated.
If the indicator for evaluating the temporal consistency of multi-source data is greater than the preset consistency threshold, the corresponding data block is marked as a qualified data block and merged into the set of qualified data blocks, and the multi-threaded batch data entry mechanism is then initiated. Otherwise, a parallel verification failure prompt is sent and pushed to the operation and maintenance management platform, and the unrepairable data blocks are archived to the backup storage cluster.
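
The timing repair in claim 8 can be illustrated with the following simplified Python sketch. Timestamps are plain numbers. Sorting is an assumed stand-in for the claim's reordering of reversed records. Oversized gaps are corrected to the previous timestamp plus a hypothetical `standard_interval`, which is only one way to land inside a compliant range.

```python
def classify_timing(timestamps: list, max_gap: float) -> list:
    """Label each record "ok", "reverse", or "gap" against its predecessor."""
    labels = ["ok"] if timestamps else []  # first record has no predecessor
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur < prev:
            labels.append("reverse")  # reverse order anomaly
        elif cur - prev > max_gap:
            labels.append("gap")      # time difference out-of-range anomaly
        else:
            labels.append("ok")
    return labels


def consistency_indicator(timestamps: list, max_gap: float) -> float:
    labels = classify_timing(timestamps, max_gap)
    return labels.count("ok") / len(labels) if labels else 0.0


def repair_timing(timestamps: list, standard_interval: float, max_gap: float) -> list:
    # Step 1: remove reversals (assumed here to mean sorting the timestamps).
    repaired = sorted(timestamps)
    # Step 2: pull each out-of-range timestamp back to one standard interval
    # after its already-repaired predecessor.
    for i in range(1, len(repaired)):
        if repaired[i] - repaired[i - 1] > max_gap:
            repaired[i] = repaired[i - 1] + standard_interval
    return repaired
```

Recomputing `consistency_indicator` on the repaired series and comparing it with the preset consistency threshold corresponds to the final decision between merging the block into the qualified set and archiving it. Note that step 2 cascades: once one timestamp is pulled back, every later timestamp is judged against the corrected value.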

9. The multi-source data real-time compression and consistency verification platform based on streaming computing as described in claim 7, characterized in that the data block integrity repair further includes: for data blocks that fail to meet the indicator for evaluating the integrity of multi-source data, it is determined whether the number of missing or abnormally truncated key fields in the data block exceeds a preset threshold. If so, a prompt is sent indicating that repair of the missing key fields in the multi-source data is invalid. Otherwise, field completion values are generated based on the key field characteristics of qualified data blocks from the same channel and acquisition window. If there are no qualified data blocks from the same channel and acquisition window, the key field characteristics of qualified data blocks from the same historical acquisition window of that channel are retrieved as a reference to generate field completion values. The specific generation process is as follows: extract the value range, standard format, and high-frequency valid values of the key fields of each qualified data block; for missing key fields, obtain their corresponding high-frequency valid values, combine them with the channel's preset field business rules, select the field completion values, and fill in the missing key fields with the field completion values; for key fields that are abnormally truncated, complete the truncated content in accordance with the preset standard format and length requirements of the field, so that the length of the completed field meets the preset standard. The indicator for evaluating the integrity of multi-source data is then recalculated. If the indicator for evaluating the integrity of multi-source data is greater than the preset integrity threshold, the indicator for evaluating the accuracy of multi-source data is then checked to determine whether it is qualified.
Otherwise, a repair failure prompt for the indicator for evaluating the integrity of multi-source data is sent.
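
The missing-field part of the integrity repair in claim 9 can be sketched in Python as follows. Only the high-frequency-value step is shown. The business-rule filter and the completion of truncated fields are omitted. The function names, the `max_missing` parameter, and the treatment of empty strings as missing are assumptions.

```python
from collections import Counter


def high_frequency_value(reference_blocks: list, field: str):
    """Most common non-empty value of a field across qualified reference blocks."""
    values = [b[field] for b in reference_blocks if b.get(field) not in (None, "")]
    return Counter(values).most_common(1)[0][0] if values else None


def repair_integrity(block: dict, key_fields: list, reference_blocks: list,
                     max_missing: int):
    """Return (repaired_block, recalculated_integrity), or None when too many
    key fields are missing (the case where the claim sends the invalid prompt)."""
    missing = [f for f in key_fields if block.get(f) in (None, "")]
    if len(missing) > max_missing:
        return None
    repaired = dict(block)
    for field in missing:
        candidate = high_frequency_value(reference_blocks, field)
        if candidate is not None:
            repaired[field] = candidate  # fill with the field completion value
    valid = sum(1 for f in key_fields if repaired.get(f) not in (None, ""))
    return repaired, valid / len(key_fields)
```

The `reference_blocks` argument would hold qualified blocks from the same channel and acquisition window or, when none exist, from the same historical acquisition window of that channel.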