An efficient statistical analysis method for multi-source data based on a distributed parallel architecture
By introducing a chaos-aware antifragile scheduling preparation mechanism into a distributed parallel architecture, the method dynamically adjusts task allocation and data flow. This addresses the insufficient adaptability of existing technology to micro-uncertainties within the system and enables efficient, reliable statistical analysis of multi-source data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI MARKWAY INTELLIGENT TECHNOLOGY CO LTD
- Filing Date
- 2026-03-06
- Publication Date
- 2026-06-30
AI Technical Summary
When facing efficient statistical analysis of multi-source heterogeneous data, existing distributed parallel architectures lack the ability to proactively perceive and dynamically adapt to the micro-uncertainties within the system during operation, leading to problems such as inconsistent task execution speeds, data synchronization delays, and node timeout failures. Traditional fault-tolerance mechanisms cannot effectively solve the problems of overall efficiency decline and tail latency surge.
By employing a chaos-aware antifragile scheduling preparation mechanism, simulated micro-disturbance factors such as latency, packet loss, and memory pressure are injected into computing nodes and network links. The system monitors node status in real time, generates self-healing task scheduling strategies, dynamically adjusts task allocation and data flow, identifies and isolates high vulnerability points, and ensures that the system has proactive avoidance and fault tolerance capabilities under high-load computing.
It enhances the inherent robustness and operational stability of complex distributed analysis systems under long-term, high-concurrency tasks, ensuring that tasks are reliably completed within a preset time and avoiding load skew and task stagnation caused by resource contention and local performance degradation.
Figure CN122309302A_ABST
Description
Technical Field
[0001] This invention relates to the field of multi-source data statistical technology, and more specifically, to a method for efficient statistical analysis of multi-source data based on a distributed parallel architecture.
Background Technology
[0002] In the era of massive data, efficient statistical analysis of multi-source heterogeneous data based on distributed parallel architecture has become a key technology supporting intelligent decision-making in various fields. Existing technologies generally adopt a combination of "data parallelism" and "task parallelism," which significantly improves processing throughput by distributing data and computing tasks to a large number of nodes for simultaneous execution, and basically solves the efficiency bottleneck caused by insufficient computing power of a single machine.
[0003] As data volumes continue to grow, data sources become increasingly complex, and the real-time requirements of analytical tasks continue to rise, the deep-seated limitations of the classic distributed architecture, which is centered on "static resource partitioning" and "post-fault recovery," are becoming increasingly apparent. Specifically, existing methods typically assume a relatively stable internal environment within the computing cluster, and their task scheduling strategies are often based on static or semi-static planning using initial resource snapshots. This lacks proactive awareness and dynamic adaptation to the micro-uncertainties within the system during runtime. These uncertainties include, but are not limited to: instantaneous performance degradation of individual computing nodes due to garbage collection or resource contention; unpredictable instantaneous congestion or packet loss in network links; and load imbalances between different nodes caused by fluctuations in the rates of multi-source data streams. In long-duration, high-concurrency statistical analysis tasks, these micro-perturbations can lead to inconsistent task execution speeds, data synchronization delays, and even timeout failures of individual nodes. While traditional fault-tolerance mechanisms (such as task retries) can handle explicit failures, they cannot prevent overall efficiency decline and tail latency spikes caused by performance degradation at local "vulnerable points" within the system. Essentially, they are passive and lagging response strategies, making it difficult to guarantee the reliable completion of complex analytical tasks within a predetermined timeframe. Therefore, this invention proposes an efficient statistical analysis method for multi-source data based on a distributed parallel architecture to solve the above problems.
Summary of the Invention
[0004] To achieve the above objectives, the present invention provides the following technical solution: An efficient statistical analysis method for multi-source data based on a distributed parallel architecture includes the following steps: Heterogeneous raw data is collected from multiple data interfaces and preprocessed to generate a standardized dataset for use by a pre-defined statistical analysis model. Before distributing standardized datasets to distributed computing node clusters for parallel statistical modeling and analysis, chaos-aware antifragile scheduling preparation is performed to dynamically generate self-healing task scheduling strategies. Based on the generated self-healing task scheduling strategy, the standardized dataset and statistical analysis model tasks are dynamically allocated to the selected computing node cluster. The distributed parallel computing framework is used to synchronously execute the parallel operation of the statistical analysis model on each node, and the node status is continuously monitored during the calculation process. The task allocation and data flow are adjusted in real time based on the updated vulnerability assessment results. The output results of all computing nodes are aggregated to generate a final statistical analysis report and visualization charts. Based on the data in the statistical analysis report, combined with preset data types and threshold rules, corresponding graded risk warning signals are generated.
[0005] In a preferred embodiment, preprocessing refers to standardized processing operations that include data cleaning, format conversion, and feature dimensionality reduction.
[0006] In a preferred embodiment, the preparation process includes: injecting controlled micro-chaotic perturbation factors, including random latency, selective packet loss, and simulated memory pressure, into computing nodes and data transmission paths; calculating and generating a system vulnerability assessment map in real time based on the response performance of each node and link to the chaotic perturbation factors, and identifying three types of system vulnerabilities: high, medium, and low; and dynamically generating a self-healing task scheduling strategy based on the vulnerability assessment map.
[0007] In a preferred embodiment, based on the system vulnerability assessment map generated during the chaos perception preparation process and parallel computing process, the identified high and medium-level system vulnerabilities are located and recorded, triggering a preset automated repair process.
[0008] In a preferred embodiment, the controlled micro-chaotic perturbation factors include a delay perturbation factor, a packet loss perturbation factor, and a memory perturbation factor. The delay perturbation factor simulates network jitter by generating random delay values according to a Poisson distribution model and dynamically injecting these delay values during data packet transmission. The packet loss perturbation factor simulates network packet loss by randomly dropping a specified proportion of data packets along the data transmission path according to a preset packet loss rate. The memory perturbation factor simulates sudden memory pressure by dynamically writing padding data blocks of a specified size into the user-space memory of the target computing node and holding them there.
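As an illustration of the delay and packet loss factors, the following minimal Python sketch draws Poisson-distributed delay values and drops packets at a preset rate. The function names, the millisecond unit, and the use of Knuth's sampling method are assumptions made for this example; the source does not specify how the values are generated, and this sampler is only suitable for small mean delays.

```python
import math
import random


def poisson_delay_ms(mean_ms, rng):
    """Draw one Poisson-distributed delay in whole milliseconds (Knuth's method).

    Intended for small means only; exp(-mean_ms) underflows for very large means.
    """
    limit = math.exp(-mean_ms)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def perturb_packets(packets, mean_delay_ms, loss_rate, rng):
    """Return (packet, injected_delay_ms) pairs for packets that survive the loss filter."""
    delivered = []
    for packet in packets:
        if rng.random() < loss_rate:
            continue  # packet loss factor: this packet is dropped
        delivered.append((packet, poisson_delay_ms(mean_delay_ms, rng)))
    return delivered


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a perturbation run can be reproduced
    for packet, delay in perturb_packets(range(5), 5.0, 0.2, rng):
        print(packet, delay)
```

Passing a seeded `random.Random` keeps the disturbance controlled in the sense that the same perturbation sequence can be replayed when comparing nodes.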
[0009] In a preferred embodiment, calculating and generating the system vulnerability assessment map in real time specifically includes the following. The changes in the performance metrics of each computing node and data transmission link are monitored and recorded after the controlled micro-chaotic perturbation factors are injected. The performance metrics include task processing latency increment, data packet retransmission rate, memory reclamation frequency, and CPU utilization fluctuation. Based on these changes, a weighted scoring method is used to calculate the vulnerability score of each computing node and data transmission link in real time: the change value of each performance metric is normalized, multiplied by its corresponding weight coefficient, and the products are summed. The weight coefficients are preset according to the real-time requirements and data consistency requirements of the statistical analysis task. Based on the vulnerability scores of each computing node and data transmission link, a system vulnerability assessment map is generated over the computing nodes and the network topology.
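The weighted scoring step can be sketched in Python as follows. Min-max normalization across components is an assumption, since the source only says the change values are normalized; the metric names and weights in the usage example are likewise hypothetical.

```python
def normalize(values):
    """Min-max normalize one metric across all components to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # no spread: the metric does not discriminate
    return [(v - low) / (high - low) for v in values]


def vulnerability_scores(metric_changes, weights):
    """Compute a weighted vulnerability score per component.

    metric_changes: {component: {metric: change_value}}
    weights:        {metric: weight_coefficient}
    """
    components = list(metric_changes)
    scores = {component: 0.0 for component in components}
    for metric, weight in weights.items():
        normalized = normalize([metric_changes[c][metric] for c in components])
        for component, value in zip(components, normalized):
            scores[component] += weight * value
    return scores


if __name__ == "__main__":
    changes = {
        "node-a": {"latency_increment": 10.0, "retransmission_rate": 0.01},
        "node-b": {"latency_increment": 50.0, "retransmission_rate": 0.05},
    }
    print(vulnerability_scores(changes, {"latency_increment": 0.7, "retransmission_rate": 0.3}))
```

Links can be scored with the same function by treating each link as a component with its own metric changes.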
[0010] In a preferred embodiment, after the system vulnerability assessment map is generated, the nodes and links are classified into vulnerability levels based on the calculated vulnerability scores. A first threshold and a second threshold are set, where the first threshold is greater than the second threshold. Nodes and links whose vulnerability scores exceed the first threshold are classified as high-vulnerability points. Nodes and links whose vulnerability scores are below the first threshold but above the second threshold are classified as medium-vulnerability points. Nodes and links whose vulnerability scores are below the second threshold are classified as low-vulnerability points. In the system vulnerability assessment map, high-vulnerability, medium-vulnerability, and low-vulnerability points are marked with different labels.
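A two-threshold classification of this kind might look like the Python sketch below. The source does not say how a score exactly equal to a threshold is handled; assigning it to the lower level is an assumption of this example, as are the threshold values used in the demonstration.

```python
def classify(score, first_threshold, second_threshold):
    """Map a vulnerability score to 'high', 'medium', or 'low'.

    Assumption: a score exactly equal to a threshold falls into the lower level.
    """
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be greater than the second")
    if score > first_threshold:
        return "high"
    if score > second_threshold:
        return "medium"
    return "low"


if __name__ == "__main__":
    for score in (0.9, 0.5, 0.1):
        print(score, classify(score, first_threshold=0.7, second_threshold=0.3))
```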
[0011] In a preferred embodiment, the self-healing task scheduling strategy is dynamically generated as follows. Based on the topological locations and interconnections of high-vulnerability, medium-vulnerability, and low-vulnerability points in the map, a multi-objective task scheduling scheme is generated. The scheme aims to maximize the overall computational throughput of the system and minimize the load on high-vulnerability points, while ensuring that all statistical analysis model tasks are completed within the preset deadline. When the scheduling strategy is generated, data flows passing to or originating from high-vulnerability points are rerouted through medium-vulnerability and low-vulnerability points, and resource usage upper limits and checkpoint setting instructions are added to computing tasks assigned to medium-vulnerability points. The self-healing task scheduling strategy is then parsed. Based on its requirements regarding the distribution of computing resources, load constraints, and the avoidance of high-vulnerability points, the standardized dataset is split into data shards whose size and number match those requirements. Following the path planning and node load constraints specified in the strategy, each data shard and its corresponding statistical analysis model task descriptor are distributed to selected computing nodes that are not high-vulnerability points.
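The distribution rule (skip high-vulnerability nodes, cap the load on medium-vulnerability nodes) can be illustrated with a simple greedy heuristic in Python. This is a stand-in for illustration only, not the multi-objective scheme the text describes; `medium_cap` and the shard sizes are hypothetical.

```python
def assign_shards(shard_sizes, node_levels, medium_cap):
    """Greedy least-loaded shard assignment that avoids high-vulnerability nodes.

    shard_sizes: {shard_id: size}
    node_levels: {node: 'high' | 'medium' | 'low'}
    medium_cap:  upper limit on the total load of any medium-vulnerability node
    Returns (plan, loads), where plan maps shard_id to node.
    """
    loads = {node: 0 for node, level in node_levels.items() if level != "high"}
    if not loads:
        raise ValueError("no eligible computing nodes")
    plan = {}
    # Place the largest shards first so the small ones fill the remaining gaps.
    for shard_id, size in sorted(shard_sizes.items(), key=lambda kv: -kv[1]):
        candidates = [
            node for node in loads
            if node_levels[node] == "low" or loads[node] + size <= medium_cap
        ]
        if not candidates:
            raise ValueError(f"no node can accept shard {shard_id}")
        target = min(candidates, key=lambda node: (loads[node], node))
        plan[shard_id] = target
        loads[target] += size
    return plan, loads


if __name__ == "__main__":
    levels = {"a": "high", "b": "medium", "c": "low"}
    print(assign_shards({"s1": 40, "s2": 30, "s3": 30, "s4": 20}, levels, medium_cap=50))
```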
[0012] In a preferred embodiment, the outputs of all computing nodes are aggregated to generate a final statistical analysis report and visualization charts, specifically including: The aggregate operators built into the distributed computing framework are used to perform a global reduction of the output results of all computing nodes. The aggregate operators are selected from either the reduction aggregation operation based on key-value pairs or the structured query aggregation operation based on data frames. The reduction aggregation operation based on key-value pairs specifically involves merging and accumulating the intermediate statistical results output by each node according to the statistical dimension key through the Reduce phase of the MapReduce framework or the reduceByKey operation of Apache Spark to obtain global statistical indicators. The structured query aggregation operation based on data frames specifically involves performing grouped statistical calculations on a distributed dataset through groupBy and aggregation functions of Apache Spark SQL to generate a structured result dataset containing summary statistics. The global statistical indicators or structured result datasets are formatted according to a preset report template, and the visualization engine is called to generate corresponding statistical charts, forming the final statistical analysis report.
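To show what a key-based reduction computes, the sketch below merges per-node partial statistics by statistical dimension key in plain Python. It imitates the semantics of a reduce-by-key step locally and does not call Apache Spark or MapReduce; the `(count, total)` tuple layout and the road-segment keys are assumptions for this example.

```python
def reduce_by_key(partitions, combine):
    """Merge (key, value) pairs from every partition using an associative combine function."""
    merged = {}
    for partition in partitions:
        for key, value in partition:
            merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def merge_stats(left, right):
    """Combine two (count, total) partial results for the same key."""
    return left[0] + right[0], left[1] + right[1]


if __name__ == "__main__":
    # Each inner list stands for the intermediate output of one computing node.
    node_outputs = [
        [("seg-1", (2, 80.0)), ("seg-2", (1, 30.0))],
        [("seg-1", (1, 40.0))],
    ]
    totals = reduce_by_key(node_outputs, merge_stats)
    means = {key: total / count for key, (count, total) in totals.items()}
    print(means)
```

Carrying `(count, total)` rather than a per-node mean keeps the combine function associative, which is what allows partial results to be merged in any order.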
[0013] The technical effects and advantages of this invention are as follows: This invention introduces a chaos-aware antifragile scheduling preparatory mechanism, integrating the proactive perturbation testing concept from chaos engineering into the resource scheduling field before the execution of distributed statistical analysis tasks. Instead of passively waiting for failures to occur, this method dynamically injects simulated micro-perturbation factors such as latency, packet loss, and memory pressure into computing nodes and network links under controlled conditions, proactively stimulating and detecting potential weaknesses in the system. By collecting real-time performance response data of each node and link under perturbation and performing quantitative evaluation and graphical representation, the system can accurately draw a dynamic system vulnerability heatmap before the actual high-load computing begins. This transforms system management from traditional predictive scheduling based on static resource views to dynamic adaptive scheduling based on real-time resilience assessment. Consequently, it possesses proactive avoidance and fault tolerance capabilities when facing unavoidable uncertainties in real-world operating environments, such as hardware performance fluctuations and instantaneous network congestion, fundamentally improving the inherent robustness and operational stability of complex distributed analysis systems under long-term, high-concurrency tasks.
[0014] This invention dynamically constructs and executes a self-healing task scheduling strategy based on a real-time generated system vulnerability assessment map, achieving intelligent and precise resource allocation. The process first temporarily isolates identified high-vulnerability nodes from the currently available resource pool to prevent them from affecting the overall task flow. Subsequently, the scheduling core dynamically calculates the optimal task allocation scheme based on the real-time performance capacity, data locality, and network topology of the remaining medium- and low-vulnerability nodes using a multi-objective optimization algorithm. This scheme not only pursues maximizing overall computational throughput but also emphasizes ensuring no node overload, particularly preventing medium-vulnerability nodes from evolving into new performance bottlenecks due to improper task allocation. During task execution, the system continuously monitors node status. Once a node's performance deviates from expectations or its vulnerability level increases, it triggers real-time fine-tuning of the scheduling strategy and task migration. This dynamic feedback adjustment mechanism, spanning the entire task lifecycle, effectively solves problems such as load skew and task stagnation caused by resource contention and local performance degradation in traditional static scheduling, thereby ensuring that massive data statistical analysis tasks can be completed within a preset time with higher resource utilization and a more reliable process.
Attached Figure Description
[0015] To facilitate understanding by those skilled in the art, the present invention is further described below with reference to the accompanying drawing. Figure 1 is a schematic diagram of the method for efficient statistical analysis of multi-source data based on a distributed parallel architecture described in this invention.
Detailed Implementation
[0016] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.
[0017] With reference to Figure 1, the following example is provided. Example 1: An efficient statistical analysis method for multi-source data based on a distributed parallel architecture, comprising the following steps: Heterogeneous raw data is collected from multiple data interfaces and preprocessed to generate a standardized dataset for use by a pre-defined statistical analysis model. Before distributing standardized datasets to distributed computing node clusters for parallel statistical modeling and analysis, chaos-aware antifragile scheduling preparation is performed to dynamically generate self-healing task scheduling strategies. Based on the generated self-healing task scheduling strategy, the standardized dataset and statistical analysis model tasks are dynamically allocated to the selected computing node cluster. The distributed parallel computing framework is used to synchronously execute the parallel operation of the statistical analysis model on each node, and the node status is continuously monitored during the calculation process. The task allocation and data flow are adjusted in real time based on the updated vulnerability assessment results. The output results of all computing nodes are aggregated to generate a final statistical analysis report and visualization charts. Based on the data in the statistical analysis report, combined with preset data types and threshold rules, corresponding graded risk warning signals are generated.
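The final step, generating graded risk warning signals from preset data types and threshold rules, could be sketched in Python as below. The data type name, the threshold values, and the grade labels are all hypothetical; the source does not define them.

```python
import bisect

# Hypothetical rule table: data type -> (ascending thresholds, grades).
# There is always one more grade than there are thresholds.
WARNING_RULES = {
    "congestion_index": ([0.6, 0.8, 0.9], ["none", "blue", "yellow", "red"]),
}


def warning_grade(data_type, value, rules=WARNING_RULES):
    """Return the warning grade for a report value; reaching a threshold triggers its grade."""
    thresholds, grades = rules[data_type]
    return grades[bisect.bisect_right(thresholds, value)]


if __name__ == "__main__":
    for value in (0.5, 0.6, 0.85, 0.95):
        print(value, warning_grade("congestion_index", value))
```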
[0018] In one specific implementation, preprocessing refers to standardized processing operations including data cleaning, format conversion, and feature dimensionality reduction. First, diverse and heterogeneous raw data streams are accessed through various data interfaces. For example, in intelligent transportation scenarios, this data includes structured traffic flow records from road surface induction coils, semi-structured trajectory messages from vehicle GPS terminals, and unstructured event description text generated from traffic camera image analysis. The accessed raw data is then imported into a unified buffer storage area and undergoes preliminary integrity verification to identify obvious data format misalignments or transmission interruptions.
[0019] It is worth noting that in existing technological practices, the specific techniques for data cleaning, format conversion, and feature dimensionality reduction can be selected and combined according to different application scenarios, data characteristics, and resource constraints. For example, missing value handling can employ various methods such as mean imputation, median imputation, or model-based predictive imputation; data format standardization can be defined according to industry standards or internal specifications; and feature dimensionality reduction can utilize different algorithms such as principal component analysis, linear discriminant analysis, or model-based feature importance ranking. The ultimate goal of these techniques is common: to extract standardized datasets with uniform structure, controllable quality, and appropriate dimensionality from the original heterogeneous data, facilitating efficient processing by subsequent statistical analysis models. Similarly, the pre-defined statistical analysis models themselves have a wide range of choices in existing technologies, such as logistic regression, support vector machines, random forests, gradient boosting decision trees, or various neural network models, all of which can be selected according to specific analytical tasks and performance requirements.
[0020] In this embodiment, a deep cleaning operation is then performed on the raw data in the buffer storage area. Missing records in the structured traffic flow data are filled using a neighbor interpolation method based on time series similarity: the missing value is estimated from the traffic flow values of the same historical period under similar traffic conditions. Abnormal coordinate points in the semi-structured trajectory data, such as points with instantaneous speed jumps or locations that deviate significantly from the road network, are identified and removed using a standard deviation threshold based on the statistical distribution. Specifically, the mean and standard deviation of the speed sequence of consecutive trajectory points are calculated, and data points that deviate from the mean by more than three standard deviations are judged abnormal and removed. At the same time, globally unique identifiers are compared across the data from all sources to remove records that were collected repeatedly by different data sources.
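The three-standard-deviation rule for trajectory speeds can be illustrated with the short Python sketch below. Using the population standard deviation and a default multiplier of 3 are choices made for this example; the function name is hypothetical.

```python
import statistics


def remove_speed_outliers(speeds, k=3.0):
    """Drop speed samples that deviate from the mean by more than k standard deviations."""
    if len(speeds) < 2:
        return list(speeds)
    mean = statistics.fmean(speeds)
    std = statistics.pstdev(speeds)
    if std == 0:
        return list(speeds)  # all samples identical, nothing to remove
    return [s for s in speeds if abs(s - mean) <= k * std]


if __name__ == "__main__":
    samples = [50.0] * 19 + [200.0]  # one instantaneous speed jump
    print(len(remove_speed_outliers(samples)))
```

With very short sequences a single extreme point inflates the standard deviation enough to hide itself, so the rule is only meaningful on reasonably long runs of consecutive points.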
[0021] After cleaning, refined format conversion and semantic integration are performed. Track messages from GPS devices from different manufacturers are parsed and reassembled according to a unified spatiotemporal data model. All timestamps are converted to the UTC standard format, and all geographic coordinates are converted to latitude and longitude representation in the WGS-84 coordinate system. Unstructured text event descriptions, such as congestion and accidents, are converted into structured standard event codes using natural language processing technology. Based on this, the cleaned traffic flow data, standardized trajectory data, and event code data are correlated and fused according to key fields such as road segment ID and time window to form a wide-table dataset with consistent semantics.
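A minimal Python sketch of the timestamp unification and key-based fusion follows. The record layout (`segment_id`, `time`, `value`), the five-minute window, and the function names are assumptions for illustration; the window length should divide 60 for the flooring logic to make sense.

```python
from datetime import datetime, timezone


def to_utc(timestamp_iso):
    """Parse an ISO 8601 timestamp that carries an offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp_iso)
    if parsed.tzinfo is None:
        raise ValueError("timestamp must carry a UTC offset")
    return parsed.astimezone(timezone.utc)


def window_key(segment_id, timestamp_iso, window_minutes=5):
    """Build the join key: road segment ID plus the start of its UTC time window."""
    utc = to_utc(timestamp_iso)
    floored = utc.minute - utc.minute % window_minutes
    return segment_id, utc.replace(minute=floored, second=0, microsecond=0)


def fuse(sources, window_minutes=5):
    """Join records from several sources into wide rows keyed by (segment, window).

    sources: {field_name: [{"segment_id": ..., "time": ..., "value": ...}, ...]}
    """
    wide = {}
    for field, records in sources.items():
        for record in records:
            key = window_key(record["segment_id"], record["time"], window_minutes)
            wide.setdefault(key, {})[field] = record["value"]
    return wide


if __name__ == "__main__":
    rows = fuse({
        "flow": [{"segment_id": "S1", "time": "2024-01-15T08:32:10+08:00", "value": 120}],
        "event_code": [{"segment_id": "S1", "time": "2024-01-15T00:34:00+00:00", "value": "E02"}],
    })
    print(rows)
```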
[0022] Finally, feature engineering and dimensionality reduction are performed on the fused wide-table dataset for analysis. This core step is supported by different algorithm implementations. For example, when using the Markowitz algorithm as the preset statistical analysis model, the system calls its built-in feature selection submodule. This module automatically evaluates the significance of each feature dimension's impact on the target variable (such as congestion level) based on the principle of analysis of variance. For instance, it calculates the impact of 56 initially selected feature dimensions (including traffic flow, average speed, lane occupancy, weather indicators, event types, etc.) according to their variance contribution rate, retaining only 18 core feature dimensions with a cumulative contribution rate exceeding 95%, such as peak-hour traffic flow change rate and spatial average speed dispersion, thus efficiently eliminating redundant features. As a comparative example, if other conventional machine learning algorithms such as random forests are used, they rely on their built-in feature importance evaluation function, selecting features by calculating their split contribution in a large number of decision trees. Their computational efficiency and automation level under massive data are usually lower than the Markowitz algorithm feature selection mechanism designed specifically for high-dimensional data. After this step, a high-quality, low-dimensional standardized dataset is finally output, directly usable for subsequent distributed parallel modeling and analysis.
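The retention rule, keeping the top-ranked features until their cumulative contribution rate reaches the target, can be sketched generically in Python. This does not reproduce the feature selection submodule named above; the contribution values are hypothetical inputs that would come from whatever variance analysis is used.

```python
def select_by_cumulative_contribution(contributions, target=0.95):
    """Keep the highest-contribution features until their cumulative share reaches target.

    contributions: {feature_name: non-negative contribution value}
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must sum to a positive value")
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, value in ranked:
        if cumulative >= target:
            break
        selected.append(name)
        cumulative += value / total
    return selected


if __name__ == "__main__":
    shares = {"flow_change_rate": 60, "speed_dispersion": 30, "occupancy": 6, "weather": 4}
    print(select_by_cumulative_contribution(shares))
```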
[0023] The preparation process includes: injecting controlled micro-chaotic perturbation factors, including random latency, selective packet loss, and simulated memory pressure, into computing nodes and data transmission paths; calculating and generating a system vulnerability assessment map in real time based on the response performance of each node and link to the chaotic perturbation factors, and identifying three types of system vulnerabilities: high, medium, and low; and dynamically generating a self-healing task scheduling strategy based on the vulnerability assessment map.
[0024] In one specific implementation, when the system vulnerability assessment map generated during the chaos perception preparation process and parallel computing continuously identifies high- or medium-vulnerability points, a pre-defined automated repair process is triggered. This process first initiates root cause diagnosis and repair strategy matching. For example, if the map shows a computing node marked as a high-vulnerability point with its main performance degradation indicators being a surge in task processing latency and a CPU utilization consistently above 95%, the diagnostic logic will attribute this to the node's instantaneous excessive load. Simultaneously, if another node is marked as a medium-vulnerability point with its main characteristic being an abnormally high memory reclamation frequency, the diagnostic logic may point to a potential memory leak risk. Based on different diagnostic conclusions, the repair process matches and generates a set of serialized, executable repair action instructions from a pre-defined strategy library.
[0025] After matching is complete, the repair process enters the atomic execution phase of the instruction sequence. For high-vulnerability nodes diagnosed as having excessive instantaneous load, the executed repair instruction sequence typically includes the following. First, an instruction is sent to the resource scheduler to suspend the allocation of new computing tasks to that node. Second, the current list of low-vulnerability nodes in the cluster is queried, and one or more optimal target nodes are selected based on their remaining computing resources and data locality. Next, a computing task migration operation is initiated to safely transfer the unfinished computing tasks on the high-vulnerability node, together with their state checkpoints, to the selected target nodes for continued execution. Finally, the original high-vulnerability node is marked as isolated and placed in a resource pool awaiting recycling, pending subsequent infrastructure inspection or restart. For medium-vulnerability nodes diagnosed as having a memory leak risk, the executed instructions may differ. For example, a graceful restart of the specific computing service on that node is attempted first to release the suspect memory; if the same metric deteriorates again shortly after the restart, the repair strategy is automatically escalated, triggering the reconstruction and rescheduling of the entire computing container on that node.
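The diagnosis and strategy matching described in the last two paragraphs can be sketched as a small rule table in Python. The metric keys, action names, and dictionary layout are hypothetical; only the two diagnostic cases and their ordered actions are taken from the text.

```python
# Hypothetical strategy library: diagnosis -> ordered repair actions.
STRATEGY_LIBRARY = {
    "excessive_load": [
        "suspend_new_task_allocation",
        "select_low_vulnerability_targets",
        "migrate_tasks_with_checkpoints",
        "isolate_node",
    ],
    "memory_leak_risk": ["graceful_service_restart"],
}

# Extra actions appended when the same metric deteriorates again after a repair.
ESCALATION = {"memory_leak_risk": ["rebuild_and_reschedule_container"]}


def diagnose(level, metrics):
    """Map a vulnerability level and its dominant symptoms to a diagnosis label."""
    if (level == "high" and metrics.get("latency_surge", False)
            and metrics.get("cpu_utilization", 0.0) > 0.95):
        return "excessive_load"
    if level == "medium" and metrics.get("memory_reclaim_abnormal", False):
        return "memory_leak_risk"
    return "undiagnosed"


def repair_plan(level, metrics, relapsed=False):
    """Return (diagnosis, ordered action list); relapsed=True appends escalation steps."""
    cause = diagnose(level, metrics)
    steps = list(STRATEGY_LIBRARY.get(cause, []))
    if relapsed:
        steps += ESCALATION.get(cause, [])
    return cause, steps


if __name__ == "__main__":
    print(repair_plan("high", {"latency_surge": True, "cpu_utilization": 0.97}))
    print(repair_plan("medium", {"memory_reclaim_abnormal": True}, relapsed=True))
```

Keeping the actions as data rather than code makes each executed sequence easy to write to a repair log step by step.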
[0026] During the execution of each repair instruction, the monitoring system synchronously tracks the execution results of the repair actions and the status feedback of the target node. For example, during task migration, the monitoring system confirms whether data shards are transmitted completely, whether the computation status is successfully loaded, and whether the target task has resumed normal progress on the new node. The start time, execution steps, key parameters, and final results of all repair operations are structured and recorded in the repair log. This log is associated with the corresponding node entries in the system vulnerability assessment map, forming a complete traceability chain.
[0027] After the repair action is completed, the automated process does not end immediately but enters a preset observation period for effect verification and strategy optimization. During this phase, the monitoring system continuously tracks the performance metrics of the repaired nodes (such as isolated nodes) and of the new nodes that accepted the migrated tasks. The verification logic compares key metrics before and after the repair, such as whether computation latency has returned to the normal range and whether memory reclamation frequency has fallen back to the baseline level. If the verification passes, the repair process is marked as successful; if it fails, the case is escalated for manual verification and handling.
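One way to express the before/after comparison is the Python sketch below. The 10% tolerance around the baseline, the metric names, and the status labels are assumptions for this example; the source only states that metrics should return to the normal range or baseline level.

```python
def verify_repair(before, after, baseline, tolerance=0.10):
    """Compare metrics before and after a repair against their baseline levels.

    A metric counts as recovered when its post-repair value is no more than
    `tolerance` (as a fraction) above the baseline. Lower values are assumed better.
    """
    report = {}
    for metric, normal in baseline.items():
        report[metric] = {
            "improvement": before[metric] - after[metric],
            "recovered": after[metric] <= normal * (1 + tolerance),
        }
    passed = all(entry["recovered"] for entry in report.values())
    return ("success" if passed else "manual_review"), report


if __name__ == "__main__":
    status, details = verify_repair(
        before={"latency_ms": 900.0, "reclaims_per_min": 30.0},
        after={"latency_ms": 210.0, "reclaims_per_min": 12.0},
        baseline={"latency_ms": 200.0, "reclaims_per_min": 10.0},
    )
    print(status, details)
```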
[0028] In one specific implementation, the chaos-aware antifragile scheduling preparation process is initiated before the distributed computing task begins. This process first constructs a controlled chaotic perturbation environment through three parallel perturbation operations. The latency perturbation operation generates a series of random latency values based on a Poisson distribution model, with the average interval dynamically adjusted according to the current network baseline latency. These latency values are injected into normally transmitted data packets along the data transmission paths between selected computing nodes, simulating real-world network jitter. Simultaneously, the packet loss perturbation operation pseudo-randomly selects and discards a specific proportion of data packets on critical data transmission links according to a preset probability, simulating data loss caused by network congestion or hardware failure. The memory perturbation operation runs on the target computing node: an independent background process dynamically allocates a meaningless data block of a specified size in the node's user-space memory, writes to it, and holds it for a configurable period, artificially creating a sudden surge in memory pressure.
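The memory perturbation operation can be approximated with a few lines of Python. For simplicity this sketch allocates the block in the calling process rather than in a separate background process, and the 4096-byte page stride is an assumption about the platform; the function name is hypothetical.

```python
import time


def hold_memory_pressure(size_bytes, hold_seconds, page_size=4096):
    """Allocate a padding block, touch every page, hold it, then release it.

    Returns the number of bytes that were held.
    """
    ballast = bytearray(size_bytes)
    # Write one byte per page so the operating system actually commits the memory.
    for offset in range(0, size_bytes, page_size):
        ballast[offset] = 1
    time.sleep(hold_seconds)  # keep the block resident for the configured period
    held = len(ballast)
    del ballast  # drop the reference so the block can be reclaimed
    return held


if __name__ == "__main__":
    print(hold_memory_pressure(size_bytes=16 * 1024 * 1024, hold_seconds=0.5))
```

Running the function in its own process, as the text describes, would keep the pressure isolated from the computing service being observed.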
[0029] While the controlled chaotic perturbation is being executed, a comprehensive monitoring mechanism begins to synchronously collect response data from the entire distributed environment. This monitoring mechanism continuously records key performance indicators for each computing node during the perturbation period, including the increase in the average processing time of a single task compared to the baseline time, the proportion of data packet retransmissions triggered by simulated packet loss, the frequency of memory garbage collection, and the fluctuation range of CPU utilization. These fine-grained indicators can be transmitted in real time to a central analysis unit for aggregation and processing.
[0030] The central analysis unit initiates a vulnerability quantification assessment based on the aggregated indicator data. This assessment employs a weighted scoring algorithm. First, the change values of each performance indicator are normalized to remove the effect of differing units and scales. Then, each normalized value is multiplied by its corresponding dynamic weight coefficient. Finally, all weighted results are summed to give the real-time vulnerability score for each computing node and each data transmission link. The weight coefficients are not fixed but are adjusted according to the characteristics of the statistical analysis task to be executed. To set them, existing priority weighting algorithms commonly used in network quality of service (QoS) management or operating system process scheduling can be referenced, with each performance indicator assigned a weight according to its importance to the business objective. For example, for financial transaction fraud detection tasks with extremely high real-time requirements, the weight of the task processing latency increment is increased significantly. Based on the calculated vulnerability scores of all nodes and links, the analysis unit automatically generates a system vulnerability assessment map, which displays the health status of each component in the cluster as a topology diagram.
[0031] After the vulnerability assessment map is generated, the system automatically classifies the elements in the map according to a preset threshold strategy. This strategy sets two numerical thresholds: a first threshold and a second threshold, with the first threshold being greater than the second. The first and second thresholds can be set with reference to existing multi-level alarm techniques, such as those used in IT infrastructure monitoring or industrial control systems, where multiple progressive thresholds differentiate the severity of faults or performance degradation and enable tiered responses. Nodes and links with vulnerability scores exceeding the first threshold are classified as high-vulnerability points, indicating that they are in a highly unstable state and are unreliable for computational tasks. Nodes and links with vulnerability scores below the first threshold but exceeding the second threshold are classified as medium-vulnerability points, indicating performance degradation that requires constraints. Nodes and links with vulnerability scores below the second threshold are classified as low-vulnerability points and are considered healthy resources. In the map visualization, high, medium, and low vulnerability points are marked with red, yellow, and green indicators, respectively, providing a clear basis for subsequent scheduling decisions. After this preparatory process is completed, the complete vulnerability assessment map serves as a key input that drives the dynamic generation of subsequent self-healing task scheduling strategies.
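A minimal sketch of the two-threshold classification is given below. The default threshold values 0.7 and 0.4 are hypothetical, and the handling of a score exactly equal to a threshold (assigned to the lower level here) is an assumption, since the text does not address boundary values.

```python
from typing import Dict

# Display colors named in the text for the three vulnerability levels.
LEVEL_COLORS = {"high": "red", "medium": "yellow", "low": "green"}


def classify(score: float, first_threshold: float, second_threshold: float) -> str:
    """Return 'high', 'medium', or 'low' for one node or link."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be greater than the second")
    if score > first_threshold:
        return "high"
    if score > second_threshold:
        return "medium"
    # Assumption: scores equal to a threshold fall into the lower level.
    return "low"


def classify_map(
    scores: Dict[str, float],
    first_threshold: float = 0.7,   # hypothetical value
    second_threshold: float = 0.4,  # hypothetical value
) -> Dict[str, str]:
    """Classify every component in the vulnerability assessment map."""
    return {
        name: classify(score, first_threshold, second_threshold)
        for name, score in scores.items()
    }
```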
[0032] In one specific implementation, the process of dynamically generating a self-healing task scheduling strategy takes a system vulnerability assessment map as input. This map clearly identifies high-vulnerability, medium-vulnerability, and low-vulnerability nodes and their topological connections. When generating the strategy, all nodes marked as high-vulnerability nodes in the map are first temporarily excluded from the currently available computing resource pool, retaining only medium-vulnerability and low-vulnerability nodes as candidate computing nodes. Subsequently, based on the resource capacity of the remaining candidate nodes, network link status, and the computational requirements of the statistical analysis tasks to be processed, a multi-objective optimization model is constructed. The objective function of this model aims to maximize the overall cluster's computational throughput, minimize the load imbalance between nodes, and ensure that all tasks are completed before the preset deadline. The calculation of load imbalance pays particular attention to medium-vulnerability nodes to prevent them from evolving into new high-vulnerability nodes due to overload.
[0033] To solve this optimization model, various existing intelligent optimization algorithms can be employed; genetic algorithms and simulated annealing are both viable options. Taking the genetic algorithm as an example, its fitness function must align closely with the objectives above. One specific way to calculate the fitness score is: Fitness = W1 × (normalized estimated total throughput) + W2 × (1 − Load Balancing) + W3 × (predicted on-time completion rate). Here, W1, W2, and W3 are weighting coefficients set dynamically based on task urgency and data importance; the Load Balancing term is a measure of load dispersion, typically the standard deviation or Gini coefficient of the computational load across all candidate nodes, so a smaller value indicates a more even load; the on-time completion rate is predicted by comparing the estimated execution time of each task on its assigned node with the task's deadline. Simulated annealing defines a similar energy function and gradually lowers a temperature parameter, accepting inferior solutions with a certain probability in order to escape local optima, and ultimately finds a feasible scheduling scheme.
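The fitness calculation can be illustrated with the following sketch. It assumes the Gini coefficient as the load dispersion measure and the fraction of tasks whose estimated time does not exceed the deadline as the on-time prediction; the function names and the weight values in the usage note are hypothetical.

```python
from typing import Sequence, Tuple


def gini(loads: Sequence[float]) -> float:
    """Gini coefficient of non-negative loads: 0 means perfectly even."""
    n = len(loads)
    total = sum(loads)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in loads for b in loads)
    return diff_sum / (2 * n * total)


def on_time_rate(tasks: Sequence[Tuple[float, float]]) -> float:
    """tasks holds (estimated_execution_time, deadline) pairs."""
    if not tasks:
        return 1.0
    return sum(1 for est, deadline in tasks if est <= deadline) / len(tasks)


def fitness(
    throughput_norm: float,
    loads: Sequence[float],
    tasks: Sequence[Tuple[float, float]],
    weights: Tuple[float, float, float],
) -> float:
    """Fitness = W1 * throughput + W2 * (1 - dispersion) + W3 * on-time rate."""
    w1, w2, w3 = weights
    return w1 * throughput_norm + w2 * (1.0 - gini(loads)) + w3 * on_time_rate(tasks)
```

For example, `fitness(0.8, [5, 5, 5, 5], [(10, 12), (20, 15)], (0.5, 0.3, 0.2))` combines a normalized throughput of 0.8, a perfectly even load, and one of two tasks meeting its deadline. A genetic algorithm would evaluate this function for each candidate scheduling scheme in the population.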
[0034] After obtaining the optimized scheduling scheme, the standardized dataset needs to be split into several matching data fragments based on the task allocation ratio and resource constraints for each candidate node in the scheme. The core requirement for the splitting process is that the size of each data fragment should be proportional to the processing capacity of the target node, while also considering data locality to minimize network transmission. A specific splitting algorithm can be described as follows: First, calculate the theoretical proportion P_i of the data volume that the node should process based on the task allocation ratio for each node in the scheduling scheme. Then, fine-tune P_i by combining the actual available I / O bandwidth and memory resources of each node to obtain the final data allocation weight W_i. Finally, divide the original dataset into consecutive blocks sequentially or randomly according to W_i, or adopt a partitioning strategy based on key field hashing to ensure that the size of each data fragment matches W_i, and that data from the same logical group is allocated to the same node as much as possible. For example, in a traffic flow analysis scenario, if the scheme decides to allocate 30% of the vehicle trajectory data to node A, the splitting logic will ensure that trajectory records belonging to the same time period or the same road segment are concentrated as much as possible, and split the data into fragments of the corresponding amount proportionally.
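The splitting steps above can be sketched as follows, using the key-field hashing variant. The multiplicative capacity factor used to fine-tune P_i into W_i, the SHA-256-based hash, and all function names are assumptions made for illustration; the text does not fix a particular adjustment rule or hash.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate
from typing import Dict, List


def final_weights(
    proportions: Dict[str, float], capacity_factors: Dict[str, float]
) -> Dict[str, float]:
    """Adjust each P_i by a capacity factor and renormalize to obtain W_i."""
    raw = {n: p * capacity_factors.get(n, 1.0) for n, p in proportions.items()}
    total = sum(raw.values())
    return {n: v / total for n, v in raw.items()}


def key_to_unit_interval(key: str) -> float:
    """Map a key to a stable pseudo-random point in [0, 1]."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def split_by_key(
    records: List[dict], key_field: str, weights: Dict[str, float]
) -> Dict[str, List[dict]]:
    """Send all records sharing a key to one node, in proportions close to W_i."""
    nodes = list(weights)
    bounds = list(accumulate(weights[n] for n in nodes))  # cumulative W_i
    shards: Dict[str, List[dict]] = {n: [] for n in nodes}
    for record in records:
        u = key_to_unit_interval(str(record[key_field]))
        # Clamp guards against floating-point rounding at the upper bound.
        idx = min(bisect_right(bounds, u), len(nodes) - 1)
        shards[nodes[idx]].append(record)
    return shards
```

Because assignment depends only on the key, records from the same road segment always land on the same node, while shard sizes approach W_i only when there are many distinct keys of similar size.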
[0035] Based on the task descriptions and node mappings explicitly defined in the scheduling scheme, the specific operations for task and data distribution are initiated. Each data shard, along with its corresponding statistical analysis model configuration parameters and initialization instructions, is packaged into an independently executable work unit. These work units are distributed in parallel to their respective designated candidate computing nodes via efficient remote procedure calls or message queues. Throughout the entire task execution cycle, changes in resource utilization, task progress, and vulnerability indicators of each node are continuously monitored. Once the actual load of a vulnerable point is detected to be continuously approaching its preset upper limit constraint, or new performance bottlenecks are discovered, the scheduling strategy is immediately re-evaluated and iteratively optimized, generating updated distribution instructions and dynamically reallocating unfinished tasks, thereby achieving self-adaptation and self-healing of the entire computing process.
[0036] In one specific implementation, after all computing nodes complete the statistical analysis of their assigned data shards, the intermediate results are collected and integrated for final consolidation. This process first requires selecting, according to the format of the node outputs, one of two mainstream distributed aggregation paradigms. If the outputs of each node are intermediate statistics organized as key-value pairs (for example, in traffic flow analysis, each node outputs a list with "segment ID-time period" as the key and "traffic flow, average speed" tuples as values), then a key-value pair-based reduction aggregation operation is used. This operation can be implemented through the corresponding mechanisms of the underlying distributed computing framework, such as calling the `reduceByKey` or `aggregateByKey` operator in the Apache Spark environment and specifying a user-defined function to accumulate and average values for the same segment and time period, thereby merging the local statistical results output by all nodes into global traffic and speed indicators for the entire network. Conversely, if the nodes output well-structured row-column data, such as data rows containing fields like "segment ID", "timestamp", and "congestion level", then data frame-based structured query aggregation is more efficient. Spark SQL's `groupBy` operation can be used to group by segment and time window, and built-in aggregation functions such as `avg`, `max`, and `count` can be used to directly generate a summarized statistical dataset.
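The key-value reduction can be illustrated in plain Python, without a cluster, as in the sketch below. It mirrors the semantics of a reduce-by-key operation; the flow-weighted averaging of speeds and all function names are assumptions, since the text only says that values are accumulated and averaged.

```python
from collections import defaultdict
from functools import reduce
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]          # (segment ID, time period)
Value = Tuple[float, float]    # (traffic flow, average speed)


def to_accumulator(value: Value) -> Tuple[float, float]:
    """Convert (flow, avg_speed) into (flow, flow * avg_speed) so merging is associative."""
    flow, avg_speed = value
    return (flow, flow * avg_speed)


def merge(a: Tuple[float, float], b: Tuple[float, float]) -> Tuple[float, float]:
    """Associative, commutative combiner of two accumulators."""
    return (a[0] + b[0], a[1] + b[1])


def reduce_by_key(node_outputs: Iterable[List[Tuple[Key, Value]]]) -> Dict[Key, Value]:
    """Merge per-node partial results into global (total flow, weighted average speed)."""
    grouped = defaultdict(list)
    for output in node_outputs:
        for key, value in output:
            grouped[key].append(to_accumulator(value))
    result = {}
    for key, accumulators in grouped.items():
        flow, weighted_speed = reduce(merge, accumulators)
        result[key] = (flow, weighted_speed / flow if flow else 0.0)
    return result
```

In a Spark job, the same `merge` function could be passed to `reduceByKey` after mapping each value with `to_accumulator`, with the final division applied afterward.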
[0037] After selecting the aggregation operation and completing the global calculation, a set of machine-readable, typically discrete, numerical or structured data records is obtained. Next, this raw data needs to be transformed into a standardized report that is easy for humans to understand and use for decision-making. This transformation process follows a predefined report template, which specifies the report's chapter structure, core indicators to be presented, table format, and chart types. For example, when generating a daily intelligent transportation report, the template would require the presentation of sections such as "Overall Network Traffic Overview," "Key Area Congestion Ranking," and "Abnormal Event Statistics." The system extracts the corresponding fields and values from the aggregated global dataset and automatically fills them in according to the template's format requirements, generating a preliminary report document containing a text summary and key data tables.
[0038] While generating structured report documents, the system simultaneously calls the visualization engine to generate corresponding statistical charts for core indicators that require intuitive display of trends, comparisons, or distributions. The visualization engine automatically matches and renders the most suitable chart based on the type of data indicator and the report's intent. For example, to show the average speed change trend across the entire road network at different times, the engine will draw a line chart; to compare the daily average number of traffic jams in different administrative regions, a bar chart will be generated; and for a heat map of the geographical distribution of abnormal events in the road network, the engine may combine it with geographic information system components for rendering. All generated charts are automatically inserted into designated locations in the report document, integrated with text and tables to form a final statistical analysis report that is rich in both text and graphics.
[0039] In one specific implementation, the process of generating tiered risk warning signals based on the generated statistical analysis report begins with parsing the structured content of the report and extracting key data fields. This process first identifies the specific data items to be monitored in the report and their data types based on a predefined rule configuration table. Data types are mainly divided into numerical continuous variables and categorical discrete variables. Taking a statistical analysis report for intelligent transportation as an example, a typical representative of numerical variables is the average congestion level of the entire road network, while categorical variables may include frequently occurring event type labels, such as "accident" or "control". The system automatically locates and extracts the current period values of these target fields from the metadata accompanying the report's text summary, data tables, or charts.
[0040] After extracting the target data, the system enters the threshold matching and rule application phase. To this end, the system maintains a dynamic threshold rule library, which presets multiple levels of thresholds for each type of monitoring data. For numerical variables, threshold rules are typically set based on statistical process control principles. For example, a first-level threshold might be set as "the historical mean plus three standard deviations," a second-level threshold as "the mean plus two standard deviations," and a third-level threshold as "the mean plus one standard deviation." For categorical variables, threshold rules might be based on their frequency of occurrence. For example, an "accident" event exceeding 15% of total events triggers a first-level warning, while an event between 10% and 15% triggers a second-level warning. Before applying the thresholds, the system first calls the historical baseline corresponding to the current data. This baseline may be dynamically adjusted based on season, day of the week, or special dates to ensure that the thresholds are consistent with the current context.
[0041] After threshold matching is completed, the system executes its core risk level determination logic. This logic compares the current value of each monitored data item with its corresponding multi-level thresholds. Taking the average road network congestion index as an example, the system first determines whether it exceeds the first-level threshold (historical mean + 3 standard deviations). If it does, a first-level risk warning signal is immediately generated for that indicator, and the specific threshold and data value are recorded. If it does not exceed the threshold, the system continues to determine whether it exceeds the second-level threshold, and so on. The determination logic is similar for categorical variables, such as the proportion of "accident" events. All determination processes are executed automatically, generating a structured record for each triggered warning signal that includes elements such as a timestamp, data item, trigger value, threshold, and risk level.
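The threshold rules and the level determination for a numerical variable can be sketched as follows. The use of the population standard deviation, the field names of the warning record, and the function names are assumptions for illustration.

```python
import statistics
from datetime import datetime, timezone
from typing import Dict, Optional, Sequence


def tiered_thresholds(history: Sequence[float]) -> Dict[int, float]:
    """Level 1, 2, 3 thresholds at mean + 3, 2, 1 standard deviations."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)  # assumption: population standard deviation
    return {1: mean + 3 * std, 2: mean + 2 * std, 3: mean + std}


def risk_level(value: float, thresholds: Dict[int, float]) -> Optional[int]:
    """Check the most severe level first; return None if no threshold is exceeded."""
    for level in (1, 2, 3):
        if value > thresholds[level]:
            return level
    return None


def make_warning(item: str, value: float, thresholds: Dict[int, float]) -> Optional[dict]:
    """Build the structured record for a triggered warning signal."""
    level = risk_level(value, thresholds)
    if level is None:
        return None
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_item": item,
        "trigger_value": value,
        "threshold": thresholds[level],
        "risk_level": level,
    }
```

The `history` passed to `tiered_thresholds` would be the baseline selected for the current context, for example the same weekday in previous weeks.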
[0042] According to preset distribution rules, different levels of warning signals are pushed to the terminals of different responsible persons through an integrated message gateway. Level 1 warnings may trigger SMS, in-app push notifications, and telephone notifications simultaneously, while Level 3 warnings may be sent only through in-app notifications. The triggering, distribution, and confirmation status of all warnings are fully recorded.
[0043] The above algorithms and formulas are computed on dimensionless numerical values. They were obtained by software simulation on a large amount of collected data so as to approximate real-world conditions as closely as possible. The preset parameters are set by those skilled in the art according to the actual situation.
[0044] It should be understood that in the various embodiments of this application, the order of the above-mentioned processes does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0045] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0046] Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the devices and units described above correspond to the processes in the foregoing method embodiments, to which reference may be made, and are not repeated here.
[0047] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A multi-source data efficient statistical analysis method based on a distributed parallel architecture, characterized in that it includes the following steps: heterogeneous raw data is collected from multiple data interfaces and preprocessed to generate a standardized dataset for use by a pre-defined statistical analysis model; before the standardized dataset is distributed to a distributed computing node cluster for parallel statistical modeling and analysis, chaos-aware antifragile scheduling preparation is performed to dynamically generate a self-healing task scheduling strategy; based on the generated self-healing task scheduling strategy, the standardized dataset and statistical analysis model tasks are dynamically allocated to the selected computing node cluster; the distributed parallel computing framework is used to synchronously execute the parallel operation of the statistical analysis model on each node, the node status is continuously monitored during the calculation process, and the task allocation and data flow are adjusted in real time based on the updated vulnerability assessment results; the output results of all computing nodes are aggregated to generate a final statistical analysis report and visualization charts; and, based on the data in the statistical analysis report, combined with preset data types and threshold rules, corresponding graded risk warning signals are generated.
2. The method of claim 1, wherein preprocessing refers to standardized processing operations that include data cleaning, format conversion, and feature dimensionality reduction.
3. The method of claim 2, wherein the preparation process includes: injecting controlled micro-chaotic perturbation factors, including random latency, selective packet loss, and simulated memory pressure, into computing nodes and data transmission paths; calculating and generating a system vulnerability assessment map in real time based on the response performance of each node and link to the chaotic perturbation factors, and identifying three types of system vulnerabilities: high, medium, and low; and dynamically generating a self-healing task scheduling strategy based on the vulnerability assessment map.
4. The efficient statistical analysis method for multi-source data based on a distributed parallel architecture according to claim 3, characterized in that, Based on the system vulnerability assessment map generated during the chaos perception preparation process and parallel computing process, the identified high and medium-level system vulnerabilities are located and recorded, triggering a preset automated repair process.
5. The efficient statistical analysis method for multi-source data based on a distributed parallel architecture according to claim 4, characterized in that, the controlled micro-chaotic perturbation factors include a delay disturbance factor, a packet loss disturbance factor, and a memory disturbance factor. The delay disturbance factor simulates network jitter by generating random delay values according to a Poisson distribution model and dynamically injecting these delay values during data packet transmission. The packet loss disturbance factor simulates network packet loss by randomly dropping a specified proportion of data packets along the data transmission path according to a preset reliable packet loss rate. The memory disturbance factor simulates sudden memory pressure by dynamically writing and maintaining padding data blocks of a specified size in the user-space memory of the target computing node.
6. The efficient statistical analysis method for multi-source data based on a distributed parallel architecture according to claim 5, characterized in that, Real-time calculation and generation of system vulnerability assessment maps, specifically including: Monitor and record the performance index changes of each computing node and data transmission link after injecting controlled micro-chaotic perturbation factors. The performance indexes include task processing latency increment, data packet retransmission rate, memory reclamation frequency, and CPU utilization fluctuation. Based on changes in performance metrics, a weighted scoring method is used to calculate the vulnerability score of each computing node and data transmission link in real time. The weighted scoring method is to normalize the change values of each performance metric, multiply them by the corresponding weight coefficients, and sum them up. The weight coefficients are preset according to the real-time requirements and data consistency requirements of the statistical analysis task. Based on the vulnerability scores of each computing node and data transmission link, a system vulnerability assessment map is generated with computing nodes and network topology as nodes.
7. The efficient statistical analysis method for multi-source data based on a distributed parallel architecture according to claim 6, characterized in that, after the system vulnerability assessment map is generated, the nodes and links are classified into vulnerability levels based on the calculated vulnerability scores; a first threshold and a second threshold are set, where the first threshold is greater than the second threshold; nodes and links whose vulnerability scores exceed the first threshold are classified as high-vulnerability points; nodes and links whose vulnerability scores are below the first threshold but above the second threshold are classified as medium-vulnerability points; nodes and links whose vulnerability scores are below the second threshold are classified as low-vulnerability points; and in the system vulnerability assessment map, high-vulnerability points, medium-vulnerability points, and low-vulnerability points are marked with different labels.
8. The efficient statistical analysis method for multi-source data based on a distributed parallel architecture according to claim 7, characterized in that, Dynamically generate a self-healing task scheduling strategy, specifically as follows: Based on the topological locations and interconnections of high-vulnerability, medium-vulnerability, and low-vulnerability points in the graph, a multi-objective task scheduling scheme is generated. The scheme aims to maximize the overall system's computational throughput and minimize the load on high-vulnerability points, while ensuring that all statistical analysis model tasks are completed within the preset deadline. When generating scheduling strategies, the paths of data flows to or originating from high-vulnerability points are replanned to direct them to medium-vulnerability and low-vulnerability points; for computing tasks assigned to medium-vulnerability points, resource usage upper limit constraints and checkpoint setting instructions are added. The self-healing task scheduling strategy is analyzed. Based on the requirements regarding the distribution of computing resources, load constraints, and the need to avoid high vulnerabilities, the standardized dataset is split into several data fragments that match the requirements in terms of size and quantity. Based on the path planning and node load constraints specified in the self-healing task scheduling strategy, each data shard and its corresponding statistical analysis model task descriptor are distributed to selected computing nodes that are not highly vulnerable.
9. The efficient statistical analysis method for multi-source data based on a distributed parallel architecture according to claim 8, characterized in that, The outputs of all computing nodes are aggregated to generate a final statistical analysis report and visualization charts, including: The aggregate operators built into the distributed computing framework are used to perform a global reduction of the output results of all computing nodes. The aggregate operators are selected from either the reduction aggregation operation based on key-value pairs or the structured query aggregation operation based on data frames. Global statistical indicators are obtained through reduction aggregation based on key-value pairs, and structured query aggregation based on data frames generates a structured result dataset containing summary statistics. The global statistical indicators or structured result datasets are formatted according to a preset report template, and corresponding statistical charts are generated to form the final statistical analysis report.