Campus Report Generation Optimization System Based on Data Hierarchical Caching and Parallel Computing

By constructing a three-level hierarchical caching architecture and optimizing resource allocation through parallel computing, the long-tail bottleneck problem caused by data skew in the campus report generation system was solved, achieving efficient data access and utilization of computing resources, and improving the stability and efficiency of the system.

CN121960419B · Active · Publication Date: 2026-06-30 · HUNAN HENGWEI COMMUNICATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUNAN HENGWEI COMMUNICATION TECHNOLOGY CO LTD
Filing Date
2026-03-30
Publication Date
2026-06-30

Abstract

This invention discloses a campus report generation optimization system based on hierarchical data caching and parallel computing, belonging to the field of big data processing technology. The system includes: a data acquisition and preprocessing module for acquiring raw report data and performing normalization and partitioning; a hierarchical data caching and storage module for constructing a three-level hierarchical caching architecture and evaluating cache heat; a parallel computing and dynamic load balancing module for optimizing resource allocation and adjusting load balancing in real time; and a dynamic skew processing and adjustment module for using hierarchical sampling and identifying skew keys, performing integration and verification to generate reports and storing them in the cache layer. This invention effectively improves the efficiency and stability of campus report generation, solving the problem of data skew in distributed computing causing some tasks to become bottlenecks and slowing down overall report generation.

Description

Technical Field

[0001] This invention relates to the field of big data processing technology, specifically to a campus report generation optimization system based on data hierarchical caching and parallel computing.

Background Technology

[0002] With the deepening of campus informatization, the amount of data generated by various business systems is growing exponentially. Campus report generation faces challenges such as massive data scale, high computational complexity, and increasing real-time requirements. Existing technologies typically use distributed computing frameworks for parallel processing when handling large-scale data aggregation. However, due to the inherent uneven distribution of dimensions in campus data (such as differences in class sizes and popular business data sets), data skew is a prominent issue. Some Reduce tasks process far more data than others, forming a long-tail bottleneck that slows down the overall report generation progress.

[0003] For example, the invention patent with announcement number CN118134715B discloses an integrated intelligent campus enrollment and fee management system, including: an intelligent enrollment management module, a one-stop payment management module, a dynamic financial management module, a data analysis and decision management module, and a security management module. This invention achieves intelligent management of the entire process from enrollment information release, registration review, fee payment to financial statement generation through intelligent integration, and performs in-depth data mining and real-time analysis. The one-stop payment management module supports diversified payment methods, provides installment payment options, ensures transparency of fee details, and can automatically generate electronic invoices. The dynamic financial management module realizes real-time payment notifications, early warning notifications, and financial statement generation, enabling schools to monitor their financial situation in real time and make timely decisions. Through blockchain evidence storage technology and a strict permission log auditing mechanism, efficiency and security are improved.

[0004] For example, the invention patent with announcement number CN117474539B discloses an intelligent management method for campus card data, including: collecting current consumption data, historical consumption data, and current day consumption data; obtaining the consumption habit interval of the current consumption data to obtain consumption time-consistent data; obtaining the habit coefficient of the current consumption data in the consumption habit interval and iteratively updating the initial interval parameter and the consumption habit interval of the current consumption data to obtain the optimal interval parameter and the optimal consumption habit interval of the current consumption data; obtaining the consumption correction factor of the current consumption data based on the optimal interval parameter; obtaining the balance correction factor of the current consumption data to obtain the correction coefficient of the current consumption data; obtaining the correction outlier factor of the current consumption data to perform anomaly judgment on the current consumption data; the present invention aims to accurately judge the anomalies of the current consumption data and protect users' property.

[0005] In existing technologies, current systems lack tiered and popularity-aware cache utilization, leading to high latency in accessing hot data and wasted computing resources. Furthermore, static task scheduling strategies struggle to adapt to dynamic changes in node resources, easily causing uneven resource allocation. During group aggregation, some Reduce tasks process significantly more data than others, becoming long tails and slowing down the overall job completion time. This imbalance is particularly pronounced in student consumption statistics, where related operations on popular classes exacerbate the problem.

[0006] Therefore, in order to address the above issues, there is an urgent need for a campus report generation optimization system based on data hierarchical caching and parallel computing.

Summary of the Invention

[0007] Technical problems to be solved

[0008] To address the shortcomings of existing technologies, this invention provides a campus report generation optimization system based on data hierarchical caching and parallel computing, which solves the problem that data skew in distributed computing causes some tasks to become bottlenecks, slowing down the overall report generation.

[0009] Technical solution

[0010] To achieve the above objectives, this invention provides the following technical solution: a campus report generation optimization system based on hierarchical data caching and parallel computing, comprising a data acquisition and preprocessing module for acquiring raw report data and performing time-axis relocation, cleaning, normalization, and partitioning / sharding on the raw report data; a hierarchical data caching and storage module for constructing a three-level hierarchical caching architecture based on the preprocessed raw report data, evaluating cache heat, and performing cache data hierarchy degradation processing; a parallel computing and dynamic load balancing module for optimizing resource allocation based on the execution status and resource consumption of computing tasks, adjusting load balancing in real time, and constructing a fault tolerance mechanism and a task retry mechanism; and a dynamic skew processing and adjustment module for using hierarchical sampling and identifying skew keys, performing hierarchical scattering aggregation and operator processing to obtain aggregation results, performing integration and verification to generate reports, and storing them in the cache layer.

[0011] Furthermore, the specific process of acquiring the raw report data and performing timeline relocation, cleaning, normalization, and partitioning on it is as follows. All detailed data and basic dimension data generated across the campus reporting process are collected to obtain the raw report data, which includes multi-dimensional time markers, aggregation dimension markers, terminal markers, associated entity markers, logically related indicators, and status markers. An NTP time server is introduced to synchronize the clocks of each data source, and a sliding window mean filtering algorithm is used to calculate the average deviation of each dimension's time markers. The raw report data is then timeline relocated: the time markers of each dimension are extracted, their average time deviation is calculated, and the calibrated standardized business time is used as the unique statistical time identifier. The raw report data is cleaned by matching it against a rule set that includes test identifiers, abnormal status codes, and key field integrity checks, so that test data, abnormal data, and data missing key fields are filtered out as invalid. The dimension identifiers, terminal identifiers, and associated entity identifiers are normalized using a unified coding standard and a linear mapping method. Based on the report aggregation dimension, a consistent hash algorithm is used to perform preliminary partitioning and sharding of the normalized raw report data, and the data volume and access frequency of each shard are counted to mark hot data shards and potentially skewed shards.
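The sliding window mean filter described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `SlidingMeanDeviation`, the function `calibrate`, and the sign convention (deviation taken as source clock minus reference clock, so it is subtracted) are assumptions introduced here.

```python
from collections import deque


class SlidingMeanDeviation:
    """Keeps the most recent `window` clock-deviation samples (hypothetical helper)."""

    def __init__(self, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def add(self, deviation_s: float) -> float:
        """Record one deviation sample and return the arithmetic mean of the window."""
        self.samples.append(deviation_s)
        return sum(self.samples) / len(self.samples)


def calibrate(timestamp_s: float, mean_deviation_s: float) -> float:
    """Assumed convention: deviation = source clock - reference clock."""
    return timestamp_s - mean_deviation_s
```

With a window of 3, feeding deviations 1, 2, 3, 4 yields means 1.0, 1.5, 2.0, and 3.0, because the oldest sample falls out once the window is full.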

[0012] Furthermore, the specific process of constructing a three-level hierarchical caching architecture based on the preprocessed report raw data is as follows: Based on the sharding characteristics and access frequency of the preprocessed report raw data, a three-level hierarchical caching architecture is constructed, consisting of a raw detail caching layer, an intermediate aggregation caching layer, and a report result caching layer. The raw detail caching layer uses node-local memory-level caching to store hot data shards. The intermediate aggregation caching layer uses distributed SSD-level caching to store preliminary aggregation data of various dimensions generated during parallel computing. The report result caching layer uses a distributed cache cluster to store the generated standardized report data. A dual mechanism of data version number and readiness flag is used to manage cache updates. Each time a report is generated or data is calculated, the data in the cache is verified according to the update timestamp and readiness flag. The cached data is considered valid only when the data version number is consistent and the readiness flag is in a completed state. When the underlying data source changes, the new version of the data replaces the old version entirely through an atomic switching mechanism at the sharding granularity.

[0013] Furthermore, the specific process for cache popularity evaluation is as follows: The actual access frequency of data per unit time is obtained by monitoring and statistically analyzing the number of access requests; the maximum effective update cycle is obtained by statistically analyzing the update frequency of cached data; the time interval between the current data and the last update is obtained by comparing the timestamp of the current data with that of the last update; the actual access frequency is multiplied by the maximum effective update cycle to obtain the actual popularity; the time interval between the data and the last update is added to the maximum effective update cycle to obtain the cache popularity; the actual popularity is divided by the cache popularity, and an exponential operation is performed using a popularity weighting index to obtain the cache popularity evaluation value; a preloading mechanism is used for the original detailed cache layer in real time. The system compares the cache heat assessment value with the loading threshold. When the cache heat assessment value exceeds the loading threshold, the data is considered hot data and is loaded into the cache in advance. Before loading, the readiness flag of the data is checked. If the data is not fully ready, the loading request is intercepted, and the old version of the data is returned or the data is loaded after it is ready. When the cache heat assessment value does not reach the loading threshold, preloading is not performed, and the data is loaded on demand. For data that is not in the original detailed cache layer, an on-demand loading mechanism is adopted, and cache data is only loaded when a calculation or query request is triggered. The system determines whether data needs to be loaded from the cache based on the real-time calculated cache heat assessment value to ensure that cache resources are not occupied.
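The popularity calculation and loading decision in this paragraph can be sketched as below. The function names, parameter names, and the returned action labels are hypothetical; the arithmetic follows the steps stated in the text.

```python
def cache_heat(access_freq: float, max_update_cycle: float,
               since_update: float, alpha: float) -> float:
    """Cache popularity evaluation value as described in the text.

    actual popularity = access_freq * max_update_cycle
    cache popularity  = since_update + max_update_cycle
    evaluation value  = (actual / cache) ** alpha
    """
    if max_update_cycle <= 0 or since_update < 0:
        raise ValueError("max_update_cycle must be > 0 and since_update >= 0")
    actual = access_freq * max_update_cycle
    cache = since_update + max_update_cycle
    return (actual / cache) ** alpha


def load_action(heat: float, load_threshold: float, ready: bool) -> str:
    """Decide how an entry is loaded (labels are illustrative only)."""
    if heat <= load_threshold:
        return "on_demand"      # not hot: load only when a request arrives
    if not ready:
        return "defer"          # hot but not ready: serve old version or wait
    return "preload"            # hot and ready: load in advance
```

For example, `cache_heat(10, 60, 60, 1)` gives 5.0: the same data accessed 10 times per unit time counts half as hot once a full update cycle has elapsed since its last update.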

[0014] Furthermore, the specific process for downgrading cached data layers is as follows: When the popularity assessment value of cached data is lower than the eviction threshold, the data is downgraded from the original detail cache layer to the intermediate aggregation cache layer, or from the intermediate aggregation cache layer to the report result cache layer, or the data is directly cleared; the data consistency of each cache layer is monitored in real time. By comparing the change records of cached data with those of the underlying data source, when the underlying data source is updated, the priority of the change record is defined according to the impact of the changed field on the statistical scope of the report. The priority of the cache popularity assessment value and the data source change record are compared in real time. When the popularity assessment value exceeds the loading threshold and the data changes, the incremental update of the corresponding cache layer will be triggered. An incremental update mechanism is adopted to update only the modified part of the data. During the incremental update process, the modified part of the data is marked as temporary. After all updates are completed, the new data is made effective through atomic switching.
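A minimal sketch of the downgrade and incremental-update triggers follows. The tier names and both function names are hypothetical, and the text does not say when data is cleared directly rather than moved down; this sketch assumes clearing happens only when the entry is already in the last tier.

```python
# Tier order taken from the text; the string labels are illustrative.
TIERS = ["raw_detail", "intermediate_aggregation", "report_result"]


def downgrade(tier: str, heat: float, evict_threshold: float):
    """Return the tier an entry should live in, or None if it should be cleared."""
    if heat >= evict_threshold:
        return tier                      # still popular enough: stay put
    idx = TIERS.index(tier)
    if idx + 1 < len(TIERS):
        return TIERS[idx + 1]            # move one level down
    return None                          # assumption: last tier -> clear


def needs_incremental_update(heat: float, load_threshold: float,
                             source_changed: bool) -> bool:
    """Incremental update fires only for hot data whose source has changed."""
    return source_changed and heat > load_threshold
```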

[0015] Furthermore, the specific process of optimizing resource allocation based on the execution status and resource consumption of computing tasks is as follows: The total resource quantity of the computing node is obtained through its hardware configuration; the task execution progress of the computing node is obtained by real-time monitoring of its task execution status; the total time consumed by the computing task is obtained by monitoring its start and end times; and the remaining resources of the computing node are obtained by real-time monitoring of its remaining resources. A resource allocation evaluation model is then constructed. The sum of the task execution progress of the computing node plus 1 is used to perform an exponential operation on the progress weighting coefficient to obtain a progress correlation value. The total time consumed by the computing task is divided by the remaining resources of the computing node, and an exponential operation on the resource priority coefficient is performed to obtain a resource correlation value. The progress correlation value and the resource correlation value are multiplied to obtain the denominator. The total resource quantity of the computing node is divided by the denominator to obtain the optimized resource allocation amount.
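The resource allocation evaluation model reads, step by step, as the following sketch. Parameter names are hypothetical, and the input guards are an assumption added so the division is always defined.

```python
def optimized_allocation(total_resources: float, progress: float,
                         elapsed: float, remaining: float,
                         progress_coef: float, resource_coef: float) -> float:
    """Optimized resource allocation amount for one computing node.

    progress term = (progress + 1) ** progress_coef
    resource term = (elapsed / remaining) ** resource_coef
    result        = total_resources / (progress term * resource term)
    """
    if remaining <= 0 or elapsed <= 0:
        raise ValueError("elapsed and remaining must be positive")
    progress_term = (progress + 1) ** progress_coef
    resource_term = (elapsed / remaining) ** resource_coef
    return total_resources / (progress_term * resource_term)
```

With both coefficients set to 1, a node with 100 total units, progress 1.0, 20 time units consumed, and 10 units remaining scores 100 / (2 × 2) = 25.0.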

[0016] Furthermore, the specific process of real-time load balancing adjustment is as follows: The optimized resource allocation of each node is compared with the ratio threshold in real time. When the optimized resource allocation of a node is lower than the ratio threshold, tasks currently being executed on that node whose execution progress is lower than the progress threshold or whose resource consumption is higher than the consumption threshold, and which have checkpoints or recalculation capabilities, are marked as tasks to be migrated. These tasks are then assigned to computing nodes with optimized resource allocations higher than the ratio threshold for execution. For newly submitted computing tasks, initial allocation is performed based on the current optimized resource allocation of each node, prioritizing the assignment of new tasks to the computing node with the highest optimized resource allocation. The optimized resource allocation is periodically updated during task execution.
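The migration rule above might be expressed as in this sketch. The `Task` type and `plan_migration` function are hypothetical. The text only requires the target node to be above the ratio threshold; sending every migrated task to the single highest-scoring node is a simplification assumed here.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    progress: float      # execution progress of the task
    consumption: float   # resource consumption of the task
    resumable: bool      # has a checkpoint or can be recomputed


def plan_migration(node_alloc, node_tasks, ratio_threshold,
                   progress_threshold, consumption_threshold):
    """Map task_id -> target node for tasks that should leave weak nodes."""
    targets = [n for n, a in node_alloc.items() if a > ratio_threshold]
    if not targets:
        return {}
    best = max(targets, key=lambda n: node_alloc[n])  # assumed target choice
    plan = {}
    for node, alloc in node_alloc.items():
        if alloc >= ratio_threshold:
            continue  # node is not below the threshold: nothing to migrate
        for t in node_tasks.get(node, []):
            slow_or_heavy = (t.progress < progress_threshold
                             or t.consumption > consumption_threshold)
            if t.resumable and slow_or_heavy:
                plan[t.task_id] = best
    return plan
```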

[0017] Furthermore, the specific process of constructing the fault tolerance mechanism and task retry mechanism is as follows: Monitor the execution status of computing tasks. When it is detected that a computing task fails due to a failure of the computing node or the remaining resources of the node are lower than the resource threshold, the fault tolerance mechanism is triggered. The failed computing task is marked as a task to be retried. Based on the optimized resource allocation of each available computing node, the task to be retried is assigned to the available computing node with the highest optimized resource allocation for re-execution. When retrying a failed computing task, check whether there is complete intermediate data in the cache. Complete intermediate data must meet the following requirements: the version number is consistent with the version number recorded in the task execution context, the ready flag is marked as completed, and it contains all the fragment keys required by the task. If it exists, the data is loaded directly and execution continues from the interruption point. Otherwise, the expired or incomplete data in the cache is discarded, and the entire computing task is re-executed.
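The retry decision can be sketched as follows. The dictionary layout of the cached entry (`version`, `ready`, `fragments`) and both function names are assumptions made for illustration; the three completeness conditions come from the text.

```python
def has_complete_intermediate(cached, ctx_version, required_keys) -> bool:
    """True only if the cached intermediate data meets all three conditions."""
    if cached is None:
        return False
    return (cached["version"] == ctx_version          # version matches task context
            and cached["ready"]                       # ready flag is completed
            and set(required_keys) <= set(cached["fragments"]))  # all keys present


def plan_retry(node_alloc, cached, ctx_version, required_keys):
    """Pick the node with the highest optimized allocation and the retry mode."""
    node = max(node_alloc, key=node_alloc.get)
    if has_complete_intermediate(cached, ctx_version, required_keys):
        return node, "resume"       # continue from the interruption point
    return node, "recompute"        # discard stale data and rerun the whole task
```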

[0018] Furthermore, the specific process of using hierarchical sampling to identify skew keys and performing hierarchical scattering aggregation and operator processing to obtain the aggregation result is as follows. Based on the aggregated data in the intermediate aggregation cache layer, hierarchical sampling is performed at the data shard level: within each shard, samples are extracted for each aggregation dimension feature key according to the sampling ratio. The frequency of each feature key in the samples is counted, and the actual data volume of each feature key is estimated from the sampling ratio. The ratio of each feature key's actual data volume to the average data volume of all aggregation dimension feature keys is calculated and recorded as the data skew judgment value. Feature keys whose data skew judgment value is greater than the skew threshold are identified as skewed and used to construct a skewed feature key dataset, to which an adaptive partitioning strategy is applied. Skewed keys with a judgment value greater than the skew threshold and less than or equal to the second-order skew threshold are treated as ordinary skewed feature keys: a random prefix is added to scatter them into M temporary computing partitions, each temporary partition independently completes local aggregation, and the local results are then merged to obtain the aggregated result for each feature key. Skewed keys with a judgment value greater than the second-order skew threshold are treated as second-order feature keys: a dedicated aggregation operator is started for them separately, with independent computing resources and dedicated cache shards allocated.
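The skew classification and random-prefix aggregation can be illustrated with the single-process sketch below. It simplifies the per-shard stratified sampling to one flat sample, uses a sum as the aggregate, and does not model the dedicated operator for second-order keys; all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def classify_keys(sample_keys, sampling_ratio, skew_threshold, second_threshold):
    """Split keys into ordinary-skewed and second-order sets by judgment value."""
    counts = Counter(sample_keys)
    estimated = {k: c / sampling_ratio for k, c in counts.items()}
    mean = sum(estimated.values()) / len(estimated)
    ordinary, second_order = set(), set()
    for key, volume in estimated.items():
        judgment = volume / mean          # data skew judgment value
        if judgment > second_threshold:
            second_order.add(key)
        elif judgment > skew_threshold:
            ordinary.add(key)
    return ordinary, second_order


def salted_sum(records, skewed_keys, m, rng):
    """Two-stage sum: local aggregation per (prefix, key), then merge per key."""
    partial = defaultdict(float)
    for key, value in records:
        prefix = rng.randrange(m) if key in skewed_keys else 0
        partial[(prefix, key)] += value   # stage 1: temporary partitions
    merged = defaultdict(float)
    for (_, key), value in partial.items():
        merged[key] += value              # stage 2: strip prefix and merge
    return dict(merged)
```

The random prefix changes only how the work is spread across the M temporary partitions; the merged totals are the same as an unsalted aggregation.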

[0019] Further, the specific process of integrating, verifying, generating reports, and storing them in the cache layer is as follows. The aggregated results obtained after processing the skewed feature keys are integrated uniformly. In line with the general statistical standards for campus reports, the integrated aggregated data undergoes multi-dimensional verification: the completeness of records for each dimension indicator is verified by checking whether the difference between the actual number of records and the theoretical number of records, calculated from the dimension table cardinality and time granularity, exceeds the missing rate threshold; the data consistency between related dimensions is verified by checking whether the deviation of logically related indicators between dimensions exceeds the consistency coefficient threshold; indicator values are verified to be within a reasonable range by comparing them with the upper and lower limits defined by the statistical standards; and time series data is verified to meet continuity requirements by checking whether the fluctuation amplitude of the same indicator between adjacent time periods exceeds the fluctuation threshold. If any data version is detected as not ready, an anomaly repair process is triggered, which performs data supplementation, recalculation, or recovery from the cache according to the anomaly type. After verification, standardized campus report data is generated and stored in the report result cache layer, and the report data version number and readiness flag are updated at the same time.
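Three of the four checks above (completeness, range, and continuity) can be sketched for a single indicator as follows; the cross-dimension consistency check is omitted for brevity. The function name, the use of relative change as the "fluctuation amplitude", and the failure labels are assumptions.

```python
def verify_indicator(values, actual_records, expected_records,
                     lower, upper, missing_rate_threshold,
                     fluctuation_threshold):
    """Return the list of failed checks for one indicator's time series."""
    failures = []
    # Completeness: share of theoretical records that are missing.
    if expected_records > 0:
        missing_rate = (expected_records - actual_records) / expected_records
        if missing_rate > missing_rate_threshold:
            failures.append("completeness")
    # Range: every value must lie within the statistical limits.
    if any(v < lower or v > upper for v in values):
        failures.append("range")
    # Continuity: relative change between adjacent periods (assumed metric).
    for prev, cur in zip(values, values[1:]):
        if prev != 0 and abs(cur - prev) / abs(prev) > fluctuation_threshold:
            failures.append("continuity")
            break
    return failures
```

An empty list means the indicator passed; any label in the list would route the data to the anomaly repair process described above.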

[0020] Beneficial effects

[0021] The present invention has the following beneficial effects:

[0022] (1) This invention performs time axis relocation, cleaning and normalization and consistent hash partitioning on the original data of campus reports through the data acquisition and preprocessing module, and marks hot data partitions and potential skew partitions, thereby realizing the standardization and preliminary load partitioning of multi-source heterogeneous data, and providing a high-quality and perceptible data foundation for hierarchical caching and skew processing.

[0023] (2) This invention constructs a three-level layered cache architecture consisting of an original detail cache layer, an intermediate aggregation cache layer, and a report result cache layer, and introduces a cache heat evaluation formula based on access frequency and update cycle to achieve intelligent preloading, on-demand loading, and layer degradation of hot data, which significantly improves data access speed and reduces redundant calculation overhead.

[0024] (3) In this invention, the ability of each computing node to undertake tasks is evaluated in real time by using the optimized resource allocation calculation formula designed in the parallel computing and dynamic load balancing module. Tasks with slower execution progress or higher resource consumption are dynamically migrated and new tasks are preferentially allocated to the node with the highest optimized resource allocation. This effectively balances the cluster load, avoids single-point overload or idleness, and improves the overall computing efficiency.

[0025] (4) In this invention, the task execution status is monitored through a fault-tolerant retry mechanism. When a node failure or insufficient resources cause the task to fail, the failed task is reassigned to the available node with the highest optimized resource allocation and re-executed. The intermediate data that has been processed is obtained from the cache layer to realize the breakpoint continuation calculation, which avoids the waste of resources caused by full recalculation and enhances the stability and fault tolerance of the system.

[0026] Of course, any product implementing this invention does not necessarily need to achieve all of the advantages described above at the same time.

Attached Figure Description

[0027] Figure 1 is a diagram of the campus report generation optimization system based on data hierarchical caching and parallel computing according to the present invention.

[0028] Figure 2 is a diagram showing the cached data access characteristics and popularity evaluation analysis of the present invention, wherein (a) is a schematic diagram of the relationship between actual access frequency and cache popularity evaluation value; (b) is a schematic diagram of the relationship between the time interval since the last update and cache popularity evaluation value; (c) is a schematic diagram comparing the number of hot data and non-hot data; and (d) is a histogram of cache popularity evaluation value distribution.

[0029] Figure 3 is a correlation analysis diagram of node resource allocation and scheduling coefficients in this invention, wherein (a) is a schematic diagram of node ID and optimized resource allocation; (b) is a schematic diagram of the proportion of the top 10 nodes in terms of optimized resource allocation; (c) is a schematic diagram of the comparison of the top 15 nodes in terms of optimized resource allocation; and (d) is a schematic diagram of the joint distribution of progress weighting coefficient and resource priority coefficient.

[0030] Figure 4 is a diagram illustrating the data skew feature identification and hierarchical determination analysis of the present invention.

Detailed Implementation

[0031] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0032] Referring to Figures 1-4, this invention provides a technical solution: a campus report generation optimization system based on hierarchical data caching and parallel computing, comprising a data acquisition and preprocessing module for acquiring raw report data and performing time-axis relocation, cleaning, normalization, and partitioning / sharding on the raw report data; a data hierarchical caching and storage module for constructing a three-level hierarchical caching architecture based on the preprocessed raw report data, evaluating cache heat, and performing cache data hierarchy degradation processing; a parallel computing and dynamic load balancing module for optimizing resource allocation based on the execution status and resource consumption of computing tasks, adjusting load balancing in real time, and constructing a fault tolerance mechanism and a task retry mechanism; and a dynamic skew processing and adjustment module for using hierarchical sampling and identifying skew keys, performing hierarchical scattering aggregation and operator processing to obtain aggregation results, performing integration and verification to generate reports, and storing them in the cache layer.

[0033] Specifically, the process of acquiring raw report data and performing timeline relocation, cleaning, normalization, and partitioning on it is as follows. All detailed data and basic dimension data generated across the campus reporting process are collected to obtain the raw report data, which includes multi-dimensional time identifiers, aggregation dimension identifiers, terminal identifiers, associated entity identifiers, logically related indicators, and status identifiers. An NTP time server is introduced to synchronize the clocks of each data source, and a sliding window mean filtering algorithm is used to calculate the average deviation of each dimension's time identifier. The window size W is determined from the statistical period and sampling frequency; for example, W may be the number of sampling points within the last 5 minutes. The average deviation is the arithmetic mean of the time deviations of the sampling points in the window. The raw report data is then timeline relocated: the time identifiers of each dimension are extracted, their average time deviation is calculated, and the calibrated standardized business time is used as the unique statistical time identifier. The standardized time is... Dimensionless time-series labels are used for cross-source data alignment. The raw report data is cleaned by line-by-line matching against a set of status rules that includes test identifiers (to distinguish test data from production data), exception status codes, and non-empty checks on key fields, so that test data, abnormal data, and data missing key fields are filtered out as invalid. The dimension identifiers, terminal identifiers, and associated entity identifiers are normalized by mapping heterogeneous codes from different source systems to a unified internal standard dictionary, using a linear mapping method based on hash functions. The normalized identifiers are dimensionless internal key values that no longer carry the physical meaning of the original codes. Based on the report's aggregation dimensions, a consistent hash algorithm is used to initially partition and shard the normalized raw report data by mapping the aggregation dimension values to a hash ring. The generated shard ID determines the shard to which the data belongs; the shard ID is a dimensionless logical partition identifier. At the same time, the data volume and access frequency of each shard are counted, and hot data shards and potentially skewed shards are marked.
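The hash-ring mapping of aggregation dimension values to shard IDs can be sketched as below. The text does not name a hash function or mention virtual nodes; MD5 and 64 virtual nodes per shard are assumptions chosen for illustration, as are the class and method names.

```python
import bisect
import hashlib


def _ring_hash(key: str) -> int:
    # MD5 is used only as a stable, well-spread hash (assumption, not from the text).
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring mapping aggregation dimension values to shard IDs."""

    def __init__(self, shard_ids, vnodes: int = 64):
        if not shard_ids:
            raise ValueError("at least one shard is required")
        # Each shard owns `vnodes` points on the ring to smooth the distribution.
        self._ring = sorted((_ring_hash(f"{s}#{i}"), s)
                            for s in shard_ids for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def shard_for(self, dim_value: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _ring_hash(dim_value)) % len(self._ring)
        return self._ring[idx][1]
```

The property that makes this "consistent" is that adding a shard moves keys only onto the new shard; keys never move between the existing shards.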

[0034] In this embodiment, unified timeline relocation, rule-based cleaning, normalization, and partitioning of the raw report data improve the temporal consistency, the consistency of statistical definitions, and the distribution balance of multi-source data during campus report generation. This avoids the statistical distortion, confusion of statistical definitions, and local computing bottlenecks caused by clock deviation, abnormal dirty data, heterogeneous coding differences, and data skew.

[0035] Specifically, the process of constructing a three-tiered cache architecture based on the preprocessed report data is as follows. Based on the sharding characteristics and access frequency of the preprocessed report data, a three-tiered cache architecture is constructed, consisting of a raw detail cache layer, an intermediate aggregation cache layer, and a report result cache layer. The sharding characteristics include data size, dimensional cardinality, and update frequency, while the access frequency is obtained through sliding window statistics. The raw detail cache layer uses node-local memory-level caching to store hot data shards. Node-local memory-level caching offers microsecond-level read / write latency, but its capacity is limited by single-machine memory. The intermediate aggregation cache layer uses distributed SSD-level caching to store the preliminary aggregation data for each dimension generated during parallel computation. Distributed SSD-level caching balances large capacity with moderate access speed, supporting computation... Nodes share access. The report result caching layer uses a distributed cache cluster to store standardized report data; the cluster uses a multi-replica mechanism to ensure availability, and the storage format is a pre-calculated wide table or indicator set. A dual mechanism of data version number and readiness flag is used to manage cache updates. Each time a report is generated or data is calculated, the data in the cache is verified according to the update timestamp and readiness flag. Verification takes place on cache hit, at task startup, and before data loading. Cache data is considered valid only when the data version number is consistent and the readiness flag is in a completed state; the completed state indicates that the data has passed integrity verification and consistency checks. When the underlying data source changes, the new version of the data completely replaces the old version through an atomic switching mechanism at shard granularity. Shard granularity is based on aggregation dimensions, and atomic switching is implemented through metadata pointer updates. During the switching period, the old version of the data can still serve reads until the new data is fully ready.
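The version-number and readiness-flag check, together with pointer-based atomic switching, can be sketched in a single process as follows. `ShardVersion` and `ShardCache` are hypothetical names, and an in-memory dictionary stands in for the metadata store of a real distributed cache.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardVersion:
    version: int
    ready: bool      # True once integrity and consistency checks have passed
    data: dict


class ShardCache:
    """Per-shard metadata pointer; readers see the old version until the swap."""

    def __init__(self):
        self._current = {}            # shard_id -> ShardVersion
        self._lock = threading.Lock()

    def publish(self, shard_id: str, new: ShardVersion) -> bool:
        """Swap the pointer only if the new version is fully ready."""
        if not new.ready:
            return False              # old version keeps serving reads
        with self._lock:
            self._current[shard_id] = new   # single pointer update
        return True

    def read(self, shard_id: str, expected_version: int):
        """Return data only when version matches and the ready flag is set."""
        entry = self._current.get(shard_id)
        if entry is not None and entry.ready and entry.version == expected_version:
            return entry.data
        return None
```

Because the new version is built off to the side and made visible by one pointer assignment, a reader never observes a mix of old and new records within a shard.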

[0036] In this implementation plan, a three-tiered caching architecture consisting of an original detail caching layer, an intermediate aggregation caching layer, and a report result caching layer is constructed. This achieves a layered acceleration effect in the campus report generation process, effectively reducing the time overhead and resource consumption caused by repeatedly loading the underlying data source and repeatedly executing intermediate calculations. It also ensures the consistency, integrity, and availability of cached data during the update process, avoiding problems such as mixed reading of old and new data, misuse of semi-finished cache, and service interruption during the switching period.

[0037] Specifically, the cache popularity assessment proceeds as follows. The actual access frequency of the data per unit time is obtained by monitoring and counting access requests; the maximum effective update cycle is obtained from statistics on the update frequency of the cached data; the time interval since the last update is obtained by comparing the current time with the timestamp of the last update. The actual access frequency is multiplied by the maximum effective update cycle to obtain the actual popularity; the time interval since the last update is added to the maximum effective update cycle to obtain the cache popularity; the actual popularity is divided by the cache popularity, and the quotient is raised to the power of the popularity weighting index to obtain the cache popularity assessment value. The cache popularity assessment value is a dimensionless indicator used to uniformly measure the combined impact of factors with different dimensions. The exponent adjusts the influence of access frequency and update interval: data with a higher access frequency receives a higher popularity value, while the popularity value of data that has gone longer without an update gradually decreases, so that the allocation of cache resources can be controlled precisely.

[0038] The specific formula for calculating the cache heat evaluation value is as follows:

[0039] $$H = \left( \frac{F \cdot T_{\max}}{\Delta t + T_{\max}} \right)^{\alpha}$$

[0040] In the formula, $H$ represents the cache popularity evaluation value, used to measure the access priority of cached data; $F$ represents the actual frequency of data access per unit time, obtained by monitoring and counting access requests, and is used to measure how often the data is accessed; $T_{\max}$ represents the maximum effective update cycle, obtained from statistics on the update frequency of cached data, and is used to determine the maximum time range for data updates; $\Delta t$ represents the time interval since the last update of the data, obtained by comparing the current time with the timestamp of the last update, and is used to measure the freshness of the data; $\alpha$ represents the popularity weighting index, calculated using a cache data-driven optimization algorithm, whose value ranges from 0 to 1 and which controls the impact of access frequency and data update interval on the cache popularity evaluation value.
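
By way of a hedged example, the calculation described in paragraph [0037] (access frequency times maximum effective update cycle, divided by the sum of the interval since the last update and the maximum effective update cycle, raised to the weighting index) can be sketched as below. The function and parameter names are illustrative assumptions, not terms from the specification.

```python
def cache_heat(access_freq: float, max_update_cycle: float,
               since_last_update: float, weight_index: float) -> float:
    """Cache popularity assessment value per the verbal description in [0037]."""
    if max_update_cycle <= 0 or since_last_update < 0:
        raise ValueError("update cycle must be positive and interval non-negative")
    if not 0.0 <= weight_index <= 1.0:
        raise ValueError("weight index is stated to lie between 0 and 1")
    actual_popularity = access_freq * max_update_cycle
    # The specification calls this sum the "cache popularity" (the denominator).
    denominator = since_last_update + max_update_cycle
    return (actual_popularity / denominator) ** weight_index


# With frequency 10, cycle 10, interval 10, and index 1: (10 * 10) / (10 + 10) = 5.
print(cache_heat(10.0, 10.0, 10.0, 1.0))
```

For a fixed access frequency and update cycle, the value falls as the interval since the last update grows, which matches the intended decay of stale data.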

[0041] A preloading mechanism is used for the raw detail cache layer. The cache heat assessment value is compared with the loading threshold in real time. When the cache heat assessment value exceeds the loading threshold, the data is considered hot data and is loaded into the cache in advance. Before loading, the readiness flag of the data is checked; if the data is not fully ready, the loading request is intercepted, and either an older version of the data is returned or the data is loaded once it becomes ready. A set readiness flag indicates that the data has been cleaned, aggregated, and validated in multiple dimensions, and that its version number is consistent with the underlying data source. When the cache heat assessment value does not reach the loading threshold, preloading is not performed and the data is loaded on demand: it is read from the underlying storage and cached only when a query or calculation request arrives, and only the shards required by the request are loaded. Gating loading on the real-time cache heat assessment value keeps cache resources from being occupied by data that is not hot. The same mechanism is applied to data in the intermediate aggregation cache layer and the report result cache layer, where the heat assessment is used to determine whether to promote the data to a higher-priority cache layer.
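
The loading decision described above can be illustrated with the following minimal sketch. The enum `LoadAction`, the function `plan_load`, and the threshold value used in the example are assumptions made for illustration only.

```python
from enum import Enum


class LoadAction(Enum):
    PRELOAD = "preload"                      # hot and ready: load in advance
    SERVE_OLD_OR_WAIT = "serve_old_or_wait"  # hot but not ready: intercept the load
    ON_DEMAND = "on_demand"                  # not hot: load only when requested


def plan_load(heat: float, load_threshold: float, ready: bool) -> LoadAction:
    """Decide how a data shard should enter the cache."""
    if heat > load_threshold:
        # Hot data is preloaded only when its readiness flag is set.
        return LoadAction.PRELOAD if ready else LoadAction.SERVE_OLD_OR_WAIT
    return LoadAction.ON_DEMAND


# Example with an assumed loading threshold of 10.0.
print(plan_load(18.7473, 10.0, ready=True))
```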

[0042] Table 1 is the cache heat assessment data table. For each data item it lists the actual access frequency, the maximum effective update cycle, the time interval since the last update, the resulting cache heat assessment value, and whether the item is hot data. Data ID1: actual access frequency 19.8985, maximum effective update cycle 17.8220, time interval since last update 27.9180, cache heat assessment value 5.2338, hot data FALSE; Data ID2: actual access frequency 20.6646, maximum effective update cycle 12.8788, time interval since last update 31.8175, cache heat assessment value 5.3349, hot data FALSE; Data ID3: actual access frequency 94.3787, maximum effective update cycle 11.2167, time interval since last update 7.0413, cache heat assessment value 1.0099, hot data FALSE; Data ID17: actual access frequency 22...; Data ID18: actual access frequency 76.3486, maximum effective update cycle 18.7892, time interval since last update 36.6083, cache heat assessment value 18.7473, hot data TRUE; Data ID30: actual access frequency 76.3486, maximum effective update cycle 18.7892, time interval since last update 36.6083, cache heat assessment value 18.7473, hot data TRUE; a further entry: actual access frequency 91.8497, maximum effective update cycle 20.8493, time interval since last update 21.8821, cache heat assessment value 36.3816, hot data TRUE; Data ID31: actual access frequency 6.8005, maximum effective update cycle 19.4829, time interval since last update 4.5481, cache heat assessment value 4.15648, hot data FALSE.

[0043] Table 1. Cache Popularity Evaluation Data Table

[0044]

[0045] As shown in Figure 2, the cached data access characteristics and popularity assessment analysis chart characterizes the access behavior and popularity assessment of cached data. Panel (a) shows the relationship between actual access frequency and cache popularity assessment value: the horizontal axis is the actual access frequency of the cached data, the vertical axis is the cache popularity assessment value, and the scatter color indicates the magnitude of the popularity assessment value. As shown in (a), data with a high access frequency has higher popularity assessment values, which confirms that the popularity assessment identifies hot data. Panel (b) shows the relationship between the time interval since the last update and the cache popularity assessment value: the horizontal axis is the time interval since the last update, and the vertical axis is the popularity assessment value. As shown in (b), data with higher popularity assessment values has shorter update intervals, which characterizes the update activity of hot data. Panel (c) compares the quantity of hot data with the quantity of non-hot data to show how the two types of data are distributed. Panel (d) is a histogram of cache popularity assessment values showing their overall distribution: the values are concentrated in the low interval, and the proportion of high-popularity data is relatively small. Figure 2 provides a basis for caching strategy optimization, hot data preloading, and cache resource allocation.

[0046] In this implementation plan, a cache hotness assessment mechanism based on access frequency, update cycle, and update interval is constructed. This enables precise identification and hierarchical scheduling control of the hotness of campus report data, allowing frequently accessed and update-sensitive data to be prioritized for entry into a higher response level cache layer, while low-frequency or low-value data avoids occupying cache resources for a long time, thereby effectively improving cache resource utilization and hotspot hit efficiency.

[0047] Specifically, the process for downgrading cached data between layers is as follows. When the popularity assessment value of cached data is lower than the eviction threshold, the data is downgraded from the raw detail cache layer to the intermediate aggregation cache layer, or from the intermediate aggregation cache layer to the report result cache layer, or is cleared directly. The downgrade threshold is dynamically adjusted according to the capacity level and access latency requirements of each layer. Before data is cleared, it is checked for downstream dependencies. The consistency of data in each cache layer is monitored in real time by comparing the change records of the cached data with those of the underlying data source. When the underlying data source is updated, the priority of each change record is defined according to the impact of the changed fields on the statistical scope of the report, as determined by the weight of each field in the statistical indicators; the priority has three levels: key field changes, ordinary field changes, and irrelevant field changes. The cache popularity assessment value and the priority of the data source change record are evaluated in real time. When the popularity assessment value exceeds the loading threshold and the data has changed, an incremental update of the corresponding cache layer is triggered. The incremental update modifies only the changed part of the data and is performed at the field or shard level. During the update, the modified part is marked as temporary, and read requests continue to return the old version of the data while the temporary state lasts. After all updates are complete, the new data takes effect through atomic switching, which achieves a seamless transition by incrementing the version number.
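
A minimal sketch of the downgrade decision and the incremental update trigger is given below. Two points are assumptions of this sketch rather than statements of the specification: data in the last layer is cleared only when nothing downstream depends on it, and a change limited to irrelevant fields does not trigger an update. All identifiers are hypothetical.

```python
from enum import IntEnum
from typing import Optional

# Layer order from fastest to slowest, as described in the specification.
TIERS = ["raw_detail", "intermediate_aggregation", "report_result"]


class ChangePriority(IntEnum):
    IRRELEVANT = 0  # irrelevant field change
    ORDINARY = 1    # ordinary field change
    KEY = 2         # key field change


def demote(tier: str, heat: float, evict_threshold: float,
           has_dependents: bool) -> Optional[str]:
    """Return the layer the data should live in; None means it is cleared."""
    if heat >= evict_threshold:
        return tier  # still popular enough to stay where it is
    idx = TIERS.index(tier)
    if idx + 1 < len(TIERS):
        return TIERS[idx + 1]  # move one layer down
    # Assumption: the last layer is cleared only without downstream dependencies.
    return tier if has_dependents else None


def needs_incremental_update(heat: float, load_threshold: float,
                             priority: ChangePriority) -> bool:
    # Assumption: irrelevant field changes never trigger an update.
    return heat > load_threshold and priority > ChangePriority.IRRELEVANT
```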

[0048] This implementation plan establishes a cache layer degradation and incremental update mechanism based on popularity assessment values, capacity levels, access latency requirements, and data change priorities. This enables cached data to migrate dynamically between the raw detail cache layer, the intermediate aggregation cache layer, and the report result cache layer. It prevents low-popularity data from occupying high-performance cache resources for a long time while ensuring that high-value, time-sensitive data is retained in the appropriate layer, thereby improving overall cache space utilization and query response efficiency.

[0049] Specifically, resource allocation is optimized based on the execution status and resource consumption of computing tasks as follows. The total resource quantity of a computing node is obtained from its hardware configuration, including the number of CPU cores, memory capacity, disk IOPS, and network bandwidth, acquired through node registration information or a configuration center. The task execution progress of the computing node is obtained by monitoring its task execution status in real time and is quantified as the proportion of data completed so far to the total data volume, or as the percentage of executed stages in the overall process. The total time consumed by the computing task is obtained by monitoring its start and end times; it is used to evaluate task execution efficiency and node processing capacity and supports task time statistics and trend analysis. The remaining resources of the computing node, including available CPU, idle memory, and disk throughput capacity, are collected in real time through the operating system or container runtime interface. A resource allocation evaluation model is then constructed: 1 is added to the task execution progress of the computing node, and the sum is raised to the power of the progress weighting coefficient to obtain the progress correlation value; the total time of the computing task is divided by the remaining resources of the computing node, and the quotient is raised to the power of the resource priority coefficient to obtain the resource correlation value; the progress correlation value is multiplied by the resource correlation value to obtain the denominator; and the total resource quantity of the computing node is divided by the denominator to obtain the optimized resource allocation quantity. 
The optimized resource allocation quantity is a dimensionless indicator used to uniformly measure the combined impact of factors with different dimensions.

[0050] The specific formula for calculating the optimized resource allocation is as follows:

[0051] $$R_{\text{opt}} = \frac{R_{\text{total}}}{(1 + P)^{\beta} \cdot \left( \dfrac{T}{R_{\text{rem}}} \right)^{\gamma}}$$

[0052] In the formula, $R_{\text{opt}}$ represents the optimized resource allocation for each computing node, used to determine the optimal resource allocation of that node; $R_{\text{total}}$ represents the total resources of a computing node, obtained from the hardware configuration of the node, takes a positive value, and is used to ensure that resource allocation does not exceed the total capacity of the node; $P$ represents the task execution progress of the computing node, obtained by monitoring the task execution status of the node in real time, ranges from 0 to 1, and measures the progress of task execution on the current node; $\beta$ represents the progress weighting coefficient, calculated through regression analysis of the data, ranges from 0 to 1, and adjusts the degree to which task execution progress affects resource allocation; $T$ represents the total time taken by the computing task, obtained by monitoring the start and end times of the task, and takes a value greater than 0; $R_{\text{rem}}$ represents the remaining resources of the computing node, obtained by real-time monitoring, takes a positive value, and measures the remaining computing power of the current node; $\gamma$ represents the resource priority coefficient, dynamically adjusted by the task scheduling system based on the resource status of the computing nodes, ranges from 0 to 1, and adjusts the priority of task allocation based on the remaining resources.
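
For illustration, the resource allocation evaluation model described in paragraph [0049] can be written as the short function below. The function and parameter names are assumptions; the arithmetic follows the verbal description (total resources divided by the product of the progress correlation value and the resource correlation value).

```python
def optimized_allocation(total: float, progress: float, progress_weight: float,
                         total_time: float, remaining: float,
                         resource_priority: float) -> float:
    """Optimized resource allocation quantity for one computing node."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie between 0 and 1")
    if total <= 0 or total_time <= 0 or remaining <= 0:
        raise ValueError("total, total_time and remaining must be positive")
    # Progress correlation value: (1 + progress) ** progress weighting coefficient.
    progress_term = (1.0 + progress) ** progress_weight
    # Resource correlation value: (time / remaining) ** resource priority coefficient.
    resource_term = (total_time / remaining) ** resource_priority
    return total / (progress_term * resource_term)


# 16 / ((1 + 1) ** 1 * (8 / 4) ** 1) = 16 / 4 = 4
print(optimized_allocation(16.0, 1.0, 1.0, 8.0, 4.0, 1.0))
```

Under this model a node with fewer remaining resources, or with a task that has already run for a long time, receives a smaller allocation quantity.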

[0053] As shown in Figure 3, the diagram of node resource allocation and scheduling coefficients characterizes the correlation between node-side resource allocation results and the scheduling coefficients. Panel (a) plots node ID against optimized resource allocation: the horizontal axis is the node ID, and the vertical axis is the optimized resource allocation. As shown in (a), the resource allocation varies significantly among nodes, with a few nodes receiving high allocations and the rest mainly distributed in low ranges, reflecting the non-uniformity of resource allocation. Panel (b) shows the share of each of the top 10 nodes by optimized resource allocation, illustrating the composition and concentration of the high-allocation nodes. Panel (c) is a bar chart comparing the resource allocations of the top 15 nodes, presenting the relative differences and ranking among high-resource nodes. Panel (d) shows the joint distribution of the progress weighting coefficient and the resource priority coefficient: the horizontal axis is the progress weighting coefficient, the vertical axis is the resource priority coefficient, and the color represents the frequency of each joint value in the sample. As can be seen from (d), the joint distribution is mainly concentrated in the low coefficient range and relatively clustered in the region of medium coefficient combinations, which represents the correspondence between scheduling coefficient combinations and resource scheduling behavior.

[0054] In this implementation plan, by comprehensively introducing multi-dimensional operational status information such as the total resource quantity of computing nodes, task execution progress, total task time, and remaining resources, a resource allocation optimization mechanism oriented towards the collaborative constraints of execution status and resource consumption is constructed. This realizes dynamic balanced scheduling and adaptive resource allocation of computing tasks among multiple nodes, avoiding the node idleness and local overload problems caused by allocating resources solely based on static hardware capabilities.

[0055] Specifically, the real-time load balancing adjustment process is as follows: The optimized resource allocation of each node is compared with the ratio threshold in real time. When the optimized resource allocation of a node is lower than the ratio threshold, tasks currently executing on that node whose progress is below the progress threshold or whose resource consumption is above the consumption threshold, and which have checkpoint or recalculation capabilities, are marked as tasks to be migrated. These tasks are then assigned to computing nodes with optimized resource allocations above the ratio threshold. The optimized resource allocation is dynamically calculated based on the total resources of the node, remaining resources, and task load. The ratio threshold is set according to the cluster's average resource utilization and balance requirements. The progress threshold and consumption threshold are dynamically adjusted according to the task type. Tasks with checkpoint or recalculation capabilities refer to operators that can save intermediate states or support re-execution. During migration, the task metadata is sent to the target node through the scheduler, and the task is restored or restarted from the checkpoint. For newly submitted computing tasks, initial allocation is performed based on the current optimized resource allocation of each node, prioritizing the assignment of new tasks to the computing node with the highest optimized resource allocation. The optimized resource allocation is periodically updated during task execution. The initial allocation takes into account data locality to reduce network overhead, and the periodic update recalculates the allocation amount by monitoring node resource usage and task progress in real time, providing a basis for task scheduling and load balancing.
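
The migration rule above can be sketched as follows. The `Task` fields and the function `plan_migrations` are hypothetical, and spreading migrated tasks over the receiving nodes in round-robin order (highest allocation first) is an assumption of this sketch; the specification only requires that targets lie above the ratio threshold.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Task:
    task_id: str
    node: str           # node the task currently runs on
    progress: float     # execution progress, 0 to 1
    consumption: float  # current resource consumption
    resumable: bool     # has checkpoint or recalculation capability


def plan_migrations(allocation: Dict[str, float], tasks: List[Task],
                    ratio_threshold: float, progress_threshold: float,
                    consumption_threshold: float) -> List[Tuple[str, str]]:
    """Return (task_id, target_node) pairs for tasks that should move."""
    # Nodes above the ratio threshold can receive work, best node first.
    receivers = sorted((n for n, a in allocation.items() if a > ratio_threshold),
                       key=lambda n: allocation[n], reverse=True)
    if not receivers:
        return []
    plan: List[Tuple[str, str]] = []
    for task in tasks:
        on_weak_node = allocation.get(task.node, float("inf")) < ratio_threshold
        lagging = (task.progress < progress_threshold
                   or task.consumption > consumption_threshold)
        if on_weak_node and lagging and task.resumable:
            # Assumption: round-robin over receivers to avoid piling onto one node.
            plan.append((task.task_id, receivers[len(plan) % len(receivers)]))
    return plan
```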

[0056] In this implementation plan, a real-time load balancing mechanism based on the optimized resource allocation realizes dynamic migration and adaptive reallocation of computing tasks among multiple nodes. Nodes with insufficient resources, excessive load, or decreased execution efficiency can release migratable tasks in a timely manner and transfer them smoothly to nodes with more sufficient resources to continue execution, which prevents individual nodes from becoming long-tail bottlenecks or hotspot blocking points.

[0057] Specifically, the process of constructing the fault tolerance mechanism and task retry mechanism is as follows: Monitor the execution status of computing tasks. When a computing task fails due to a fault in its computing node or when the node's remaining resources fall below the minimum resource threshold, it is determined to be unavailable, triggering the fault tolerance mechanism. The failed computing task is marked as a task to be retried. Based on the optimized resource allocation of each available computing node, the task to be retried is assigned to the available computing node with the highest optimized resource allocation for re-execution. The optimized resource allocation is calculated in real-time, taking into account the node's remaining resources, task priority, and resource fragmentation. When retrying a failed computing task, check if complete intermediate data exists in the cache. Complete intermediate data must meet the following requirements: the version number matches the version number recorded in the task execution context; the readiness flag is "completed"; and it contains all the fragment keys required by the task. The check also verifies the data's integrity and recoverability. If the data exists, it is directly loaded and execution continues from the point of interruption; otherwise, expired or incomplete data in the cache is discarded, and the entire computing task is re-executed. Re-executed tasks are given priority in resource allocation, and the reason for failure is recorded for scheduling optimization.
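
The retry decision can be illustrated with the sketch below, which checks the three stated conditions for complete intermediate data and picks the available node with the highest optimized resource allocation. The names `Intermediate`, `can_resume`, and `pick_retry_node` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Intermediate:
    version: int          # version number of the cached intermediate data
    ready: bool           # readiness flag; True stands for "completed"
    shard_keys: Set[str]  # shard keys contained in the cached data


def can_resume(cached: Optional[Intermediate], context_version: int,
               required_keys: Set[str]) -> bool:
    """True if a retried task may continue from the cached intermediate data."""
    return (cached is not None
            and cached.version == context_version  # matches the task context
            and cached.ready                       # readiness flag is completed
            and required_keys <= cached.shard_keys)  # all needed shards present


def pick_retry_node(allocation: Dict[str, float], remaining: Dict[str, float],
                    min_resource: float) -> Optional[str]:
    """Choose the available node with the highest optimized allocation."""
    available = [n for n in allocation if remaining.get(n, 0.0) >= min_resource]
    return max(available, key=lambda n: allocation[n], default=None)
```

When `can_resume` returns False, the stale or incomplete cache entry would be discarded and the whole task re-executed, as the paragraph above describes.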

[0058] In this implementation plan, by constructing a fault-tolerance mechanism and a task retry mechanism for node failure and resource shortage scenarios, the computing task can be quickly rescheduled based on node availability and optimized resource allocation after execution interruption. This effectively avoids the problem of overall task rollback and duplicate calculation caused by single node failure, sudden resource drop or cache failure.

[0059] Specifically, the aggregation result is obtained through hierarchical sampling, skew key identification, hierarchical scattering aggregation, and operator processing as follows. Based on the aggregated data in the intermediate aggregation cache layer, hierarchical sampling is performed at the data sharding level. The sampling ratio is dynamically adjusted according to the amount of data in each shard and the accuracy requirements. Samples are extracted for each aggregation dimension feature key in each shard according to the sampling ratio, using random or equidistant sampling so that the samples are unbiased. The frequency of each feature key in the samples is counted, and the actual data volume of each feature key is estimated from the sampling ratio: the sample frequency is divided by the sampling ratio and then multiplied by the sharding weight coefficient, which reflects the importance of the shard in the full data. The ratio of the actual data volume of each feature key to the average data volume of all aggregation dimension feature keys is calculated and recorded as the data skew judgment value. Feature keys whose data skew judgment value is greater than the skew threshold are identified as skewed feature keys, and a skewed feature key dataset is constructed. An adaptive partitioning strategy is applied to the skewed feature key dataset. Skewed keys whose skew judgment value is greater than the skew threshold but less than or equal to the second-order skew threshold are treated as ordinary skewed feature keys: a random prefix is added, and the data is scattered into M temporary computation partitions. M is determined dynamically from the degree of skew and is typically a prime number such as 7 or 13 to ensure uniform distribution. 
The length and range of the random prefix are calculated from the feature key cardinality and the number of partitions so that data volume is balanced across partitions after scattering. Each temporary partition independently completes its local aggregation, and the local results are then merged to obtain the aggregated result for the feature key; during merging, the random prefix is removed and the final aggregation is performed. If aggregation occurs multiple times, incremental merging is used to reduce redundant calculation. Skewed keys whose skew judgment value is greater than the second-order skew threshold are treated as second-order skewed feature keys: a dedicated aggregation operator is started separately, with independent computing resources and dedicated cache shards allocated to it. The dedicated operator uses two-stage aggregation or incremental updating, and the dedicated cache shards avoid data contention and improve locality.
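
As a single-process illustration of the steps above, the sketch below estimates per-key volumes from a sample, classifies keys against the two skew thresholds, and performs a salted (random-prefix) two-stage sum. The function names are assumptions, the "partitions" are simulated with an in-memory dictionary, and summation stands in for whatever aggregation the report requires.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def estimate_volumes(sample_keys: Iterable[str], sampling_ratio: float,
                     shard_weight: float = 1.0) -> Dict[str, float]:
    # Estimated volume = sample frequency / sampling ratio * shard weight.
    counts = Counter(sample_keys)
    return {k: c / sampling_ratio * shard_weight for k, c in counts.items()}


def classify(volumes: Dict[str, float], skew_threshold: float,
             second_threshold: float) -> Dict[str, str]:
    mean = sum(volumes.values()) / len(volumes)
    labels = {}
    for key, volume in volumes.items():
        judgment = volume / mean  # data skew judgment value
        if judgment > second_threshold:
            labels[key] = "second_order"
        elif judgment > skew_threshold:
            labels[key] = "ordinary"
        else:
            labels[key] = "normal"
    return labels


def salted_sum(records: List[Tuple[str, float]], m: int,
               seed: int = 0) -> Dict[str, float]:
    rng = random.Random(seed)
    # Stage 1: local aggregation per (random prefix, key) temporary partition.
    partial: Dict[Tuple[int, str], float] = defaultdict(float)
    for key, value in records:
        partial[(rng.randrange(m), key)] += value
    # Stage 2: strip the prefix and merge the local results.
    merged: Dict[str, float] = defaultdict(float)
    for (_prefix, key), subtotal in partial.items():
        merged[key] += subtotal
    return dict(merged)
```

Because the prefix is removed before the final merge, the result equals that of a plain group-by sum; only the distribution of intermediate work changes.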

[0060] As shown in Figure 4, the data skew feature identification and grading analysis chart uses a dual-axis visualization to present the system's logic for identifying and grading data skew. The left vertical axis (actual data volume) presents the distribution of actual data volume for each feature key as a scatter plot, distinguishing skew levels by color and size: blue dots represent normal feature keys with low and stable data volume; yellow dots represent ordinary skewed feature keys with slightly higher data volume than normal; large red dots represent second-order skewed feature keys with a significant surge in data volume, and the dot size is positively correlated with the data volume. On the right vertical axis (data skew judgment value), the green curve represents the data skew judgment value, which closely follows the data volume trend on the left; the orange dashed line represents the skew threshold, and the red dashed line represents the second-order skew threshold, serving as the baselines for grading. When the judgment value exceeds the orange threshold, the feature key is marked as ordinary skewed; when the judgment value exceeds the red second-order threshold, a second-order skew warning is triggered. This corresponds to the pronounced peak in the feature key index range of 50 to 63 in the chart, which matches the distribution of second-order skewed feature keys. Overall, Figure 4 illustrates the hierarchical determination mechanism for data skew: by linking data volume with the skew judgment value and applying dual-threshold grading, the system identifies normal, ordinary skewed, and second-order skewed feature keys, providing intuitive judgment criteria and visualization support for data partitioning optimization, load balancing, and system stability improvement.

[0061] In this implementation scheme, by performing hierarchical sampling, skew key identification, and hierarchical scattering aggregation processing on the intermediate aggregation cache layer data, the system achieves accurate identification and differentiated handling of data skew problems. This effectively reduces the long tail of tasks, partition imbalance, and cache contention caused by hot keys, and significantly improves the load balancing, parallel processing efficiency, and system stability of aggregation computing.

[0062] Specifically, integration, verification, report generation, and storage in the cache layer proceed as follows. The aggregation results obtained after processing the skewed feature keys are integrated uniformly using hash mapping and merge sorting. The integrated aggregated data then undergoes multi-dimensional validation against the general statistical standards for campus reports, which are derived from Ministry of Education standards and internal school business rules. The completeness of records for each dimension is validated by checking whether the difference between the actual number of records and the theoretical number of records exceeds the missing rate threshold; the theoretical number of records is obtained by multiplying the dimension table cardinality by the time granularity. Data consistency between related dimensions is validated by checking whether the deviation of logically related indicators between dimensions exceeds the consistency coefficient threshold; logically related indicators include, for example, the relationship between student counts and the total number of students in a class. Indicator values are checked against the upper and lower thresholds defined by the statistical standards to verify that they lie within a reasonable range. Time series data is checked for continuity by verifying whether the fluctuation of the same indicator between adjacent time periods exceeds the fluctuation threshold. If any data version is detected as not ready, an anomaly repair process is triggered, which performs data supplementation, recalculation, or recovery from the cache according to the anomaly type. "Not ready" means that the version numbers are inconsistent or that the readiness flag is not in the completed state. 
Anomaly types include missing data, calculation errors, and cache invalidation, and the repair process is selected automatically based on the type. After successful verification, standardized campus report data is generated and stored in the report result cache layer, and the version number and readiness flag of the report data are updated at the same time. Standardized reports use Parquet columnar storage; the cache layer writes data in units of shards, incrementing the version number and setting the readiness flag to "completed."
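
The four validation checks can be sketched as simple predicates. The function names are hypothetical, the time granularity is modeled as a count of time periods, and expressing the consistency deviation and the fluctuation as relative differences is an assumption of this sketch, since the specification does not fix the exact measure.

```python
from typing import Sequence


def completeness_ok(actual_records: int, dim_cardinality: int,
                    time_periods: int, missing_rate_threshold: float) -> bool:
    # Theoretical record count = dimension table cardinality * time periods.
    expected = dim_cardinality * time_periods
    if expected <= 0:
        raise ValueError("expected record count must be positive")
    missing_rate = max(expected - actual_records, 0) / expected
    return missing_rate <= missing_rate_threshold


def consistency_ok(sum_of_parts: float, total: float, threshold: float) -> bool:
    # Assumption: deviation is measured relative to the total.
    return abs(sum_of_parts - total) / max(abs(total), 1e-12) <= threshold


def range_ok(value: float, lower: float, upper: float) -> bool:
    return lower <= value <= upper


def continuity_ok(series: Sequence[float], fluctuation_threshold: float) -> bool:
    # Assumption: fluctuation is the relative change between adjacent periods.
    for previous, current in zip(series, series[1:]):
        if previous != 0 and abs(current - previous) / abs(previous) > fluctuation_threshold:
            return False
    return True
```

A report would be written to the report result cache layer only if all four predicates hold for every dimension and indicator being checked.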

[0063] In this implementation plan, by performing unified integration and multi-dimensional verification on the aggregation results after processing the skewed feature keys, the accuracy and reliability of the campus report generation results in terms of record completeness, dimensional consistency, numerical rationality and time continuity are effectively guaranteed. This avoids problems such as report distortion, misreporting and inconsistent results caused by data missing, correlation deviation, abnormal fluctuation or version incompatibility.

[0064] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.

[0065] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A campus report generation optimization system based on hierarchical data caching and parallel computing, characterized in that, include: The data acquisition and preprocessing module is used to acquire the raw data of the report and perform time axis relocation, cleaning, normalization and segmentation on the raw data of the report. The data tiered caching and storage module is used to build a three-level tiered caching architecture based on the preprocessed report raw data, perform cache heat assessment, and perform cache data tier degradation processing. The specific process of constructing a three-level hierarchical caching architecture based on the preprocessed original report data is as follows: Based on the sharding characteristics and access frequency of the preprocessed report raw data, a three-level layered cache architecture is constructed, consisting of a raw detail cache layer, an intermediate aggregation cache layer, and a report result cache layer. The raw detail cache layer uses node-local memory-level cache to store hot data shards, the intermediate aggregation cache layer uses distributed SSD-level cache to store the preliminary aggregation data of each dimension generated during parallel computing, and the report result cache layer uses a distributed cache cluster to store the generated standardized report data. The cache updates are managed using a dual mechanism of data version number and readiness flag. Each time a report is generated or data is calculated, the data in the cache is verified based on the update timestamp and readiness flag. The cached data is considered valid only when the data version number is consistent and the readiness flag is in a completed state. When the underlying data source changes, the new version data replaces the old version as a whole through an atomic switching mechanism with sharding as the granularity. 
The parallel computing and dynamic load balancing module is used to optimize resource allocation based on the execution status and resource consumption of computing tasks, adjust load balancing in real time, and build fault tolerance and task retry mechanisms. The dynamic skew processing and adjustment module is used to apply hierarchical sampling and identify skew keys, perform hierarchical scattering, aggregation, and operator processing to obtain aggregation results, and perform integration and verification to generate reports and store them in the cache layer. An adaptive partitioning strategy is adopted for the skewed feature key dataset. Skewed keys whose skew judgment values are greater than the skew threshold and less than or equal to the second-order skew threshold are treated as ordinary skewed feature keys. A random prefix is added to scatter them into M temporary computation partitions. Each temporary partition independently completes its local aggregation calculation, and the local aggregation results are then merged to obtain the aggregated result for each feature key. Skewed keys whose skew judgment values are greater than the second-order skew threshold are treated as second-order feature keys. For these, a dedicated aggregation calculation operator is started separately, and independent computing resources and dedicated cache partitions are allocated.
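The adaptive partitioning strategy at the end of claim 1 can be illustrated with a minimal, single-process sketch. The function names `classify_key` and `salted_sum`, the use of a sum as the aggregation, and the integer salt standing in for the "random prefix" are assumptions made for illustration; the claim does not fix any of them.

```python
import random
from collections import defaultdict

def classify_key(skew_value, skew_threshold, second_order_threshold):
    """Classify one feature key by its skew judgment value (hypothetical helper)."""
    if skew_value > second_order_threshold:
        return "second_order"      # would get a dedicated operator and resources
    if skew_value > skew_threshold:
        return "ordinary_skew"     # would be scattered with a random prefix
    return "normal"

def salted_sum(records, skewed_keys, m, rng=None):
    """Sum values per key, scattering skewed keys over m temporary partitions."""
    rng = rng or random.Random(0)
    partial = defaultdict(float)   # (salt, key) -> local aggregate
    for key, value in records:
        # Only skewed keys receive a random salt; other keys stay in partition 0.
        salt = rng.randrange(m) if key in skewed_keys else 0
        partial[(salt, key)] += value          # stage 1: local aggregation
    merged = defaultdict(float)
    for (_salt, key), subtotal in partial.items():
        merged[key] += subtotal                # stage 2: drop the salt and merge
    return dict(merged)
```

In a distributed engine the first stage would run on M separate partitions in parallel; here both stages run in one loop only to show that merging the salted partial results reproduces the unsalted total.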

2. The campus report generation optimization system based on hierarchical data caching and parallel computing according to claim 1, characterized in that: The specific process of acquiring the raw report data and performing time axis relocation, cleaning, normalization, and segmentation on the raw report data is as follows: All detailed data and basic dimension data generated across the entire campus reporting process are collected to obtain the raw report data. The raw report data includes multi-dimensional time identifiers, aggregated dimension identifiers, terminal identifiers, associated entity identifiers, logically related indicators, and status identifiers. The raw report data is relocated along the time axis by introducing an NTP time server to synchronize clocks across the data sources and by applying a sliding window mean filtering algorithm: the time identifiers of each dimension are extracted, the average time deviation of each dimension is calculated, and the calibrated standardized business time is used as the unique statistical time identifier. The raw report data is cleaned by matching it against rule sets covering test identifiers, exception status codes, and key field integrity checks, filtering out invalid records, namely test records, exception records, and records with missing key fields. The dimension identifiers, terminal identifiers, and associated entity identifiers are then normalized through a unified coding standard and a linear mapping method. Based on the report aggregation dimension, a consistent hashing algorithm is used to perform preliminary partitioning and sharding of the normalized raw report data. At the same time, the data volume and access frequency of each shard are counted, and hot data shards and potentially skewed shards are marked.
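The consistent hashing step in claim 2 can be sketched as a hash ring with virtual nodes. The class name `HashRing`, the choice of SHA-256, and the default of 64 virtual nodes per shard are illustrative assumptions, not details taken from the claim.

```python
import bisect
import hashlib

def _hash(text):
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class HashRing:
    """Consistent-hash ring assigning aggregation-dimension keys to shards."""

    def __init__(self, shards, vnodes=64):
        # Each shard is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, dimension_key):
        # The key belongs to the first ring point clockwise from its hash.
        idx = bisect.bisect_right(self._points, _hash(dimension_key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters for resharding is that adding a shard only moves keys onto the new shard; keys never move between existing shards.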

3. The campus report generation optimization system based on hierarchical data caching and parallel computing according to claim 1, characterized in that: The specific process of the cache heat assessment, that is, of evaluating cache popularity, is as follows: The actual access frequency of the data per unit time is obtained by monitoring and counting access requests. The maximum effective update cycle is obtained by statistically analyzing the update frequency of the cached data. The time interval since the last update is obtained by comparing the timestamp of the current data with the timestamp of its last update. The actual access frequency is multiplied by the maximum effective update cycle to obtain the actual popularity. The time interval since the last update is added to the maximum effective update cycle to obtain the cache popularity. The actual popularity is divided by the cache popularity, and an exponential operation is applied to the quotient to obtain the cache popularity assessment value. A preloading mechanism is used for the raw detail cache layer. The cache popularity assessment value is compared with the loading threshold in real time. When the cache popularity assessment value exceeds the loading threshold, the data is considered hot data and is loaded into the cache in advance. Before loading, the readiness flag of the data is checked. If the data is not fully ready, the loading request is intercepted, and either the old version of the data is returned or the data is loaded once it is ready. When the cache popularity assessment value does not reach the loading threshold, preloading is not performed, and the data is loaded on demand. For data that is not in the raw detail cache layer, an on-demand loading mechanism is used, and the data is loaded into the cache only when a calculation or query request is triggered. The cache popularity assessment value calculated in real time determines whether data needs to be loaded into the cache, ensuring that cache resources are not occupied unnecessarily.
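The popularity calculation and loading decision in claim 3 can be sketched as follows. The claim does not say which exponential operation is applied to the quotient, so the default `op=math.exp` is an assumption, as are the function names and the three decision labels.

```python
import math

def popularity_value(access_freq, max_update_cycle, since_last_update, op=math.exp):
    """Cache popularity assessment value for one piece of data.

    `op` is the exponential operation; math.exp is assumed here because the
    claim leaves the exact form open.
    """
    actual_popularity = access_freq * max_update_cycle
    cache_popularity = since_last_update + max_update_cycle
    return op(actual_popularity / cache_popularity)

def load_decision(value, load_threshold, ready):
    """Return 'preload', 'defer', or 'on_demand' (hypothetical labels)."""
    if value <= load_threshold:
        return "on_demand"        # not hot: load only when a request arrives
    # Hot data is preloaded only if its readiness flag is set; otherwise the
    # load is intercepted until the data is ready.
    return "preload" if ready else "defer"
```

With this form the value falls as the time since the last update grows, so data that has not been refreshed for a long time is less likely to cross the loading threshold.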

4. The campus report generation optimization system based on hierarchical data caching and parallel computing according to claim 1, characterized in that: The specific process of performing cache data tier degradation is as follows: When the popularity assessment value of cached data falls below the eviction threshold, the data is downgraded from the raw detail cache layer to the intermediate aggregation cache layer, or from the intermediate aggregation cache layer to the report result cache layer, or the data is cleared directly. The data consistency of each cache layer is monitored in real time by comparing the change records of the cached data with those of the underlying data source. When the underlying data source is updated, the priority of each change record is defined according to the impact of the changed field on the statistical scope of the report. The cache popularity assessment value and the priority of the data source change record are checked in real time. When the popularity assessment value exceeds the loading threshold and the data has changed, an incremental update of the corresponding cache layer is triggered. The incremental update mechanism updates only the modified part of the data. During the incremental update, the modified part is marked as temporary. After all updates are completed, the new data takes effect through atomic switching.
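The incremental update with temporary marking and atomic switching described in claim 4 can be sketched as a staging area that is published in one step. The class `ShardCache` and its method names are hypothetical, and a single in-process lock stands in for whatever atomic mechanism a distributed cache would provide.

```python
import threading

class ShardCache:
    """One cache shard whose visible contents are replaced in a single step."""

    def __init__(self, data, version):
        self._lock = threading.Lock()
        self._live = (version, dict(data))   # what readers currently see
        self._staged = {}                    # temporary, not yet visible changes

    def stage(self, key, value):
        """Record one modified entry without exposing it to readers."""
        self._staged[key] = value

    def commit(self, new_version):
        """Publish all staged changes together with the new version number."""
        _old_version, data = self._live
        merged = {**data, **self._staged}    # only the changed keys are replaced
        with self._lock:
            self._live = (new_version, merged)
        self._staged = {}

    def read(self):
        """Return the (version, data) pair that is currently live."""
        with self._lock:
            return self._live
```

Readers therefore see either the complete old version or the complete new version, never a mixture of the two.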

5. The campus report generation optimization system based on hierarchical data caching and parallel computing according to claim 1, characterized in that: The specific process of optimizing resource allocation based on the execution status and resource consumption of computing tasks is as follows: The total resources of each computing node are obtained by analyzing its hardware configuration, the task execution progress is obtained by monitoring the task execution status of the computing node in real time, the total time consumed by the computing tasks is obtained by monitoring the start and end times of the tasks, and the remaining resources of the computing node are obtained by monitoring the node in real time. A resource allocation evaluation model is then constructed. 1 is added to the task execution progress of the computing node, and the progress weighting coefficient is applied to the sum as an exponent to obtain the progress correlation value. The total time consumed by the computing tasks is divided by the remaining resources of the computing node, and the resource priority coefficient is applied to the quotient as an exponent to obtain the resource correlation value. The progress correlation value and the resource correlation value are multiplied to obtain the denominator. The total resources of the computing node are divided by the denominator to obtain the optimized resource allocation.
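One reading of the resource allocation evaluation model in claim 5 is sketched below. Treating the two coefficients as exponents, and the parameter names `alpha` and `beta`, are assumptions; the claim wording admits other interpretations.

```python
def optimized_allocation(total, progress, elapsed, remaining, alpha=1.0, beta=1.0):
    """Optimized resource allocation for one computing node.

    total     -- total resources of the node
    progress  -- task execution progress on the node, in [0, 1]
    elapsed   -- total time consumed by the node's computing tasks
    remaining -- remaining resources of the node
    alpha     -- progress weighting coefficient (assumed to be an exponent)
    beta      -- resource priority coefficient (assumed to be an exponent)
    """
    if elapsed <= 0 or remaining <= 0:
        raise ValueError("elapsed time and remaining resources must be positive")
    progress_term = (1.0 + progress) ** alpha
    resource_term = (elapsed / remaining) ** beta
    return total / (progress_term * resource_term)
```

Under this reading, a node that has consumed a lot of time relative to its remaining resources receives a lower value, which is what the load balancing step then compares against its ratio threshold.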

6. The campus report generation optimization system based on hierarchical data caching and parallel computing according to claim 1, characterized in that: The specific process of real-time load balancing adjustment is as follows: The system compares the optimized resource allocation of each node with the ratio threshold in real time. When the optimized resource allocation of a node is lower than the ratio threshold, tasks currently executing on that node that have a progress lower than the progress threshold or a resource consumption rate higher than the consumption threshold, and that have checkpoints or can be recalculated, are marked as tasks to be migrated. These tasks are then assigned to computing nodes whose optimized resource allocation is higher than the ratio threshold. For newly submitted computing tasks, an initial allocation is performed based on the current optimized resource allocation of each node, with new tasks assigned preferentially to the computing node with the highest optimized resource allocation. The optimized resource allocation is also updated periodically during task execution.
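The migration rule in claim 6 can be sketched as a small planning function. The `Task` fields, the function names, and the choice to send every migrated task to the single best target node are illustrative assumptions; the claim only requires a target above the ratio threshold.

```python
from typing import NamedTuple

class Task(NamedTuple):
    name: str
    progress: float      # fraction of the task completed
    consumption: float   # resource consumption rate
    restartable: bool    # has a checkpoint or can be recalculated

def tasks_to_migrate(tasks, progress_threshold, consumption_threshold):
    """Select tasks that are slow or resource-hungry and safe to move."""
    return [
        t for t in tasks
        if t.restartable
        and (t.progress < progress_threshold or t.consumption > consumption_threshold)
    ]

def plan_migrations(node_alloc, node_tasks, ratio_threshold,
                    progress_threshold, consumption_threshold):
    """Map task name -> target node for tasks leaving overloaded nodes."""
    targets = sorted(
        (n for n, alloc in node_alloc.items() if alloc > ratio_threshold),
        key=node_alloc.get, reverse=True,
    )
    plan = {}
    if not targets:
        return plan                      # nowhere to migrate to
    for node, alloc in node_alloc.items():
        if alloc < ratio_threshold:      # node is below the ratio threshold
            for task in tasks_to_migrate(node_tasks.get(node, []),
                                         progress_threshold, consumption_threshold):
                plan[task.name] = targets[0]
    return plan
```

A production scheduler would recompute the allocation values after each assignment rather than sending everything to one node; the sketch omits that refresh for brevity.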

7. The campus report generation optimization system based on hierarchical data caching and parallel computing according to claim 1, characterized in that: The specific process of constructing the fault tolerance mechanism and task retry mechanism is as follows: The execution status of computing tasks is monitored, and the fault tolerance mechanism is triggered when a computing task is detected to have failed because of a computing node failure or because the remaining resources of the node are lower than the resource threshold. Failed computing tasks are marked as tasks to be retried. Based on the optimized resource allocation of each available computing node, the tasks to be retried are assigned to the available computing node with the highest optimized resource allocation for re-execution. When a failed computing task is retried, the cache is checked for complete intermediate data. Complete intermediate data must meet the following conditions: its version number is consistent with the version number recorded in the task execution context, its readiness flag is marked as completed, and it contains all the shard keys required by the task. If such data exists, it is loaded directly and execution continues from the breakpoint; otherwise, the expired or incomplete data in the cache is discarded and the entire computing task is re-executed.
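The completeness check applied before a retry in claim 7 can be sketched as follows. The cache is modelled as a plain dictionary, and the entry fields `version`, `ready`, and `shards` are hypothetical names for the version number, readiness flag, and stored shard keys.

```python
def is_complete(entry, expected_version, required_shards):
    """True if cached intermediate data satisfies all three conditions."""
    return (
        entry is not None
        and entry["version"] == expected_version          # version matches task context
        and entry["ready"]                                 # readiness flag is completed
        and set(required_shards) <= set(entry["shards"])   # every required shard key present
    )

def retry_plan(cache, task_id, expected_version, required_shards):
    """Return 'resume' to continue from the breakpoint, else 'rerun'."""
    entry = cache.get(task_id)
    if is_complete(entry, expected_version, required_shards):
        return "resume"
    cache.pop(task_id, None)   # discard expired or incomplete intermediate data
    return "rerun"
```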

8. The campus report generation optimization system based on hierarchical data caching and parallel computing according to claim 1, characterized in that: The specific process of obtaining the aggregation result by adopting hierarchical sampling and identifying skew keys, and performing hierarchical scattering aggregation and operator processing is as follows: Based on the aggregated data of the intermediate aggregation cache layer, hierarchical sampling is performed according to the data sharding level. Samples are extracted for each aggregation dimension feature key in each shard according to the sampling ratio. The frequency of occurrence of each feature key in the samples is counted, and the actual data volume of each feature key is estimated according to the sampling ratio. The ratio of the actual data volume of each feature key to the average data volume of all aggregation dimension feature keys is calculated, and the ratio is recorded as the data skew judgment value. Skewed feature keys with data skew judgment values greater than the skew threshold are identified, and a skewed feature key dataset is constructed.
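The sampling-based skew detection in claim 8 can be sketched with each shard treated as one sampling stratum. Per-record Bernoulli sampling, the function name `skew_values`, and the fixed random seed are assumptions made for illustration.

```python
import random
from collections import Counter

def skew_values(shards, ratio, seed=0):
    """Estimate the data skew judgment value of every feature key.

    shards -- list of shards, each a list of feature keys (one per record)
    ratio  -- sampling ratio in (0, 1]
    """
    rng = random.Random(seed)
    estimated = Counter()
    for keys in shards:                   # each shard is sampled separately
        for key in keys:
            if rng.random() < ratio:      # keep the record with probability `ratio`
                estimated[key] += 1 / ratio   # scale the sample count back up
    if not estimated:
        return {}
    mean_volume = sum(estimated.values()) / len(estimated)
    return {key: volume / mean_volume for key, volume in estimated.items()}
```

Keys whose value exceeds the skew threshold would then form the skewed feature key dataset, for example `{k for k, v in skew_values(shards, 0.1).items() if v > threshold}`.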

9. The campus report generation optimization system based on hierarchical data caching and parallel computing according to claim 1, characterized in that: The specific process of integrating, verifying, generating reports, and storing them in the cache layer is as follows: The aggregation results obtained after processing the skewed feature keys are uniformly integrated. In accordance with the general statistical standards for campus reports, the integrated aggregated data undergoes multi-dimensional verification. This includes: verifying the completeness of the records for each dimension indicator by checking whether the difference between the actual number of records and the theoretical number of records, calculated from the dimension table cardinality and the time granularity, exceeds the missing rate threshold; verifying the data consistency between related dimensions by checking whether the deviation of logically related indicators between dimensions exceeds the consistency coefficient threshold; verifying whether the indicator values are within a reasonable range by comparing them with the upper and lower limit thresholds defined by the statistical standards; and verifying whether the time series data meets the continuity requirement by checking whether the fluctuation of the same indicator value between adjacent time periods exceeds the fluctuation threshold. If any data version is detected as not ready, the anomaly repair process is triggered, and data supplementation, recalculation, or recovery from the cache is performed according to the anomaly type. After successful verification, standardized campus report data is generated and stored in the report result cache layer. At the same time, the version number and readiness flag of the report data are updated.
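The four verification checks in claim 9 can be sketched as one function that returns the names of the checks that failed. The dictionary field names, the use of relative deviation for the consistency and continuity checks, and the single indicator series are illustrative assumptions.

```python
def failed_checks(report, limits):
    """Return the names of the verification checks that the report fails."""
    failures = []

    # Completeness: share of theoretically expected rows that are missing.
    missing = (report["expected_rows"] - report["actual_rows"]) / report["expected_rows"]
    if missing > limits["missing_rate"]:
        failures.append("completeness")

    # Consistency: relative deviation between two logically related totals.
    a, b = report["related_totals"]
    if abs(a - b) / max(abs(a), abs(b), 1e-12) > limits["consistency"]:
        failures.append("consistency")

    # Range: every indicator value must lie within the configured bounds.
    low, high = limits["value_range"]
    if any(not low <= value <= high for value in report["series"]):
        failures.append("range")

    # Continuity: relative change between adjacent time periods.
    for prev, cur in zip(report["series"], report["series"][1:]):
        if abs(cur - prev) / max(abs(prev), 1e-12) > limits["fluctuation"]:
            failures.append("continuity")
            break

    return failures
```

Here `expected_rows` stands for the theoretical record count, for example the dimension table cardinality multiplied by the number of time slots; an empty result means the report can be stored in the report result cache layer.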