Data aggregation method, and distributed system, computing device and readable storage medium

By introducing a key popularity weighting mechanism into the distributed system, the target key is dynamically selected for local or global aggregation, which solves the performance bottleneck in scenarios with mixed high and low cardinality data, achieves more efficient data aggregation and cache management, and improves the overall performance of the system.

WO2026124135A1 · PCT designated stage · Publication Date: 2026-06-18 · CLOUD INTELLIGENCE ASSETS HOLDING (SINGAPORE) PTE LTD +1

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
CLOUD INTELLIGENCE ASSETS HOLDING (SINGAPORE) PTE LTD
Filing Date
2025-11-18
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

In distributed systems, existing technologies struggle to dynamically and flexibly differentiate between high and low cardinality data to be aggregated, resulting in low local aggregation efficiency and excessive data transmission overhead. This is especially true in scenarios with mixed high and low cardinality, where data aggregation speed and efficiency cannot be effectively improved.

Method used

A key popularity weight mechanism is introduced: data to be aggregated is aggregated on the local node and the popularity weight of its key is updated there. The target key is dynamically selected based on the popularity weights. High-popularity data is kept on the local node for local aggregation, while low-popularity data is transferred to the central node for global aggregation. A hash table is used to manage the cache space and optimize the data processing flow.

Benefits of technology

It improves local aggregation efficiency, reduces unnecessary data transmission overhead, significantly enhances the data aggregation speed and overall performance of distributed systems, avoids excessive cache space usage, and improves the overall processing efficiency of the system.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN2025135676_18062026_PF_FP_ABST
Patent Text Reader

Abstract

Provided in the embodiments of the present disclosure are a data aggregation method, and a distributed system, a computing device and a readable storage medium. The data aggregation method is used for a local node of a distributed system, wherein the distributed system further comprises a central node. The method comprises: acquiring a plurality of pieces of data to be aggregated, wherein any piece of data to be aggregated corresponds to a first key; aggregating the plurality of pieces of data to be aggregated, so as to obtain local aggregated data, and updating the popularity weight of the first key; on the basis of the popularity weight of at least one key corresponding to each piece of data in the local node, determining a target key from the at least one key, wherein the at least one key comprises the first key; and transmitting target data, which corresponds to the target key, to a central node for aggregation, so as to obtain global aggregated data. The efficiency of local aggregation is improved, unnecessary data transmission overheads are reduced, and the speed and efficiency of data aggregation in a distributed system are significantly improved.

Description

Data aggregation method, distributed system, computing device, and readable storage medium

Technical Field

[0001] This disclosure relates to the technical field of databases and big data analysis, and in particular to a data aggregation method, a distributed system, a computing device, and a readable storage medium.

Background Technology

[0002] In the field of big data, with the rapid growth of data volume, traditional data aggregation methods face significant performance bottlenecks. This is especially true for large-scale data aggregation in distributed systems, where the memory access (I/O) overhead of data transmission between nodes becomes one of the performance bottlenecks.

[0003] Currently, to reduce memory access overhead and network bandwidth consumption, data aggregation in distributed systems typically employs a "two-stage aggregation" strategy: data is partially aggregated on local nodes before being transmitted to the central node for global aggregation. However, this strategy tends to yield good results only when dealing with high cardinality (high-frequency) data. In practice, high and low cardinality data are often mixed. Too much low cardinality data leads to insufficient local aggregation, which slows local aggregation, introduces additional aggregation overhead, and fails to effectively reduce data transmission overhead. A "one-stage aggregation" strategy, which aggregates globally directly, is more suitable in these situations. In real-time data aggregation scenarios on distributed systems, how to dynamically and flexibly distinguish between high and low cardinality data and execute the appropriate aggregation strategy is a problem that urgently needs to be solved.

Summary of the Invention

[0004] In view of the above, embodiments of this disclosure provide a data aggregation method. One or more embodiments of this disclosure also relate to another data aggregation method, a distributed system, a data aggregation apparatus, another data aggregation apparatus, a computing device, a computer-readable storage medium, and a computer program product, to address technical deficiencies in related technologies.

[0005] In one embodiment of this disclosure, a data aggregation method is provided, applied to a local node of a distributed system, the distributed system further including a central node. The method includes: acquiring multiple pieces of data to be aggregated, wherein any piece of data to be aggregated corresponds to a first key; aggregating the multiple pieces of data to be aggregated to obtain local aggregated data, and updating the popularity weight of the first key; determining a target key from at least one key according to the popularity weight of the at least one key corresponding to each piece of data in the local node, wherein the at least one key includes the first key; and transmitting the target data corresponding to the target key to the central node for aggregation to obtain global aggregated data.

[0006] Introducing a key popularity weight mechanism on local nodes captures how frequently each key is aggregated over a period of time. The target key is determined from at least one key based on the popularity weight of the at least one key corresponding to each piece of data in the local node, so that keys of high and low cardinality are distinguished dynamically and flexibly. The target data corresponding to the target key is transmitted to the central node for global aggregation to obtain global aggregated data. Executing the appropriate aggregation strategy in this way improves the efficiency of local aggregation, reduces unnecessary data transmission overhead, and significantly improves the speed and efficiency of data aggregation in the distributed system.

Attached Figure Description

[0007] Figure 1 is a first schematic diagram of a data aggregation method.

[0008] Figure 2 is a second schematic diagram of a data aggregation method.

[0009] Figure 3 is a third schematic diagram of a data aggregation method.

[0010] Figure 4 is a flowchart of a data aggregation method provided in an embodiment of this disclosure.

[0011] Figure 5 is a schematic diagram of a data aggregation method provided in an embodiment of this disclosure.

[0012] Figure 6 is a flowchart of another data aggregation method provided in an embodiment of this disclosure.

[0013] Figure 7 is a flowchart of a data aggregation method for Internet of Things data provided in an embodiment of this disclosure.

[0014] Figure 8 is a schematic diagram of the structure of a distributed system provided in an embodiment of this disclosure.

[0015] Figure 9 is a schematic diagram of the structure of a data aggregation device provided in an embodiment of this disclosure.

[0016] Figure 10 is a schematic diagram of another data aggregation device provided in an embodiment of this disclosure.

[0017] Figure 11 is a structural block diagram of a computing device provided in an embodiment of this disclosure.

Detailed Implementation

[0018] Numerous specific details are set forth in the following description to provide a full understanding of this disclosure. However, this disclosure can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this disclosure. Therefore, this disclosure is not limited to the specific implementations disclosed below.

[0019] The terminology used in one or more embodiments of this disclosure is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments of this disclosure. The singular forms “a,” “said,” and “the” as used in one or more embodiments of this disclosure and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” as used in one or more embodiments of this disclosure refers to and includes any or all possible combinations of one or more associated listed items.

[0020] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this disclosure, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this disclosure, and similarly, second may also be referred to as first. Depending on the context, the word “if” as used herein may be interpreted as “at the time of”, “when”, or “in response to a determination”.

[0021] Furthermore, it should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in one or more embodiments of this disclosure are all information and data authorized by the user or fully authorized by all parties. Moreover, the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.

[0022] First, the terms and concepts involved in one or more embodiments of this disclosure will be explained.

[0023] Data aggregation: The process of combining multiple data items into one or more meaningful summaries. This can be achieved through various methods, including statistical calculations, grouping, and summarizing. Taking SQL as an example, these include, but are not limited to: SELECT statement: used to specify the data columns to retrieve, which may include the results of aggregate functions; SUM function: used to calculate the sum of a column; COUNT function: used to count the rows, or the non-null values in a column; AVG function: used to calculate the average of a column; MAX function: used to return the maximum value in a column; MIN function: used to return the minimum value in a column; GROUP BY clause: used to group data, usually in conjunction with aggregate functions.

[0024] Hash-based aggregation algorithms are efficient data processing techniques commonly used in database systems for aggregation calculations, such as calculating sums, averages, and counts. They primarily utilize hash tables to quickly group and summarize data. The specific process is as follows: 1. Grouping Phase: In this phase, the database first scans the input data and generates a hash value for each aggregation key (such as the value of a column). This hash value is used to distribute the data into different buckets in the hash table. Each bucket corresponds to a specific aggregation key. 2. Aggregation Phase: For each bucket, the database calculates the aggregation result for that group, such as summation, counting, or other aggregation functions. This means that for rows with the same aggregation key, the database will perform the calculation in the corresponding bucket. 3. Output Result: Finally, the aggregation result is output as a new dataset, typically including the aggregation key and its corresponding aggregation function result.
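The three phases described above can be sketched as follows. This is a minimal illustration, not the implementation of this disclosure: it assumes Python's built-in dictionary as the hash table, and the function name `hash_aggregate`, the sample `orders` rows, and the choice of COUNT and SUM as aggregate functions are all hypothetical.

```python
from collections import defaultdict


def hash_aggregate(rows, key_fn, value_fn):
    """Group rows by aggregation key and keep a running COUNT and SUM per key."""
    # The dict plays the role of the hash table: one bucket per aggregation key.
    buckets = defaultdict(lambda: {"count": 0, "sum": 0})
    for row in rows:
        # Grouping phase: hashing the key locates the row's bucket.
        bucket = buckets[key_fn(row)]
        # Aggregation phase: update the aggregate values held in that bucket.
        bucket["count"] += 1
        bucket["sum"] += value_fn(row)
    # Output phase: emit each aggregation key with its aggregate results.
    return {key: dict(agg) for key, agg in buckets.items()}


# Hypothetical sample rows, for illustration only.
orders = [
    {"seller": "user1", "amount": 10},
    {"seller": "user4", "amount": 5},
    {"seller": "user1", "amount": 7},
]
result = hash_aggregate(orders, lambda r: r["seller"], lambda r: r["amount"])
```

In this sketch the grouping and aggregation phases are interleaved in a single pass over the input, which is a common way to realize the process; a real database engine may separate or pipeline them differently.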

[0025] Aggregation degree: A metric used to measure the effectiveness of aggregation. It is typically calculated by comparing the number of rows before and after aggregation. Given the number of rows R before aggregation and the number of rows C after aggregation, the aggregation degree G can be calculated using Formula 1:

G = R / C (Formula 1)

[0026] Where R is the number of rows before aggregation (the number of rows in the original data), and C is the number of rows after aggregation (the number of rows in the aggregated data). The larger the aggregation degree G is, the fewer rows the same amount of original data is aggregated into, which usually indicates a better aggregation effect.
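As a small worked illustration of Formula 1, the sketch below computes G for two hypothetical cases. The function name `aggregation_degree` and the row counts are assumptions chosen only to show the arithmetic.

```python
def aggregation_degree(rows_before, rows_after):
    """Return G = R / C, where R is the row count before aggregation
    and C is the row count after aggregation."""
    if rows_after <= 0:
        raise ValueError("rows_after must be positive")
    return rows_before / rows_after


# Hypothetical case 1: many rows collapse into few groups (high aggregation degree).
g_high = aggregation_degree(1_000_000, 20)

# Hypothetical case 2: every row has its own key, so nothing collapses (G = 1).
g_low = aggregation_degree(1_000, 1_000)
```

A value of G close to 1 means local aggregation removed almost no rows, which is the situation in which the local stage adds cost without reducing the data to be transmitted.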

[0027] Two-Phase Aggregation: A common database query optimization technique primarily used for efficient aggregation operations when dealing with large datasets. Its basic idea is to divide the aggregation process into two phases to reduce the size of intermediate result sets and lower computational costs. The two-phase aggregation process is as follows: 1. First Phase: In the first phase, the data is divided into multiple partitions. Each partition calculates the aggregation result locally (e.g., summation, count, average, etc.). This typically involves local aggregation for each partition, for example, calculating on each node's data first. The result of this phase is the local aggregation result for each partition, not the global aggregation result. 2. Second Phase: In the second phase, the local aggregation results from all partitions are aggregated together for the final global aggregation. This step usually requires transferring the results from the first phase from each partition to the master node, or performing centralized processing through other mechanisms. The final result is the global aggregation result required by the user.
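The two phases can be sketched as below, assuming a COUNT aggregation over single-character keys. The function names and the two short partition strings are hypothetical; in a real distributed system each call to `local_aggregate` would run on a different node and the partial results would be sent over the network.

```python
from collections import Counter


def local_aggregate(partition):
    """First phase: count occurrences of each key within one partition."""
    return Counter(partition)


def global_aggregate(partials):
    """Second phase: merge the per-partition partial counts into the final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


# Hypothetical partitions; each character stands for one row's aggregation key.
partitions = ["aaaabbbbbb", "cccaaaaa"]
partials = [local_aggregate(p) for p in partitions]
totals = global_aggregate(partials)

rows_before = sum(len(p) for p in partitions)       # rows held by the local nodes
rows_transferred = sum(len(p) for p in partials)    # rows sent to the second phase
```

Here 18 input rows shrink to 4 partial rows before transfer, which illustrates why the first phase pays off when keys repeat often within a partition.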

[0028] Two-stage aggregation offers the following advantages: 1. Reduced data transfer: The local aggregation result in the first stage is typically smaller than the original dataset, so only a smaller result set needs to be transferred to the second stage, reducing network bandwidth consumption. 2. High efficiency: By computing aggregation locally, computational resource consumption is reduced, improving overall aggregation efficiency, especially in distributed or parallel processing environments. Use cases for two-stage aggregation: It is commonly used in big data environments, distributed databases, and data warehouses, such as performing GROUP BY operations when executing SQL queries to improve performance and reduce resource usage. In practice, many modern database systems and big data processing frameworks implement similar mechanisms to optimize aggregation queries.

[0029] Figures 1, 2, and 3 illustrate an example of two-stage aggregation.

[0030] Figure 1 is a first schematic diagram of a data aggregation method. As shown in Figure 1, in a high cardinality data aggregation scenario, two batches of data to be aggregated, "aaaabbbbbbaaaccc…cccxxxxxx" and "cccaaaaaxxxxxbbb…bbaaaacc", are aggregated locally on two local nodes respectively, resulting in two sets of local aggregated data. Local aggregated data 1 is shown in Table 1: Table 1

[0031] Local aggregated data 2 is shown in Table 2: Table 2

[0032] Local aggregated data 1 and local aggregated data 2 are sent to two central nodes for aggregation, resulting in two global aggregated data sets: Global aggregated data 1 is shown in Table 3: Table 3

[0033] Global aggregated data 2 is shown in Table 4: Table 4

[0034] Based on Tables 1 to 4, it can be seen that most of the aggregation work is completed during local aggregation, and global aggregation only needs to perform simple aggregation on top of it. Therefore, when the aggregation degree of local aggregation is high, the amount of memory access is significantly reduced, improving overall performance.

[0035] Figure 2 is a second schematic diagram of a data aggregation method. As shown in Figure 2, in a low cardinality data aggregation scenario, two batches of data to be aggregated, "abcdef…xrule" and "nskly…znvpoz", are aggregated locally on two local nodes respectively, resulting in two sets of local aggregated data. Local aggregated data 1 is shown in Table 5: Table 5

[0036] Local aggregated data 2 is shown in Table 6: Table 6

[0037] Local aggregated data 1 and local aggregated data 2 are sent to two central nodes for aggregation, resulting in two global aggregated data sets: Global aggregated data 1 is shown in Table 7: Table 7

[0038] Global aggregated data 2 is shown in Table 8: Table 8

[0039] Based on Tables 5 to 8, it can be observed that little aggregation is accomplished during local aggregation, so global aggregation must perform most of the aggregation work. Therefore, when the aggregation degree of local aggregation is low, local aggregation not only fails to reduce the amount of memory access, but also lets low cardinality aggregation keys accumulate locally, which sharply increases local cache usage and further degrades overall performance. A better approach in this case is to send the low cardinality data to be aggregated to global aggregation directly, or as quickly as possible, avoiding local accumulation.

[0040] Figure 3 illustrates a third schematic diagram of a data aggregation method, showing a data aggregation scenario with a mixture of high and low cardinality. Two data sets to be aggregated in a batch, "aaaabbbbbbaaaccc…cccxxxxxx" and "cccaaaaaxxxxxbbb…bbaaaacc", are aggregated locally on two separate local nodes, resulting in two locally aggregated data sets. Locally aggregated data set 1 is shown in Table 9. Table 9

[0041] Local aggregated data 2 is shown in Table 10: Table 10

[0042] Local aggregated data 1 and local aggregated data 2 are sent to two central nodes for aggregation, resulting in two global aggregated data sets: Global aggregated data 1 is shown in Table 11: Table 11

[0043] Global aggregated data 2 is shown in Table 12: Table 12

[0044] According to Tables 9 to 12, it can be found that neither degenerating to direct global aggregation nor continuing two-stage aggregation achieves the desired performance. A mechanism is needed to distinguish between high cardinality and low cardinality aggregation keys.

[0045] In most real-time scenarios, it's impossible to determine the cardinality of the data items corresponding to each aggregation key. One approach is to use statistical information in the database, employing techniques such as histograms, to summarize and estimate the cardinality of each aggregation key. However, when the data volume is too large, data updates are very frequent, or data updates are irregular, this statistical method fails to obtain an accurate cardinality for aggregation keys, especially when the aggregation keys are complex (multi-level keys) or when the aggregation key is the result of a function calculation of other aggregation keys. Therefore, it's impossible to set an accurate cardinality threshold for aggregation keys as a criterion for determining whether to execute a "one-stage aggregation" or "two-stage aggregation" strategy.

[0046] To address the aforementioned issues, this disclosure provides a data aggregation method, and also relates to another data aggregation method, a distributed system, a data aggregation device, another data aggregation device, a computing device, a computer-readable storage medium, and a computer program product, which will be described in detail in the following embodiments.

[0047] Referring to Figure 4, which shows a flowchart of a data aggregation method provided in an embodiment of this disclosure, the method is applied to local nodes of a distributed system, which also includes a central node, and includes the following specific steps.

[0048] Step 402: Obtain multiple pieces of data to be aggregated, where any piece of data to be aggregated corresponds to a first key.

[0049] The embodiments disclosed herein can be applied to various scenarios that use distributed systems for data aggregation, such as real-time data analysis (e.g., traffic management, advertising, smart cities), Internet of Things applications (e.g., smart homes, industrial monitoring), operation and maintenance management, and big data processing (e.g., search engines, video websites, social networking sites). The examples in the embodiments of this disclosure do not limit the application scenarios of the embodiments of this disclosure.

[0050] A distributed system is a system composed of multiple server nodes that transmit data over a network. In a distributed system, each server node can run programs independently or collaborate through message passing mechanisms to achieve the overall functionality of the distributed system. Distributed systems are highly scalable and reliable. In big data processing scenarios, distributed systems are used to process large amounts of data, accelerating data processing through parallel computing and distributed storage. One example is the Hadoop distributed system. Local nodes are server nodes in a distributed system responsible for performing local aggregation operations. Local nodes possess certain computing and storage resources to perform local aggregations, reducing the amount of data transmitted within the distributed system and thus improving overall system performance. For example, a DataNode in a Hadoop distributed system is responsible for storing and processing some data. A central node is a server node in a distributed system responsible for performing global aggregation operations. The central node coordinates and manages the local nodes in the distributed system, receiving local aggregation results (intermediate results) or data to be aggregated from each local node and performing global aggregation. For example, the NameNode in a Hadoop distributed system is responsible for managing and coordinating the file system of the entire Hadoop cluster. It should be noted that local nodes are server nodes used to aggregate partial data, while central nodes are server nodes used to aggregate the local aggregated data produced by multiple local nodes. The two may not be fundamentally different in terms of physical aspects such as hardware and software resources and architecture; they differ only in their roles and responsibilities within the distributed system. They are a pair of relative concepts that work together to complete the two-stage aggregation strategy.

[0051] Multiple data items to be aggregated are data items that need to be aggregated on the local node. These multiple data items have corresponding aggregation keys and can belong to the same batch. For example, order data generated on a local node from 19:00 to 19:30 includes three aggregation keys: Merchant ID, Order Number, and Order Amount.

[0052] The first key is the aggregation key corresponding to any data to be aggregated. The aggregation key generally represents the category or data type of the data to be aggregated. Multiple data items can be aggregated into key-value pairs with a higher degree of aggregation using the aggregation key. For example, any one of the three aggregation keys (merchant ID, order number, and order amount) can be used as the first key.

[0053] One possible way to obtain multiple pieces of data to be aggregated is to read them from a data source. This data source can be a database, file system, message queue, etc. For example, a batch of order data is read from the Hadoop Distributed File System (HDFS), where fields such as merchant ID, order number, and order amount can be used as aggregation keys. Another possible way is to receive the data to be aggregated via a message queue. Message queues such as Kafka and RabbitMQ can be used for processing real-time data streams; for example, a batch of order data is consumed from Kafka. A third possible way is to pull the data to be aggregated from a data interface. This data interface can be an Application Programming Interface (API) for obtaining data from external systems, such as a RESTful API or GraphQL.

[0054] For example, an e-commerce platform's order data is processed in real time on a Hadoop distributed system. This Hadoop distributed system includes two central nodes and four local nodes. On one of the local nodes, 1,000,000 order data entries are read from the Hadoop Distributed File System (HDFS). Each order data entry contains a merchant ID, order number, and order amount.

[0055] Multiple data sets to be aggregated are obtained, providing data support for subsequent local aggregation. Each data set to be aggregated corresponds to a first key, providing data support for subsequent updates to the popularity weight.

[0056] Step 404: Aggregate multiple data sets to obtain local aggregated data, and update the popularity weight of the first key.

[0057] Locally aggregated data refers to data items generated after aggregating multiple data sets on a local node. It is an intermediate result of a two-stage aggregation process. Locally aggregated results have a higher degree of aggregation compared to the multiple data sets to be aggregated. They include the aggregation key and its corresponding aggregation function value, presented as key-value pairs. Compared to directly transmitting multiple data sets to the central node for global aggregation, the local aggregation result performs preliminary aggregation before being transmitted to the central node for global aggregation. This reduces the amount of data transmitted over the network and improves the overall performance of the distributed system.

[0058] The "heat weight" of the first key is a quantitative indicator that measures the aggregation frequency of the data item corresponding to the first key over a period of time. This heat weight reflects the activity level of the first key. Some keys may be more important due to large data volume or high generation frequency, while others may be relatively inactive. By introducing a heat weight mechanism, compared to a fixed cardinality threshold, the data aggregation strategy of keys can be dynamically and flexibly evaluated and adjusted, thereby optimizing the data processing flow and improving system performance. Heat weights include, but are not limited to: total aggregation steps, interval steps, data volume, and other indicators (such as order amount, user access frequency, etc., depending on specific needs).

[0059] It should be noted that data aggregation has corresponding statements and aggregate functions, which, taking SQL as an example, include but are not limited to: the SELECT statement, the SUM, COUNT, AVG, MAX, and MIN functions, and the GROUP BY clause.

[0060] To aggregate multiple pieces of data and obtain local aggregated data, one option is to perform aggregation function calculations on the data. Another option is to use a hash table to aggregate the data. For example, create a hash table to store the aggregation results, iterate through the data, compute a hash value from the first key of each data item, assign the item to the corresponding bucket in the hash table, calculate the aggregation function value in each bucket, and generate the local aggregation result. A third option is a combination of the two methods mentioned above, which is not limited here.

[0061] To update the popularity weight of the first key, one option is to add a step count (fixed value) to the popularity weight of the first key, which is usually 1. Another option is to add an aggregation interval step count (non-fixed value) to the popularity weight of the first key. Yet another option is to add the amount of data to be aggregated corresponding to the first key to the popularity weight of the first key. This is not limited here.
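The three update options above can be sketched as follows. This is only an illustration under stated assumptions: the function name `update_weight`, the `mode` labels, and all numeric values are hypothetical, and the meaning of the interval step count is taken to be a caller-supplied number because the text does not define how it is computed.

```python
from collections import defaultdict


def update_weight(weights, key, mode="fixed", step=1, interval_steps=0, data_volume=0):
    """Increase the popularity weight of `key` using one of three options."""
    if mode == "fixed":
        # Option 1: add a fixed step count, usually 1, per aggregation.
        weights[key] += step
    elif mode == "interval":
        # Option 2: add a non-fixed aggregation interval step count (assumed to be given).
        weights[key] += interval_steps
    elif mode == "volume":
        # Option 3: add the amount of data aggregated for this key.
        weights[key] += data_volume
    else:
        raise ValueError(f"unknown mode: {mode}")
    return weights[key]


weights = defaultdict(int)
update_weight(weights, "user1")                                 # fixed step
update_weight(weights, "user1")                                 # fixed step again
update_weight(weights, "user4", mode="volume", data_volume=300)
update_weight(weights, "user9", mode="interval", interval_steps=3)
```

A deployment would normally pick one option and apply it to every key so that the weights remain comparable; the sketch mixes them across keys only to show each branch.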

[0062] For example, an aggregate query statement (aggregate function) is: SELECT seller_number, COUNT(order_number) AS number_of_orders FROM order_table GROUP BY seller_number;

[0063] Create a hash table on the local node to store the local aggregation results. Iterate through the 1,000,000 order data entries, hash the first key "seller ID" of each entry, and assign the entry to the corresponding bucket in the hash table. Perform the aggregation function calculation in each bucket to obtain the local aggregated data as follows: seller ID 'user1', number of orders 20,000; seller ID 'user4', number of orders 300; seller ID 'user9', number of orders 20, and so on. Increment the total number of aggregation steps for each first key "seller ID" ('user1', 'user4', 'user9', etc.) by 1.

[0064] Aggregating the multiple pieces of data to be aggregated into local aggregated data provides intermediate results for subsequent global aggregation. This realizes two-stage aggregation, which reduces data transmission volume and network bandwidth usage, lowers computational resource consumption, and improves overall aggregation efficiency. Updating the popularity weight of the first key provides the weight basis for subsequently determining the target key from at least one key.

[0065] Step 406: Determine the target key from at least one key based on the popularity weight of the at least one key corresponding to each piece of data in the local node, wherein the at least one key includes the first key.

[0066] Each data item in a local node is a data item stored and processed on the local node, including data to be aggregated and / or locally aggregated data. If the local node has a data cache, this data includes not only the data to be aggregated currently requiring aggregation and the aggregated local data, but also previous data to be aggregated and / or aggregated local data.

[0067] Each data item corresponds to at least one key, which is the aggregation key for each data item (data to be aggregated or local aggregated data). The aggregation key generally represents the category or data type of the data to be aggregated. Multiple data items to be aggregated can be aggregated into key-value pairs with a higher degree of aggregation using the aggregation key.

[0068] The popularity weight of at least one key is a quantitative indicator measuring the aggregation frequency of data items corresponding to at least one key over a period of time. Popularity weights include, but are not limited to: total aggregation steps, interval steps, data volume, and other indicators (such as order amount, user access frequency, etc., depending on specific needs). The target key is the aggregation key corresponding to the data item that needs to be globally aggregated, determined based on the popularity weight of each data item corresponding to at least one key. Since the target key is selected based on the popularity weight of at least one key, and the popularity weight is updated after local aggregation, the target key is dynamic and flexible. For example, data items with high popularity weight are not transmitted to the central node for global aggregation, but are retained on the local node to avoid repeated transmission when retrieving data items for aggregation. On the other hand, data items with low popularity weight are directly transmitted to the central node for global aggregation. By retaining the aggregation key with high popularity weight, it can be ensured that important data with high aggregation needs over a period of time is prioritized for processing on the local node.

[0069] Based on the popularity weight of each data item in the local node corresponding to at least one key, the target key is determined from the at least one key. One possible approach is to select the target key in ascending order of popularity weight, so that the keys with the lowest weights are chosen first.

[0070] It should be noted that this determination can be triggered by a window, including but not limited to a time window and a spatial window. When a preset time window is triggered, or when the spatial window of the cache space on the local node is reached, the target key is determined from the at least one key based on the popularity weight of each data item in the local node corresponding to the at least one key.

[0071] For example, when the hash buckets cannot be expanded and the hash table overflows, the total number of aggregation steps is determined based on the 20 keys "seller ID" ('user1', 'user2', ... 'user20') corresponding to each data item in the local node (data to be aggregated and local aggregated data): Seller ID 'user1', total aggregation steps 30000; Seller ID 'user2', total aggregation steps 50000; ... Seller ID 'user20', total aggregation steps 45000. The target key is determined from the 20 keys in ascending order: Seller ID 'user1'.
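The ascending-order selection in this example can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `select_target_keys` and the parameter `k` are hypothetical, and only the three seller IDs whose values are stated in the example are included.

```python
import heapq

def select_target_keys(weights, k=1):
    """Return the k keys with the lowest popularity weight, lowest first."""
    return heapq.nsmallest(k, weights, key=weights.get)

# Only three of the 20 seller IDs are spelled out in the example;
# the remaining ones are elided there and are not invented here.
total_steps = {"user1": 30000, "user2": 50000, "user20": 45000}
targets = select_target_keys(total_steps)
```

With these values the single selected key is 'user1', which agrees with the example.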

[0072] Based on the popularity weight of each data item in the local node corresponding to at least one key, the target key is determined from the at least one key, which includes the first key. This dynamically and flexibly distinguishes high-cardinality keys from low-cardinality keys and provides support for the subsequent transmission of target data to the central node.

[0073] Step 408: Transfer the target data corresponding to the target key to the central node for aggregation to obtain global aggregated data.

[0074] The target data corresponding to the target key is the data item on the local node corresponding to the target key, including the data to be aggregated and / or the local aggregated data. If the local node has data caching, this data includes not only the currently aggregated local aggregated data, but also the previous data to be aggregated and / or the aggregated local aggregated data.

[0075] Global aggregated data refers to data items generated after data aggregation of target data on the central node. It is the final result of two-stage aggregation. The global aggregated result is obtained by further data aggregation based on the target data transmitted from at least one local node. The global aggregated result has a higher degree of aggregation compared to multiple data to be aggregated on local nodes and local aggregation results. The global aggregated result contains the aggregation key and the aggregation function value corresponding to the aggregation key, and is represented in the form of key-value pairs.

[0076] The data is transmitted to the central node for aggregation to obtain globally aggregated data. One possible method for global aggregation at the central node is to perform aggregation function calculations on the target data. Another possible method is to use a hash table to aggregate the target data. A third possible method is a combination of the two methods mentioned above, which is not limited here.

[0077] For example, all data items of the target key (seller ID 'user1') are retrieved from their respective hash buckets and transmitted to two central nodes. On the two central nodes, all data items of the target key (seller ID 'user1') from the four local nodes are merged, and an aggregation function is applied to obtain the global aggregated data: seller ID 'user1', number of orders 100,000.
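The merge performed on the central node can be sketched as a sum of per-key partial results. The sketch assumes the aggregation function is a count; the function name `global_aggregate` and the per-node split of the 100,000 orders are hypothetical, since the example gives only the final total.

```python
from collections import Counter

def global_aggregate(partials):
    """Sum the per-key partial counts received from the local nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)

# Hypothetical split across four local nodes, chosen only so that the
# parts add up to the 100,000 orders stated in the example.
partials = [{"user1": 25000}, {"user1": 30000},
            {"user1": 20000}, {"user1": 25000}]
result = global_aggregate(partials)
```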

[0078] In this embodiment of the disclosure, a key popularity weighting mechanism is introduced on the local node to reflect how frequently each key is aggregated over a period of time. Based on the popularity weight of each data item in the local node corresponding to at least one key, the target key is determined from the at least one key, so that high-cardinality and low-cardinality keys are distinguished dynamically and flexibly. The target data corresponding to the target key is transmitted to the central node for global aggregation to obtain global aggregated data. Executing a suitable aggregation strategy for each key in this way improves the efficiency of local aggregation, reduces unnecessary data transmission overhead, and significantly improves the speed and efficiency of data aggregation in the distributed system.

[0079] Another problem that can easily arise in two-stage aggregation is that a large number of data items accumulate on the local node, which seriously occupies the cache space of the local node, causing a sharp drop in data access efficiency and affecting the effect of local aggregation.

[0080] To address the aforementioned issues, in one optional embodiment of this disclosure, the local node includes a cache space; after step 402, the following specific steps are further included: writing multiple data to be aggregated into the cache space; correspondingly, step 406 includes the following specific steps: when the size of the cache space reaches a preset threshold, determining the target key from at least one key according to the heat weight of each data in the cache space corresponding to at least one key; correspondingly, step 408 includes the following specific steps: retrieving the target data corresponding to the target key from the cache space and transmitting it to the central node for aggregation to obtain global aggregated data.

[0081] The cache space is a memory area on the local node used for temporary data storage. The cache space can be a hash table, hash bucket, or other data structure used to store data to be aggregated and local aggregation results. The main purpose of the cache space is to reduce frequent data read and write operations and improve data processing efficiency. The cache space stores the data items on the local node that currently need to be aggregated; this can be data to be aggregated directly from the data source or locally aggregated data. The cache space is designed to reduce the amount of data transmitted over the network and improve the overall performance of the system. The cache space can correspond to a range of memory addresses on the local node. The size of the cache space is the amount of data currently stored in the cache space and is dynamically changing. When the size of the cache space reaches a preset threshold, a cache eviction mechanism is triggered to determine the target key, and then the target data corresponding to the target key is transferred to the central node for global aggregation.

[0082] The preset threshold is a critical value set for the cache space. When the amount of data in the cache space reaches or exceeds this critical value, the cache eviction mechanism will be triggered. Setting the preset threshold appropriately can balance memory usage and data processing efficiency. Generally, the size of this threshold is set according to the size of the processor cache. For example, if the preset threshold size is M, it generally does not exceed the size of the L3 cache, that is, less than 16MB.

[0083] One option for writing multiple pieces of data to be aggregated into the cache space is to write them into the cache subspace of the cache space.

[0084] The condition that the size of the cache space reaches a preset threshold can be interpreted in several ways. One possibility is that the size of every cache subspace of the cache space reaches the preset threshold. Another is that the size of any one cache subspace reaches the preset threshold. Yet another is that the total size of all cache subspaces reaches the preset threshold. No further restrictions are imposed here.

[0085] One possible way to retrieve the target data corresponding to the target key from the cache space is to retrieve the target data corresponding to the target key from the cache subspace of the cache space.

[0086] For example, a hash table is created on the local node to store the data to be aggregated, and 1,000,000 order data entries are written into its hash buckets. The seller ID of each order is used as the hash key, and the order number, order amount, and number of orders are used as the value. The preset threshold for the hash table size is 16MB. When the hash table reaches 16MB and is about to overflow, the cache eviction mechanism is triggered. The total number of aggregation steps is determined for each data item (data to be aggregated and local aggregated data) corresponding to the 20 "seller ID" keys ('user1', 'user2', ... 'user20'): seller ID 'user1', total aggregation steps 30,000; seller ID 'user2', total aggregation steps 50,000; ... seller ID 'user20', total aggregation steps 45,000. The target key is determined from the 20 keys in ascending order: seller ID 'user1'. All data items for the target key (seller ID 'user1') are retrieved from their respective hash buckets and transmitted to two central nodes. On the two central nodes, all data items for the target key (seller ID 'user1') from the four local nodes are merged, and an aggregation function is applied to obtain the global aggregated data: seller ID 'user1', number of orders 100,000.
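The threshold-triggered eviction described above can be sketched as follows. Several simplifications are assumptions of this sketch and not part of the disclosure: the threshold is expressed as a number of cached keys (`max_entries`) instead of a byte size such as 16MB, the popularity weight is the total number of aggregation steps only, one key is evicted per trigger, and the `send` callback stands in for transmission to the central node.

```python
class LocalAggregator:
    def __init__(self, max_entries, send):
        self.max_entries = max_entries  # stand-in for the byte-size threshold
        self.send = send                # hypothetical hook to the central node
        self.cache = {}                 # key -> [order_count, total_steps]

    def add(self, key, orders=1):
        entry = self.cache.setdefault(key, [0, 0])
        entry[0] += orders              # local aggregation of the value
        entry[1] += 1                   # popularity weight: total aggregation steps
        if len(self.cache) >= self.max_entries:
            self._evict()

    def _evict(self):
        # The key with the lowest weight becomes the target key.
        target = min(self.cache, key=lambda k: self.cache[k][1])
        orders, _ = self.cache.pop(target)
        self.send(target, orders)

sent = []
agg = LocalAggregator(max_entries=3, send=lambda k, v: sent.append((k, v)))
for key in ["user1", "user2", "user2", "user2", "user1", "user3"]:
    agg.add(key)
```

In this run the rarely seen 'user3' is shipped out as soon as the cache fills, while the frequently aggregated 'user1' and 'user2' stay on the local node.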

[0087] In this embodiment of the disclosure, the caching mechanism is optimized in a fine-grained manner. Based on the introduction of a popularity weight mechanism, when the size of the cache space reaches a preset threshold, the cache space is cleaned up according to the popularity weight, and the data items corresponding to the aggregation keys with high popularity weights are intelligently retained, thereby improving cache utilization and reducing unnecessary memory accesses, and further improving the efficiency of local aggregation.

[0088] In one optional embodiment of this disclosure, the cache space is pre-divided into multiple cache subspaces, each of which has a corresponding space capacity; writing multiple data to be aggregated into the cache space on the local node includes the following specific steps: writing multiple data to be aggregated into the target cache subspace where the first key is located; and expanding the target cache subspace when the space size of the target cache subspace reaches the corresponding space capacity.

[0089] A cache subspace is a sub-region of the cache space. A cache subspace can be a hash bucket or other data structure, used to store data to be aggregated corresponding to a specific aggregation key or local aggregation results. The design of cache subspaces aims to better manage and optimize the use of cache space, improving data processing efficiency. In a distributed system, a cache subspace can be used to store data items corresponding to at least one aggregation key; data items corresponding to the same aggregation key are generally stored in one cache subspace. Each cache subspace has an independent space capacity, which can be dynamically adjusted according to data characteristics and needs. By allocating data items to different cache subspaces, data can be managed more effectively, avoiding over-occupancy of a single cache space. A cache subspace can correspond to a range of memory addresses on a local node. Each cache subspace records the aggregation key, aggregation function value, and popularity weight of the data item. The size of a cache subspace is the amount of data currently stored in it and changes dynamically. When the size of a cache subspace reaches its space capacity, a cache expansion mechanism is triggered to increase that capacity. The target cache subspace is the cache subspace in which the first key is located; data items corresponding to the first key are generally stored there. The capacity of a cache subspace is the upper limit of the amount of data that can be stored in it. This capacity is determined during pre-allocation and can be adjusted according to storage conditions.

[0090] For example, 1,000,000 order data entries are written into 8 hash buckets, each containing an aggregation key (merchant ID) in the hash table. The merchant ID for each order is used as the hash key, and the number of orders and popularity weight are used as the value. The preset threshold for the hash table size is 16MB, and the capacity of any hash bucket is 1MB. When the data volume in the hash table reaches 8MB and is about to overflow, a resizing mechanism is triggered: the capacity of the hash bucket is multiplied by 2.
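The doubling of a full hash bucket can be sketched as below. The sketch makes several assumptions that the example does not state: capacity is counted in distinct keys instead of bytes, keys are mapped to buckets with `zlib.crc32`, and expansion happens at the moment a new key arrives in a bucket that is already full. The class names `Subspace` and `SubspaceCache` are hypothetical.

```python
import zlib

class Subspace:
    def __init__(self, capacity):
        self.capacity = capacity  # upper limit on distinct keys (stand-in for 1MB)
        self.entries = {}         # key -> aggregated order count

class SubspaceCache:
    def __init__(self, num_subspaces=8, capacity=2):
        self.subspaces = [Subspace(capacity) for _ in range(num_subspaces)]

    def subspace_for(self, key):
        # crc32 gives a hash that is stable across runs, unlike hash() on str.
        return self.subspaces[zlib.crc32(key.encode()) % len(self.subspaces)]

    def write(self, key, orders=1):
        sub = self.subspace_for(key)
        if key not in sub.entries and len(sub.entries) >= sub.capacity:
            sub.capacity *= 2     # expansion: capacity multiplied by 2
        sub.entries[key] = sub.entries.get(key, 0) + orders

# A single subspace keeps the demonstration deterministic.
cache = SubspaceCache(num_subspaces=1, capacity=2)
for key in ["user1", "user2", "user3"]:
    cache.write(key)
```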

[0091] In this embodiment of the disclosure, by dividing the cache space into multiple cache subspaces and setting an independent space capacity for each cache subspace, data storage can be managed more finely. When the space size of the target cache subspace reaches its space capacity, it is dynamically expanded to ensure efficient storage and processing of data. This not only avoids excessive occupation of a single cache space and improves cache utilization, but also reduces frequent data access operations, significantly improving the efficiency of local aggregation and the overall performance of the system.

[0092] In one optional embodiment of this disclosure, the popularity weight of the first key includes the total number of aggregation steps and the interval steps since the last aggregation; updating the popularity weight of the first key in step 404 includes the following specific steps: if the first key is appearing for the first time, initialize the total number of aggregation steps and the interval steps of the first key; if the first key is not appearing for the first time, increment the total number of aggregation steps of the first key by one, and update the interval steps of the first key to the current number of aggregation steps, wherein the current number of aggregation steps is incremented by one after aggregating the data to be aggregated corresponding to any one of the at least one key.

[0093] The total number of aggregation steps represents the cumulative number of times the first key has been executed by aggregation operations over a period of time. It is an important indicator of the activity level of this aggregation key, reflecting the aggregation frequency of its data items. The total number of aggregation steps can identify which aggregation keys have high cardinality and which have low cardinality. The interval steps represent the number of times between the last and current aggregation operations performed on the first key. It is an important indicator of the activity interval of this aggregation key between two aggregations, reflecting which aggregation keys are continuously active and which are intermittently active. The current aggregation steps represent the total number of aggregation operations executed globally. The current aggregation steps are a global counter used to track the progress of aggregation operations throughout the system.

[0094] It should be noted that, in order to efficiently manage and compare the popularity weights, the total number of aggregation steps and the interval steps can be compressed into a single long integer popularity weight. The popularity weight is calculated as shown in Formula 2:
Popularity weight = ((long) total number of aggregation steps << 32) | interval steps    (Formula 2)

[0095] Formula 2 combines two 32-bit integer values (the total aggregation steps and the interval steps) into a single 64-bit long integer value, with the total aggregation steps in the high 32 bits. As a result, a larger total aggregation steps value (i.e., more repetitions) yields a higher popularity weight, and between keys with equal totals, a larger interval steps value (i.e., a more recent update) yields the higher weight.
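Formula 2 can be illustrated with a short sketch. The helper names `pack_weight` and `unpack_weight` are hypothetical; the masks are added because Python integers are unbounded, so the 32-bit width of each field has to be enforced explicitly.

```python
MASK32 = 0xFFFFFFFF

def pack_weight(total_steps, interval_steps):
    """Place total_steps in the high 32 bits and interval_steps in the low 32 bits."""
    return ((total_steps & MASK32) << 32) | (interval_steps & MASK32)

def unpack_weight(weight):
    """Recover (total_steps, interval_steps) from a packed weight."""
    return weight >> 32, weight & MASK32

weight = pack_weight(3, 7)
```

Because the total occupies the high bits, comparing two packed weights as plain integers compares the totals first and falls back to the interval steps only on a tie.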

[0096] Optionally, the maximum value `max_count` and the minimum value `min_count` of the total aggregation steps are maintained in real time. These values are reset each time a cache cleanup operation is performed.

[0097] It should be noted that, by taking into account both the total number of aggregation steps and the interval steps in this embodiment, hot values of the aggregation column can be prevented from occupying the cache for a long time. Even if a value is repeated many times, if all rows with the same key value have already been aggregated, or if they are distributed far apart and scattered in the table, a popularity weight that considers both the total number of aggregation steps and the interval steps ensures that these values will not occupy the cache space for a long time, thereby achieving a better local aggregation effect.

[0098] For example, if the first key is appearing for the first time, initialize the total number of aggregation steps for the first key: count = 1, initialize the interval step for the first key: last_update_step = 0. If the first key is not appearing for the first time, increment the total number of aggregation steps for the first key: count = count + 1, update the interval step for the first key to the current aggregation step: last_update_step = current_step, where current_step is the total number of steps counter, and aggregation operations for any aggregation key will trigger current_step + 1.
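The update rule in this example can be sketched as follows. The class name `PopularityTracker` is hypothetical, and the sketch assumes that `current_step` is advanced before it is copied into `last_update_step`; the text leaves the exact ordering of those two operations open.

```python
class PopularityTracker:
    def __init__(self):
        self.current_step = 0  # global counter over all aggregation operations
        self.stats = {}        # key -> [count, last_update_step]

    def record(self, key):
        # Any aggregation, for any key, advances the global counter.
        self.current_step += 1
        if key not in self.stats:
            self.stats[key] = [1, 0]            # first appearance
        else:
            entry = self.stats[key]
            entry[0] += 1                       # count = count + 1
            entry[1] = self.current_step        # last_update_step = current_step

tracker = PopularityTracker()
for key in ["a", "b", "a"]:
    tracker.record(key)
```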

[0099] In this embodiment of the disclosure, the total number of aggregation steps of the first key can reflect the activity level of the aggregation key, while the interval number of the first key can measure the activity interval of the aggregation key. The combination of the two not only helps to identify high cardinality and low cardinality aggregation keys, but also ensures that even for data items that are repeated many times but are scattered, resources will not be wasted due to long-term cache occupation.

[0100] For a given aggregation key, it is generally believed that two-stage aggregation will only yield good performance gains if the aggregation degree of local aggregation exceeds G (see Formula 1). Otherwise, it is better to directly transmit to the central node for global aggregation. A cache cleanup strategy can be adopted: In one optional embodiment of this disclosure, a cache cleanup strategy is provided. The heat weight includes the total number of aggregation steps. When the cache space size reaches a preset threshold, the target key is determined from at least one key based on the heat weight of each data item corresponding to at least one key in the cache space. This includes the following specific steps: When the cache space size reaches a preset threshold, the quantile of the total number of aggregation steps is determined based on a preset aggregation degree; based on the quantile of the total number of aggregation steps, the total number of aggregation steps for at least one key in the cache space is traversed to determine the target key from at least one key.

[0101] The preset aggregation degree is a threshold set to measure the aggregation effect of two-stage aggregation. It is determined based on the amount of data before and after aggregation, as shown in Formula 1. When the aggregation degree exceeds this value after aggregating data on the local node, two-stage aggregation is more advantageous than direct global aggregation. The preset aggregation degree is used to evaluate the effectiveness of local aggregation. If the local aggregation degree of an aggregation key exceeds the preset aggregation degree, it means that the aggregation key has been sufficiently aggregated on the local node, which can reduce the amount of data transferred and thus improve the overall system performance. Conversely, if the aggregation degree does not reach the preset aggregation degree, it is more efficient to directly transfer the data to the central node for global aggregation. For example, if the preset aggregation degree G is set to 6, it means that if the aggregation degree (i.e., the total number of aggregation steps) of an aggregation key on the local node exceeds 6, then local aggregation is considered beneficial. If the total number of aggregation steps for an aggregation key is 7, then the data items of that aggregation key can be aggregated on the local node and then transferred to the central node for global aggregation. If the total number of aggregation steps is 5, then it may be more appropriate to directly transfer the data to the central node for global aggregation.

[0102] The aggregate total steps quantile is the value at a certain percentile among the total aggregation steps of at least one key. The aggregate total steps quantile is used to determine which aggregation keys have low aggregate total steps, thus deciding whether the data items corresponding to these keys need to be cleared. In the cache cleanup strategy, the aggregate total steps quantile determines which aggregation key data items should be cleared and directly transmitted to the central node for global aggregation. By calculating the aggregate total steps quantile, aggregation keys with low aggregation degrees can be dynamically identified, preventing these keys from occupying cache space for a long time, thereby improving cache utilization and overall system performance. For example, if the aggregate total steps of aggregation keys are 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100, and the preset aggregation degree G is set to 6, then (G-1) / G needs to be calculated as the aggregate total steps quantile, which is the 83rd percentile. 0.83 × 10 = 8.3, taking the value at the 9th position, the aggregate total steps quantile is 90. Data items corresponding to aggregation keys (10, 20, 30, 40, 50, 60, 70, 80) with a total aggregation step count below 90 will be cleared and directly transmitted to the central node for global aggregation.
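The worked example above can be reproduced with an exact, sort-based computation. The function name `exact_quantile_cutoff` is hypothetical, and rounding the position up to the next whole rank is an assumption taken from the example (8.3 becomes the 9th position).

```python
def exact_quantile_cutoff(total_steps, g):
    """Return the value at the (g-1)/g quantile, rounding the rank up."""
    ordered = sorted(total_steps)
    n = len(ordered)
    # Integer form of ceil((g - 1) / g * n); rank is 1-based.
    rank = max(1, ((g - 1) * n + g - 1) // g)
    return ordered[rank - 1]

counts = list(range(10, 101, 10))          # 10, 20, ..., 100
cutoff = exact_quantile_cutoff(counts, g=6)
to_clear = [c for c in counts if c < cutoff]
```

Here `cutoff` is 90 and `to_clear` holds the counts 10 through 80, as in the example.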

[0103] In this embodiment of the disclosure, by introducing a preset aggregation degree and a total aggregation step quantile, data items in the cache space are dynamically managed to ensure that data corresponding to keys with high aggregation degree are processed on the local node first, while data corresponding to keys with low aggregation degree are directly transmitted to the central node, thereby optimizing cache utilization, reducing unnecessary data transmission, and significantly improving the overall performance and data processing efficiency of the system.

[0104] Based on the preset aggregation degree, the quantile of the total aggregation steps is determined. If a precise quantile is needed, sorting is generally required, which costs O(N log N) for a typical comparison sort and up to O(N^2) in the worst case, and this is detrimental to local aggregation performance. Therefore, an estimation method can be used to determine the quantile of the total aggregation steps.

[0105] In one optional embodiment of this disclosure, determining the quantile of the total aggregation steps according to a preset aggregation degree includes the following specific steps: dividing the numerical range of the total aggregation steps of at least one key in the cache space to obtain multiple numerical sub-ranges; counting the number of total aggregation steps in the multiple numerical sub-ranges, and estimating the target numerical sub-range where the quantile of the total aggregation steps is located based on the number and the preset aggregation degree; and determining the median of the target numerical sub-range as the quantile of the total aggregation steps.

[0106] The total aggregation steps range is the interval between the maximum and minimum total aggregation steps for at least one key. This range helps determine the distribution of the data and how to divide the data into sub-ranges for more efficient estimation of the total aggregation steps quantiles. Understanding the range allows for more accurate interval partitioning, reducing computational complexity. For example, if the total aggregation steps for all aggregate keys in the cache space are 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100, then the total aggregation steps range is from 10 to 100.

[0107] Numerical subranges divide the total number of aggregated steps into multiple smaller intervals, each called a numerical subrange. This division simplifies the quantile calculation process. By dividing the data into multiple intervals, the number of data points within each interval can be quickly counted, thus estimating the interval containing the quantile. This method reduces the computational complexity from O(N^2) to O(N). For example, if the total number of aggregated steps ranges from 10 to 100, it can be divided into 10 numerical subranges: Subrange 1: [10, 19]; Subrange 2: [20, 29]; Subrange 3: [30, 39]; Subrange 4: [40, 49]; Subrange 5: [50, 59]; Subrange 6: [60, 69]; Subrange 7: [70, 79]; Subrange 8: [80, 89]; Subrange 9: [90, 99]; Subrange 10: [100, 109].

[0108] By counting how many total aggregation step values fall within each numerical subrange, the interval in which the quantile of the total aggregation steps falls can be quickly determined. For example, the total aggregation steps are: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100. These values are placed into the aforementioned 10 subranges, and the number of values within each subrange is counted as follows: Subrange 1: [10, 19] – 1; Subrange 2: [20, 29] – 1; Subrange 3: [30, 39] – 1; Subrange 4: [40, 49] – 1; Subrange 5: [50, 59] – 1; Subrange 6: [60, 69] – 1; Subrange 7: [70, 79] – 1; Subrange 8: [80, 89] – 1; Subrange 9: [90, 99] – 1; Subrange 10: [100, 109] – 1.

[0109] The target numerical subrange is the range of values containing the quantile, determined based on the preset aggregation degree and the counts of total aggregation steps. By accumulating the counts of the numerical subranges in order, the subrange at which the accumulated count reaches the quantile requirement can be determined; this subrange is the target numerical subrange. For example, if the preset aggregation degree G is 6, (G-1) / G is calculated as the quantile of the total aggregation steps, i.e., the 83rd percentile, and each subrange contains one value. Counts are accumulated until the 83rd percentile is reached: the accumulated count through the first 8 subranges is 8, accounting for 80% of the total; the accumulated count through the 9th subrange is 9, accounting for 90% of the total. Therefore, the 9th subrange [90, 99] is the target numerical subrange.

[0110] The median of a target subrange refers to the total number of aggregation steps located in the middle position within that subrange. This number is used to estimate the specific value of the quantile. By taking the median of the target subrange, the value of the quantile can be approximately determined, thus avoiding complex sorting operations. For example, if the target subrange is [90, 99], and there is only one value, 90, within this subrange, then the median of the target subrange is 90. If there are multiple values within the subrange, such as [90, 91, 92, 93, 94, 95, 96, 97, 98, 99], then the median is (94 + 95) / 2 = 94.5.
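The estimation procedure (divide the range, count per subrange, locate the target subrange, take its median) can be sketched in two linear passes. The function name `estimate_quantile`, the equal-width binning, and the default of 10 subranges are assumptions of this sketch; the median is taken over the values that actually fall in the target subrange, following the examples above.

```python
import statistics

def estimate_quantile(total_steps, g, num_bins=10):
    lo, hi = min(total_steps), max(total_steps)
    width = max(1, -(-(hi - lo + 1) // num_bins))   # ceiling division
    counts = [0] * num_bins
    for steps in total_steps:                       # pass 1: histogram
        counts[(steps - lo) // width] += 1
    needed = (g - 1) / g * len(total_steps)         # quantile as a count
    seen = 0
    for index, count in enumerate(counts):
        seen += count
        if seen >= needed:                          # target subrange found
            break
    # Pass 2: gather only the values inside the target subrange.
    members = [s for s in total_steps if (s - lo) // width == index]
    return statistics.median(members)

estimate = estimate_quantile(list(range(10, 101, 10)), g=6)
```

For the ten values 10 through 100 the target subrange is [90, 99] and the estimate is 90. Only the members of one subrange are ever sorted, inside `statistics.median`.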

[0111] In this embodiment of the disclosure, the value of the total aggregation step quantile is estimated in O(N) complexity, which reduces the performance consumption of determining the total aggregation step quantile. This allows for effective management of cache space, ensuring that keys with high aggregation degree are processed on local nodes first, while keys with low aggregation degree are directly transmitted to the central node, thereby improving the overall performance and data processing efficiency of the system.

[0112] In one optional embodiment of this disclosure, a cache cleanup strategy is provided. The popularity weight includes the total number of aggregation steps. When the cache space size reaches a preset threshold, a target key is determined from at least one key based on the popularity weight of each data item corresponding to at least one key in the cache space. This includes the following specific steps: When the cache space size reaches a preset threshold, based on a preset aggregation degree, the total number of aggregation steps for at least one key in the cache space is traversed to determine the target key from at least one key.

[0113] It should be noted that keys whose total aggregation steps are lower than the preset aggregation degree G are identified as target keys and are cleared. This guards against the case where the aggregation column is distributed extremely unevenly and the value at the (G-1) / G quantile is itself still low, which would make low-cardinality buckets difficult to clean up using the quantile alone.

[0114] For example, the total aggregation steps for all aggregation keys in the cache space are 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100. The preset aggregation degree G is set to 60 in this example. Traversing these total aggregation steps, it is found that the keys with total aggregation steps of 10, 20, 30, 40, and 50 are below the preset aggregation degree G. Therefore, the data items corresponding to these keys will be cleared and directly transmitted to the central node for global aggregation.
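The traversal in this strategy amounts to a single linear filter, sketched below. The function name `keys_below_degree` and the key names are hypothetical, and the threshold of 60 is an assumption chosen so that the counts 10 through 50 fall below it, which is the outcome the example states.

```python
def keys_below_degree(total_steps, degree):
    """Return keys whose total aggregation steps are below the given degree."""
    return [key for key, steps in total_steps.items() if steps < degree]

# Hypothetical key names; the counts are the ten values from the example.
total_steps = {f"key{n}": n for n in range(10, 101, 10)}
targets = keys_below_degree(total_steps, degree=60)
```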

[0115] In this embodiment of the disclosure, by introducing a preset aggregation degree, data items in the cache space are dynamically managed to ensure that data items corresponding to keys with low aggregation degree are promptly cleared and transmitted to the central node for global aggregation. This avoids the cache space being occupied by low cardinality keys for a long time, improves cache utilization and overall system performance, reduces unnecessary data transmission, optimizes the efficiency of local aggregation, and significantly improves the data processing capability of the distributed system.

[0116] In one optional embodiment of this disclosure, a cache cleanup strategy is provided. The popularity weight includes the interval number of steps since the last aggregation. When the cache space size reaches a preset threshold, a target key is determined from at least one key based on the popularity weight of each data item corresponding to at least one key in the cache space. This includes the following specific steps: When the cache space size reaches a preset threshold, the target key is determined by traversing the interval number of at least one key according to a preset interval number threshold.

[0117] The preset interval step threshold is a threshold used to measure the frequency of activity of an aggregation key. If the interval step count for an aggregation key exceeds the preset interval step threshold, it means that the aggregation key has not been executed by aggregation operations for a long time. This may be because all identical key values have already been aggregated locally, or the key values are widely distributed across the table. In this case, the data items corresponding to these keys can be cleared to free up cache space.

[0118] It should be noted that keys with an interval exceeding a preset threshold are identified as target keys and are cleared. This prevents keys with large cardinality but whose identical key values have already been fully aggregated locally, or keys whose key values are widely distributed across the table, from occupying local aggregation memory space for an extended period.

[0119] For example, the interval steps for all aggregation keys in the cache space are 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100. The preset interval step threshold is set to the total number of updates across 3 rounds of cache cleanup, with 30 updates between each round of cache cleanup, i.e., 30 × 3 = 90. Traversing these interval steps, the keys with interval steps of 90 and 100 reach or exceed the preset interval step threshold. Therefore, the data items corresponding to these keys will be cleared and directly transmitted to the central node for global aggregation.
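The interval-based cleanup can be sketched as follows. Two points are assumptions of this sketch: the interval is computed as the gap between the global step counter and the step at which the key was last aggregated, and the comparison is written as `>=` so that a gap equal to the threshold also qualifies, which matches the example's selection of 90 against a threshold of 90 (a strict `>` is the other possible reading). The names `stale_keys` and `key10` ... `key100` are hypothetical.

```python
def stale_keys(last_update_step, current_step, threshold):
    """Return keys not aggregated within the last `threshold` steps."""
    return [key for key, last in last_update_step.items()
            if current_step - last >= threshold]

current_step = 100
# Build last-update steps so that the gaps are 10, 20, ..., 100.
last_update = {f"key{gap}": current_step - gap for gap in range(10, 101, 10)}
stale = stale_keys(last_update, current_step, threshold=30 * 3)
```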

[0120] In this embodiment of the disclosure, by introducing a preset interval step threshold, data items in the cache space are dynamically managed to ensure that data items corresponding to keys that have not been active for a long time are promptly cleared and transmitted to the central node for global aggregation. This avoids the cache space being occupied by inactive keys for a long time, improves cache utilization and overall system performance, optimizes the efficiency of local aggregation, reduces unnecessary data transmission, and significantly improves the data processing capability and response speed of the distributed system.

[0121] In one optional embodiment of this disclosure, after step 402, the following specific steps are further included: determining whether the frequency of the first key exceeds a preset threshold; if the frequency of the first key does not exceed the preset threshold, transmitting multiple data to be aggregated to the central node for aggregation to obtain global aggregated data; correspondingly, step 404 includes the following specific steps: if the frequency of the first key exceeds the preset threshold, aggregating multiple data to be aggregated to obtain local aggregated data, and updating the popularity weight of the first key.

[0122] The frequency of the first key is the number of times the data item corresponding to the first key appears within a preset time period. This frequency reflects the activity level of the first key in the data stream. The frequency of the first key is used to evaluate the size and generation speed of the data corresponding to this aggregation key. By monitoring this frequency, it is possible to dynamically decide whether to use a one-stage or two-stage aggregation strategy to process the data items corresponding to the first key, thereby reducing network transmission and resource overhead of the central node. High-frequency keys usually mean a large data volume and a fast generation speed, making them suitable for local aggregation followed by global aggregation; while low-frequency keys may be more suitable for direct global aggregation.

[0123] The preset threshold is used to determine whether the frequency of the first key reaches the local aggregation standard. It can be set according to specific application scenarios and needs. The preset threshold is a parameter that distinguishes between high cardinality (frequency) and low cardinality (frequency) data.

[0124] For example, the preset time period is 1 hour, and the preset threshold is 10 times. Within 1 hour, seller ID "user1" appears 20,000 times, seller ID "user3" appears 7 times, seller ID "user4" appears 300 times, and seller ID "user9" appears 20 times. The frequencies of seller IDs "user1" (20,000 times / hour), "user4" (300 times / hour), and "user9" (20 times / hour) exceed the preset threshold of 10 times / hour, while the frequency of seller ID "user3" (7 times / hour) does not. For seller ID "user3", all data items of seller ID "user3" are transmitted to two central nodes. The two central nodes perform aggregation function calculations on all data items of seller ID "user3" from 4 local nodes to obtain global aggregated data: seller ID 'user3', number of orders 300. For seller IDs "user1", "user4", and "user9", a hash table is created on the local node to store the local aggregation results. Iterating through 1,000,000 order data items, the hash value of the first key "seller ID" of each data item is computed, and the data item is assigned to the corresponding bucket in the hash table. Aggregation functions are calculated in each bucket to obtain the following local aggregated data: seller ID 'user1', 20,000 orders; seller ID 'user4', 300 orders; seller ID 'user9', 20 orders.
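
As a non-limiting illustration, the frequency-based routing in this example can be sketched in Python. The function name `route_by_frequency` and the variable names are hypothetical. Counting the whole batch before routing is a simplification made for this sketch; the disclosure describes monitoring the frequency within a preset time period.

```python
from collections import Counter


def route_by_frequency(seller_ids, threshold=10):
    """Split keys into locally aggregated counts and keys forwarded for global aggregation."""
    counts = Counter(seller_ids)
    # Keys whose frequency exceeds the threshold are aggregated locally.
    local_aggregates = {k: n for k, n in counts.items() if n > threshold}
    # Keys at or below the threshold are sent to the central node unaggregated.
    forwarded_keys = [k for k, n in counts.items() if n <= threshold]
    return local_aggregates, forwarded_keys


# One hour of order records on a single local node, using the counts from the example.
orders = ["user1"] * 20000 + ["user3"] * 7 + ["user4"] * 300 + ["user9"] * 20
local_aggregates, forwarded_keys = route_by_frequency(orders)
print(local_aggregates)  # {'user1': 20000, 'user4': 300, 'user9': 20}
print(forwarded_keys)    # ['user3']
```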

[0125] In this embodiment of the disclosure, by determining whether the frequency of the first key exceeds a preset threshold, a suitable aggregation strategy is flexibly selected, which effectively reduces the amount of data transmitted over the network, alleviates the load on the central node, and improves the overall system performance and efficiency.

[0126] In one optional embodiment of this disclosure, the (1 / G) keys remaining after the above three cache cleanup strategies are retained and participate in the next round of aggregation process until the cache space size reaches the preset threshold again, triggering a new round of cache cleanup.

[0127] Referring to the embodiment in Figure 4, Figure 5 shows a schematic diagram of a data aggregation method provided by an embodiment of this disclosure. As shown in Figure 5, multiple data to be aggregated in a batch are locally aggregated on two local nodes to obtain two locally aggregated data sets: Locally aggregated data set 1 is shown in Table 13: Table 13

[0128] Local aggregated data 2 is shown in Table 14: Table 14

[0129] During the continuous local aggregation process, when the hash table space of Local Aggregation 1 and Local Aggregation 2 becomes full, they begin to perform cache cleanup using the popularity weighting method. Assume an aggregation degree G = 6, a preset interval step threshold of 20,000 steps, and a current aggregation step count of 50,000. Keys whose last aggregation occurred before step 30,000 (50,000 - 20,000), that is, keys whose interval since the last aggregation exceeds 20,000 steps, will be cleared.

[0130] For local aggregation 1: the total number of steps for 'user3' to 'user7' is below the 83rd percentile of total steps, indicating low cardinality keys. The interval steps for 'user9' exceed the limit: it is a high cardinality hot value, but too much time has passed since its last update. For local aggregation 2: the total number of steps for 'user9' to 'user7' is below the 83rd percentile of total steps, indicating low cardinality keys.

[0131] The interval steps for 'user3' exceed the limit: it is a high cardinality hot value, but too much time has passed since its last update. The hash buckets corresponding to these keys will be cleared and their contents uploaded to the central node for global aggregation.
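
The two cutoffs used in this example follow from the stated parameters. The symbols below are introduced here for illustration only: q is the quantile level of the total number of aggregation steps, s_cur is the current aggregation step count, T is the preset interval step threshold, and s_min is the earliest last-aggregation step that is still retained. Taking q = 1 - 1/G is an assumption that is consistent with the 83rd percentile cited for G = 6.

```latex
q = 1 - \frac{1}{G} = 1 - \frac{1}{6} \approx 0.833,
\qquad
s_{\min} = s_{\text{cur}} - T = 50{,}000 - 20{,}000 = 30{,}000
```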

[0132] Two global aggregate data sets were obtained: Global aggregate data set 1 is shown in Table 15: Table 15

[0133] Global aggregated data 2 is shown in Table 16: Table 16

[0134] Proceed to the next round of aggregation until all local aggregation processing is complete.

[0135] As shown in Figure 5, this embodiment of the present disclosure presents a two-stage aggregation method that combines dynamic aggregation strategy and hash table management optimization. Unlike existing direct global aggregation and two-stage aggregation methods, this embodiment dynamically adjusts the aggregation strategy based on cardinality: keys with high cardinality (frequency) are preferentially aggregated locally, while keys with low cardinality (frequency) are directly uploaded to global aggregation. This approach makes local aggregation more efficient, reducing unnecessary computation and transmission. Simultaneously, this embodiment manages local aggregation results through a hash table cache eviction mechanism. Whenever the hash table space is full, hash buckets corresponding to some keys are cleaned up based on popularity weights such as occurrence frequency and most recent update time, further improving the efficiency of local aggregation. Finally, based on the occurrence frequency and time factors of aggregated keys, a popularity weight mechanism is introduced to improve the utilization efficiency of the local aggregation hash table memory space, dynamically manage hot aggregated keys, prevent cache from being occupied for a long time, and significantly improve the response capability of the distributed system. Furthermore, it is more adaptable, able to dynamically adjust in different aggregation scenarios, including high aggregation degree, low aggregation degree, and mixed high and low aggregation degrees.

[0136] Referring to Figure 6, which shows a flowchart of another data aggregation method provided in an embodiment of this disclosure, the method is applied to the central node of a distributed system, which also includes local nodes. The method includes the following specific steps: Step 602: Receive target data corresponding to a target key transmitted by the local node. The step of determining the target key at the local node includes: acquiring multiple data to be aggregated; aggregating the multiple data to be aggregated to obtain local aggregated data; updating the heat weight of the first key; and determining the target key from at least one key based on the heat weight of each data point corresponding to at least one key in the local node. Each data to be aggregated corresponds to the first key, and at least one key includes the first key. Step 604: Aggregate the target data to obtain global aggregated data.

[0137] To aggregate the target data and obtain globally aggregated data, one possible approach is to perform aggregation function calculations on the target data. Another possible approach is to use a hash table to aggregate the target data and obtain globally aggregated data. A third possible approach is a combination of the two approaches mentioned above, which is not limited here.
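
As a non-limiting illustration, the hash-table approach on the central node can be sketched in Python, where a `dict` plays the role of the hash table. The function name `global_aggregate`, the (key, partial count) record layout, and the sample values are hypothetical, and summation is assumed as the aggregation function.

```python
from collections import defaultdict


def global_aggregate(batches):
    """Merge (key, partial_count) pairs from several local nodes by summing per key."""
    totals = defaultdict(int)  # the dict acts as the hash table
    for batch in batches:
        for key, partial in batch:
            totals[key] += partial
    return dict(totals)


# Hypothetical target data uploaded by two local nodes.
node_a = [("user3", 4), ("user7", 2)]
node_b = [("user3", 3), ("user9", 5)]
merged = global_aggregate([node_a, node_b])
print(merged)  # {'user3': 7, 'user7': 2, 'user9': 5}
```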

[0138] In this embodiment of the disclosure, a hot weight mechanism for keys is introduced on the local node to reflect the frequency of aggregation of keys over a period of time. Based on the hot weight of each data in the local node corresponding to at least one key, a target key is determined from at least one key. The target keys with high and low cardinality are dynamically and flexibly distinguished. The target data corresponding to the target key is transmitted to the central node, where global aggregation is performed to obtain global aggregated data. A suitable aggregation strategy is then executed. This not only improves the efficiency of local aggregation and reduces unnecessary data transmission overhead, but also significantly improves the speed and efficiency of data aggregation in the distributed system.

[0139] It should be noted that the technical solution for data aggregation in Figure 6 is based on the same concept as the technical solution for data aggregation in Figure 4 above. For details not described in detail in the technical solution for data aggregation in Figure 6, please refer to the description of the technical solution for data aggregation in Figure 4 above.

[0140] The following description, in conjunction with Figure 7, uses the application of the data aggregation method provided in this disclosure to IoT data as an example to further illustrate the data aggregation method. Figure 7 shows a flowchart of the processing procedure of a data aggregation method applied to IoT data according to an embodiment of this disclosure. The method is applied to local nodes in a distributed IoT system, which also includes a central node. The specific steps are as follows.

[0141] Step 702: Acquire multiple IoT data collected by the sensing device, wherein any IoT data corresponds to the first key.

[0142] Step 704: If the frequency of the first key does not exceed the preset threshold, transmit multiple IoT data to the central node for aggregation to obtain global aggregated data.

[0143] Step 706: If the frequency of the first key exceeds the preset threshold, write multiple IoT data into the target Hash bucket where the first key is located, and expand the target Hash bucket when the space size of the target Hash bucket reaches the corresponding space capacity.

[0144] Step 708: Aggregate multiple IoT data sets to obtain local aggregated data, and update the total number of aggregation steps for the first key and the interval steps since the last aggregation. Updating the total number of aggregation steps for the first key and the interval steps since the last aggregation includes: if the first key is appearing for the first time, initialize the total number of aggregation steps for the first key and the interval steps for the first key; if the first key is not appearing for the first time, increment the total number of aggregation steps for the first key by one, and update the interval steps for the first key to the current number of aggregation steps. The current number of aggregation steps is incremented by one after aggregating the IoT data corresponding to any one of the at least one keys.

[0145] Step 710: When the size of the hash table space reaches a preset threshold, determine the quantile of the total number of aggregation steps according to the preset aggregation degree. Based on the quantile of the total number of aggregation steps, traverse the total number of aggregation steps for at least one key in the hash table space and determine the first target key from at least one key. According to the preset aggregation degree, traverse the total number of aggregation steps for at least one key in the hash table space and determine the second target key from at least one key. According to the preset interval step threshold, traverse the interval step for at least one key and determine the third target key from at least one key.

[0146] Step 712: Retrieve the target data corresponding to the first target key, the second target key, and the third target key from the hash buckets corresponding to the hash table space, and transmit them to the central node for aggregation to obtain global aggregated data.

[0147] Step 714: Receive IoT device operation instructions from the central node, wherein the IoT device operation instructions are generated based on globally aggregated data.
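
As a non-limiting illustration, the selection of the first and third target keys in Step 710 can be sketched in Python. All names are hypothetical. The sketch assumes that the quantile level is 1 - 1/G, that keys below the quantile are selected, that an exact lookup in a sorted list stands in for quantile estimation, and that the interval field stores the step at which the key was last aggregated, as in Step 708. The second target key is omitted because its selection rule is not detailed in this passage.

```python
def select_target_keys(weights, current_step, aggregation_degree, interval_threshold):
    """Return the keys to evict.

    weights maps key -> (total_steps, last_step), a hypothetical layout.
    """
    totals = sorted(total for total, _ in weights.values())
    # Index of the (1 - 1/G) quantile, computed with integer arithmetic.
    idx = min(len(totals) - 1,
              (len(totals) * (aggregation_degree - 1)) // aggregation_degree)
    quantile = totals[idx]
    # First target keys: rarely aggregated keys below the quantile.
    low_activity = {k for k, (total, _) in weights.items() if total < quantile}
    # Third target keys: keys not aggregated within the interval threshold.
    stale = {k for k, (_, last) in weights.items()
             if current_step - last > interval_threshold}
    return low_activity | stale


# Ten rarely seen sensors, one hot and recent sensor, one hot but stale sensor.
weights = {f"sensor{i}": (i, 49_000) for i in range(1, 11)}
weights["hot_recent"] = (500, 49_990)
weights["hot_stale"] = (400, 20_000)
targets = select_target_keys(weights, current_step=50_000,
                             aggregation_degree=6, interval_threshold=20_000)
print(sorted(set(weights) - targets))  # ['hot_recent']
```

With these sample values only the hot, recently aggregated key stays in the hash table; the data for every other key would be transmitted to the central node in Step 712.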

[0148] In this embodiment, the following effects are achieved:
1. Dynamic Aggregation Strategy: A dynamic hybrid aggregation mode is adopted, which can automatically adjust the aggregation strategy according to the aggregation degree of IoT data. This flexibility makes the system more efficient in processing low-aggregation data and hot data. The ability to adjust the aggregation strategy dynamically is particularly important in the IoT environment, where data sources are numerous and diverse.
2. Cache Management: By introducing an update mechanism that uses the total number of aggregation steps and the interval steps since the last aggregation as the popularity weight, the cache cleanup strategy can be determined based on the frequency of IoT data occurrence and its update time. This mechanism ensures that frequently accessed data is preferentially retained in the cache, which not only improves the efficiency of cache utilization but also significantly reduces unnecessary memory access operations. This is particularly important for resource-constrained IoT devices.
3. Hot Data Processing: Dynamic management can be achieved for high cardinality hot aggregation keys commonly found in IoT, avoiding the problem of hot data occupying cache resources for a long time. This greatly enhances the system's responsiveness and processing efficiency in the face of a large number of concurrent requests or data surges.
4. Adaptability: The aggregation strategy can be dynamically adjusted according to different aggregation characteristics of IoT data (such as high aggregation degree, low aggregation degree, or a mixture of both). This flexibility not only helps improve the efficiency of data processing, but also better supports the diverse needs of IoT applications, such as real-time monitoring and data analysis.

[0149] Corresponding to the above method embodiments, this disclosure also provides a distributed system embodiment. Figure 8 shows a schematic diagram of the structure of a distributed system provided by an embodiment of this disclosure. As shown in Figure 8, the distributed system 800 includes a local node 810 and a central node 820. The local node 810 is used to acquire multiple data to be aggregated, aggregate the multiple data to be aggregated to obtain local aggregated data, and update the heat weight of the first key. According to the heat weight of each data in the local node 810 corresponding to at least one key, a target key is determined from at least one key, and the target data corresponding to the target key is transmitted to the central node 820. Here, any data to be aggregated corresponds to the first key, and at least one key includes the first key. The central node 820 is used to receive the target data corresponding to the target key transmitted by the local node 810, aggregate the target data, and obtain global aggregated data.

[0150] In this embodiment of the disclosure, for the "one-stage aggregation" strategy and the "two-stage aggregation" strategy in the distributed system, a key heat weighting mechanism is introduced on the local node to reflect the frequency of key aggregation over a period of time. Based on the heat weight of each data corresponding to at least one key in the local node, the target key is determined from at least one key. The target keys with high and low cardinality are dynamically and flexibly distinguished. The target data corresponding to the target key is transmitted to the central node, where global aggregation is performed to obtain global aggregated data. A suitable aggregation strategy is then executed. This not only improves the efficiency of local aggregation and reduces unnecessary data transmission overhead, but also significantly improves the speed and efficiency of data aggregation in the distributed system.

[0151] The above is an illustrative scheme of a distributed system according to this embodiment. It should be noted that the technical solution of this distributed system and the technical solution of the data aggregation method described above belong to the same concept. For details not described in detail in the technical solution of the distributed system, please refer to the description of the technical solution of the data aggregation method described above.

[0152] Corresponding to the above method embodiments, this disclosure also provides a data aggregation device embodiment. Figure 9 shows a schematic diagram of the structure of a data aggregation device provided in one embodiment of this disclosure. As shown in Figure 9, the device is applied to a local node of a distributed system, which also includes a central node. The device includes: an acquisition module 902 configured to acquire multiple data to be aggregated, wherein any data to be aggregated corresponds to a first key; a local aggregation module 904 configured to aggregate the multiple data to be aggregated to obtain local aggregated data and update the heat weight of the first key; a determination module 906 configured to determine a target key from at least one key based on the heat weight of each data in the local node corresponding to at least one key, wherein at least one key includes the first key; and a first global aggregation module 908 configured to transmit the target data corresponding to the target key to the central node for aggregation to obtain global aggregated data.

[0153] Optionally, the local node includes a cache space; the device further includes a writing module configured to write multiple data to be aggregated into the cache space; correspondingly, the determining module 906 is further configured to: when the size of the cache space reaches a preset threshold, determine a target key from at least one key according to the heat weight of each data in the cache space corresponding to at least one key; correspondingly, the first global aggregation module 908 is further configured to: retrieve the target data corresponding to the target key from the cache space and transmit it to the central node for aggregation to obtain global aggregated data.

[0154] Optionally, the cache space is pre-divided into multiple cache subspaces, each of which has a corresponding space capacity; correspondingly, the write module is further configured to: write multiple data to be aggregated into the target cache subspace where the first key is located; and expand the target cache subspace when the space size of the target cache subspace reaches the corresponding space capacity.
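
As a non-limiting illustration, the expansion of a cache subspace can be sketched in Python. The class name `CacheSubspace` and the sample readings are hypothetical, and doubling the capacity is an assumption; the disclosure states only that the subspace is expanded when its size reaches its capacity.

```python
class CacheSubspace:
    """Hypothetical cache subspace (hash bucket) with an explicit capacity."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.items = []

    def write(self, item):
        self.items.append(item)
        # Expand once the occupied size reaches the capacity (doubling is assumed).
        if len(self.items) >= self.capacity:
            self.capacity *= 2


bucket = CacheSubspace(capacity=2)
for reading in [21.5, 21.7, 22.0, 22.4, 22.1]:
    bucket.write(reading)
print(len(bucket.items), bucket.capacity)  # 5 8
```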

[0155] Optionally, the popularity weight of the first key includes the total number of aggregation steps and the interval steps since the last aggregation; correspondingly, the local aggregation module 904 is further configured to: if the first key is appearing for the first time, initialize the total number of aggregation steps and the interval steps of the first key; if the first key is not appearing for the first time, increment the total number of aggregation steps of the first key by one, and update the interval steps of the first key to the current number of aggregation steps, wherein the current number of aggregation steps is incremented by one after aggregating the data to be aggregated corresponding to any one of the at least one keys.
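
As a non-limiting illustration, the popularity weight update can be sketched in Python. The class name `PopularityTracker` and its field layout are hypothetical. The initial values on first appearance (a total of 1 and the current step as the interval field) are assumptions, since the disclosure states only that both values are initialized.

```python
class PopularityTracker:
    """Tracks, per key, the total aggregation steps and the step of the last aggregation."""

    def __init__(self):
        self.current_step = 0
        self.weights = {}  # key -> [total_steps, interval_steps]

    def record(self, key):
        if key not in self.weights:
            # First appearance: initialize both fields (initial values are assumed).
            self.weights[key] = [1, self.current_step]
        else:
            self.weights[key][0] += 1                  # one more aggregation step
            self.weights[key][1] = self.current_step   # refresh the interval field
        # The global step counter advances after aggregating data for any key.
        self.current_step += 1


tracker = PopularityTracker()
for key in ["a", "b", "a"]:
    tracker.record(key)
print(tracker.weights)       # {'a': [2, 2], 'b': [1, 1]}
print(tracker.current_step)  # 3
```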

[0156] Optionally, the heat weight includes the total number of aggregation steps; correspondingly, the determining module 906 is further configured to: when the size of the cache space reaches a preset threshold, determine the quantile of the total number of aggregation steps according to the preset aggregation degree; based on the quantile of the total number of aggregation steps, traverse the total number of aggregation steps of at least one key in the cache space, and determine the target key from at least one key.

[0157] Optionally, the determining module 906 is further configured to: divide the numerical range of the total aggregation steps of at least one key in the cache space to obtain multiple numerical sub-ranges; count the number of total aggregation steps in the multiple numerical sub-ranges, and estimate the target numerical sub-range where the quantile of the total aggregation steps is located based on the number and a preset aggregation degree; and determine the median of the target numerical sub-range as the quantile of the total aggregation steps.
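
As a non-limiting illustration, the sub-range-based quantile estimate can be sketched in Python. The function name `estimate_quantile`, the use of equal-width sub-ranges, and the default of 10 sub-ranges are hypothetical. The sketch assumes a quantile level of 1 - 1/G and takes the midpoint of the target sub-range as its median.

```python
def estimate_quantile(total_steps, aggregation_degree, num_bins=10):
    """Estimate the (1 - 1/G) quantile of total aggregation steps from bin counts."""
    lo, hi = min(total_steps), max(total_steps)
    if lo == hi:
        return float(lo)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for value in total_steps:
        # The maximum value is clamped into the last sub-range.
        idx = min(int((value - lo) / width), num_bins - 1)
        counts[idx] += 1
    # Rank that the quantile must reach (assumed level: 1 - 1/G).
    target_rank = len(total_steps) * (aggregation_degree - 1) / aggregation_degree
    seen = 0
    for i, count in enumerate(counts):
        seen += count
        if seen >= target_rank:
            return lo + (i + 0.5) * width  # midpoint of the target sub-range
    return float(hi)


estimate = estimate_quantile(list(range(1, 101)), aggregation_degree=5)
print(estimate)
```

Because only one pass over the counts is needed, this avoids sorting every key's total number of aggregation steps, at the cost of an estimate that is accurate only to within one sub-range.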

[0158] Optionally, the heat weight includes the total number of aggregation steps; correspondingly, the determination module 906 is further configured to: when the size of the cache space reaches a preset threshold, according to the preset aggregation degree, traverse the total number of aggregation steps of at least one key in the cache space, and determine the target key from at least one key.

[0159] Optionally, the heat weight includes the number of interval steps since the last aggregation; correspondingly, the determination module 906 is further configured to: when the size of the cache space reaches a preset threshold, traverse the interval steps of at least one key according to the preset interval step threshold, and determine the target key from at least one key.

[0160] Optionally, the device further includes: a second global aggregation module, configured to determine whether the frequency of the first key exceeds a preset threshold and, if the frequency of the first key does not exceed the preset threshold, to transmit the multiple data to be aggregated to the central node for aggregation to obtain global aggregated data; correspondingly, the local aggregation module 904 is further configured to: if the frequency of the first key exceeds the preset threshold, aggregate the multiple data to be aggregated to obtain local aggregated data, and update the popularity weight of the first key.

[0161] In this embodiment of the disclosure, by introducing a key heat weighting mechanism on the local node, the frequency of key aggregation over a period of time is reflected. Based on the heat weight of each data corresponding to at least one key in the local node, the target key is determined from at least one key. The target keys with high and low cardinality are dynamically and flexibly distinguished. The target data corresponding to the target key is transmitted to the central node for global aggregation to obtain global aggregated data. A suitable aggregation strategy is then executed, which not only improves the efficiency of local aggregation and reduces unnecessary data transmission overhead, but also significantly improves the speed and efficiency of data aggregation in the distributed system.

[0162] The above is an illustrative scheme of a data aggregation device according to this embodiment. It should be noted that the technical solution of this data aggregation device and the technical solution of the data aggregation method described above belong to the same concept. For details not described in detail in the technical solution of the data aggregation device, please refer to the description of the technical solution of the data aggregation method described above.

[0163] Corresponding to the above method embodiments, this disclosure also provides another data aggregation device embodiment. Figure 10 shows a schematic diagram of another data aggregation device provided in an embodiment of this disclosure. As shown in Figure 10, the device is applied to the central node of a distributed system, which also includes local nodes. The device includes: a receiving module 1002, configured to receive target data corresponding to a target key transmitted by the local node, wherein the step of determining the target key at the local node includes: acquiring multiple data to be aggregated, aggregating the multiple data to be aggregated to obtain local aggregated data, and updating the heat weight of the first key; and determining the target key from at least one key based on the heat weight of each data corresponding to at least one key in the local node, wherein any data to be aggregated corresponds to the first key, and at least one key includes the first key; and a global aggregation module 1004, configured to aggregate the target data to obtain global aggregated data.

[0164] In this embodiment of the disclosure, a hot weight mechanism for keys is introduced on the local node to reflect the frequency of aggregation of keys over a period of time. Based on the hot weight of each data in the local node corresponding to at least one key, a target key is determined from at least one key. The target keys with high and low cardinality are dynamically and flexibly distinguished. The target data corresponding to the target key is transmitted to the central node, where global aggregation is performed to obtain global aggregated data. A suitable aggregation strategy is then executed. This not only improves the efficiency of local aggregation and reduces unnecessary data transmission overhead, but also significantly improves the speed and efficiency of data aggregation in the distributed system.

[0165] The above is an illustrative scheme of another data aggregation device in this embodiment. It should be noted that the technical solution of this data aggregation device and the technical solution of the other data aggregation method described above belong to the same concept. For details not described in detail in the technical solution of the data aggregation device, please refer to the description of the technical solution of the data aggregation method described above.

[0166] Figure 11 shows a structural block diagram of a computing device according to an embodiment of the present disclosure. The components of the computing device 1100 include, but are not limited to, a memory 1110 and a processor 1120. The processor 1120 is connected to the memory 1110 via a bus 1130, and a database 1150 is used to store data.

[0167] The computing device 1100 also includes an access device 1140, which enables the computing device 1100 to communicate via one or more networks 1160. Examples of these networks include a Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 1140 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Controller (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a Near Field Communication (NFC) interface.

[0168] In one embodiment of this disclosure, the aforementioned components of the computing device 1100, as well as other components not shown in Figure 11, may be interconnected, for example, via a bus. It should be understood that the computing device block diagram shown in Figure 11 is merely for illustrative purposes and is not intended to limit the scope of this disclosure. Those skilled in the art can add or replace other components as needed.

[0169] The computing device 1100 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or personal computers (PCs). The computing device 1100 can also be a mobile or stationary server.

[0170] The processor 1120 is configured to execute a computer program / instructions which, when executed by the processor, implement the steps of the above-described data aggregation method.

[0171] The above is an illustrative scheme of a computing device according to this embodiment. It should be noted that the technical solution of this computing device and the technical solution of the data aggregation method described above belong to the same concept. For details not described in detail in the technical solution of the computing device, please refer to the description of the technical solution of the data aggregation method described above.

[0172] This disclosure also provides a computer-readable storage medium storing a computer program / instructions that, when executed by a processor, implement the steps of the data aggregation method described above. The above is an illustrative scheme of a computer-readable storage medium according to this embodiment. It should be noted that the technical solution of this storage medium and the technical solution of the data aggregation method described above belong to the same concept; details not described in detail in the technical solution of the storage medium can be found in the description of the technical solution of the data aggregation method described above.

[0173] An embodiment of this disclosure also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the data aggregation method described above.

[0174] The above is an illustrative scheme of a computer program product according to this embodiment. It should be noted that the technical solution of this computer program product and the technical solution of the data aggregation method described above belong to the same concept. For details not described in detail in the technical solution of the computer program product, please refer to the description of the technical solution of the data aggregation method described above.

[0175] The foregoing has described specific embodiments of this disclosure. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

[0176] The computer instructions include computer program code, which may be in the form of source code, object code, an executable file, or certain intermediate forms. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content included in the computer-readable medium may be appropriately added or removed according to the requirements of patent practice. For example, in some regions, according to patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

[0177] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of this disclosure are not limited to the described order of actions, because according to the embodiments of this disclosure, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to the embodiments of this disclosure.

[0178] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0179] The preferred embodiments disclosed above are merely illustrative of this disclosure. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the embodiments of this disclosure. These embodiments are selected and specifically described in this disclosure to better explain the principles and practical applications of the embodiments of this disclosure, thereby enabling those skilled in the art to better understand and utilize this disclosure. This disclosure is limited only by the claims and their full scope and equivalents.

Claims

1. A data aggregation method applied to a local node of a distributed system, wherein the distributed system further includes a central node, the method comprising: acquiring a plurality of data to be aggregated, wherein any one of the data to be aggregated corresponds to a first key; aggregating the plurality of data to be aggregated to obtain local aggregated data, and updating a heat weight of the first key; determining a target key from at least one key according to the heat weight of the at least one key corresponding to the data in the local node, wherein the at least one key includes the first key; and transmitting target data corresponding to the target key to the central node for aggregation to obtain global aggregated data.
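
By way of non-limiting illustration, the flow recited in claim 1 might be sketched as follows. The sketch assumes that aggregation is a simple sum, that the heat weight is a count of local aggregation steps, and that keys whose heat weight falls below a cutoff are chosen as target keys (consistent with keeping hot data local and sending colder data to the central node). The names `LocalNode`, `select_targets`, `flush` and `min_heat` are hypothetical and do not appear in this disclosure.

```python
from collections import defaultdict


class LocalNode:
    """Hypothetical local node: sums values per key and tracks a simple heat weight."""

    def __init__(self):
        self.partial = defaultdict(int)  # key -> locally aggregated value (sum assumed)
        self.heat = defaultdict(int)     # key -> number of local aggregation steps (assumed weight)

    def aggregate(self, key, values):
        # Aggregate the incoming data locally and update the key's heat weight.
        self.partial[key] += sum(values)
        self.heat[key] += 1

    def select_targets(self, min_heat):
        # Assumption: keys with a heat weight below min_heat are the target keys.
        return [k for k, h in self.heat.items() if h < min_heat]

    def flush(self, central, min_heat):
        # Transmit the target data to the central store, which merges it globally.
        for key in self.select_targets(min_heat):
            central[key] = central.get(key, 0) + self.partial.pop(key)
            del self.heat[key]
```

In this sketch `central` is a plain dictionary standing in for the central node; in practice the transmission would cross a network boundary.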

2. The method according to claim 1, wherein the local node includes a cache space; after acquiring the plurality of data to be aggregated, the method further comprises: writing the plurality of data to be aggregated into the cache space; the determining of the target key from the at least one key according to the heat weight of the at least one key corresponding to the data in the local node comprises: when the size of the cache space reaches a preset threshold, determining the target key from the at least one key according to the heat weight of the at least one key corresponding to the data in the cache space; and the transmitting of the target data corresponding to the target key to the central node for aggregation to obtain global aggregated data comprises: retrieving the target data corresponding to the target key from the cache space and transmitting the target data to the central node for aggregation to obtain the global aggregated data.
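
As a hedged illustration of the threshold-triggered selection in claim 2, the following sketch measures the size of the cache space as a count of cached values and, once the preset threshold is reached, retrieves the data of the coldest key for transmission. The class `BoundedCache`, the entry-count size measure, and the choice of a single coldest key are assumptions made for the example only.

```python
class BoundedCache:
    """Hypothetical cache space whose size is counted in cached values."""

    def __init__(self, threshold):
        self.threshold = threshold  # preset cache-size threshold (entry count assumed)
        self.values = {}            # key -> list of cached values
        self.heat = {}              # key -> number of writes (assumed heat weight)
        self.size = 0

    def write(self, key, values):
        # Write the data into the cache and report whether the threshold is reached.
        self.values.setdefault(key, []).extend(values)
        self.heat[key] = self.heat.get(key, 0) + 1
        self.size += len(values)
        return self.size >= self.threshold

    def evict_coldest(self):
        # Pick the key with the lowest heat weight and remove its data from the cache.
        target = min(self.heat, key=self.heat.get)
        data = self.values.pop(target)
        del self.heat[target]
        self.size -= len(data)
        return target, data
```

The caller would transmit the returned `(target, data)` pair to the central node whenever `write` returns `True`.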

3. The method according to claim 2, wherein the cache space is pre-divided into a plurality of cache subspaces, and each cache subspace has a corresponding space capacity; the writing of the plurality of data to be aggregated into the cache space comprises: writing the plurality of data to be aggregated into a target cache subspace where the first key is located; and when the target cache subspace reaches its corresponding space capacity, expanding the target cache subspace.
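
One possible, non-authoritative reading of claim 3 is sketched below. It assumes that the target cache subspace is chosen by hashing the key and that expansion doubles the subspace capacity when a write finds the subspace full; neither choice is stated in this disclosure, and the names `CacheSubspace` and `PartitionedCache` are hypothetical.

```python
class CacheSubspace:
    """Hypothetical cache subspace with its own capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def write(self, key, value):
        # Assumption: a full subspace is expanded by doubling its capacity.
        if len(self.entries) >= self.capacity:
            self.capacity *= 2
        self.entries.append((key, value))


class PartitionedCache:
    """Hypothetical cache space pre-divided into several subspaces."""

    def __init__(self, num_subspaces=4, capacity=2):
        self.subspaces = [CacheSubspace(capacity) for _ in range(num_subspaces)]

    def subspace_for(self, key):
        # Assumption: the subspace where a key is located is chosen by hashing the key.
        return self.subspaces[hash(key) % len(self.subspaces)]

    def write(self, key, values):
        target = self.subspace_for(key)
        for value in values:
            target.write(key, value)
        return target
```

Expanding only the subspace that filled up leaves the other subspaces untouched, which is the behavior the claim describes for the target cache subspace.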

4. The method according to claim 1, wherein the heat weight of the first key includes a total number of aggregation steps and a number of interval steps since the last aggregation; the updating of the heat weight of the first key comprises: if the first key appears for the first time, initializing the total number of aggregation steps of the first key and the number of interval steps of the first key; and if the first key does not appear for the first time, incrementing the total number of aggregation steps of the first key by one and updating the number of interval steps of the first key to the current aggregation step number, wherein the current aggregation step number is incremented by one after the data to be aggregated corresponding to any one of the at least one key is aggregated.
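
The update rule of claim 4 can be illustrated with the following minimal sketch. The initial values (a total of 1 and the current step number) are assumptions, since the claim only states that the two fields are initialized; the names `HeatWeight`, `HeatTracker`, `record` and `interval` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HeatWeight:
    total_steps: int  # total number of aggregation steps for the key
    last_step: int    # aggregation step number at which the key was last aggregated


class HeatTracker:
    """Hypothetical tracker for per-key heat weights."""

    def __init__(self):
        self.current_step = 0  # current aggregation step number
        self.weights = {}

    def record(self, key):
        weight = self.weights.get(key)
        if weight is None:
            # First appearance: initialize both fields (initial values assumed).
            self.weights[key] = HeatWeight(total_steps=1, last_step=self.current_step)
        else:
            weight.total_steps += 1
            weight.last_step = self.current_step
        # The step counter advances after aggregating data for any key.
        self.current_step += 1

    def interval(self, key):
        # Steps elapsed since the key was last aggregated.
        return self.current_step - self.weights[key].last_step
```

Storing the step number of the last aggregation, rather than a running interval, means that only the aggregated key needs to be touched on each step.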

5. The method according to any one of claims 2-3, wherein the heat weight includes a total number of aggregation steps; the determining, when the size of the cache space reaches the preset threshold, of the target key from the at least one key according to the heat weight of the at least one key corresponding to the data in the cache space comprises: when the size of the cache space reaches the preset threshold, determining a quantile of the total number of aggregation steps according to a preset aggregation degree; and traversing, based on the quantile, the total number of aggregation steps of the at least one key in the cache space to determine the target key from the at least one key.

6. The method according to claim 5, wherein the determining of the quantile of the total number of aggregation steps according to the preset aggregation degree comprises: dividing the numerical range of the total number of aggregation steps of the at least one key in the cache space to obtain a plurality of numerical subranges; counting the total numbers of aggregation steps that fall within each of the plurality of numerical subranges, and estimating, based on the counts and the preset aggregation degree, a target numerical subrange in which the quantile of the total number of aggregation steps is located; and determining the median of the target numerical subrange as the quantile of the total number of aggregation steps.
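
The histogram-based estimate of claims 5 and 6 might be sketched as follows. The sketch assumes that the preset aggregation degree is a fraction between 0 and 1, that the numerical range is divided into equal-width subranges, that the median of a subrange is taken as its midpoint, and that keys whose total falls below the estimated quantile become target keys. The function names and the `num_bins` parameter are hypothetical.

```python
def estimate_quantile(totals, aggregation_degree, num_bins=4):
    """Estimate a quantile of the totals from an equal-width histogram."""
    lo, hi = min(totals), max(totals)
    if lo == hi:
        return float(lo)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for total in totals:
        # Clamp so that the maximum value lands in the last subrange.
        idx = min(int((total - lo) / width), num_bins - 1)
        counts[idx] += 1
    rank = aggregation_degree * len(totals)  # assumed meaning of the aggregation degree
    seen = 0
    for idx, count in enumerate(counts):
        seen += count
        if seen >= rank:
            # Midpoint of the target subrange stands in for its median.
            return lo + (idx + 0.5) * width
    return float(hi)


def select_targets(total_by_key, aggregation_degree, num_bins=4):
    """Traverse the keys and pick those whose total is below the estimated quantile."""
    quantile = estimate_quantile(list(total_by_key.values()), aggregation_degree, num_bins)
    return [key for key, total in total_by_key.items() if total < quantile]
```

The estimate needs one pass to build the counts and one pass to select keys, so the totals never have to be sorted.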

7. The method according to any one of claims 2-4, wherein the heat weight includes a total number of aggregation steps; the determining, when the size of the cache space reaches the preset threshold, of the target key from the at least one key according to the heat weight of the at least one key corresponding to the data in the cache space comprises: when the size of the cache space reaches the preset threshold, traversing the total number of aggregation steps of the at least one key in the cache space according to a preset aggregation degree to determine the target key from the at least one key.

8. The method according to any one of claims 2-4, wherein the heat weight includes a number of interval steps since the last aggregation; the determining, when the size of the cache space reaches the preset threshold, of the target key from the at least one key according to the heat weight of the at least one key corresponding to the data in the cache space comprises: when the size of the cache space reaches the preset threshold, traversing the number of interval steps of the at least one key according to a preset interval step threshold to determine the target key from the at least one key.
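
A minimal sketch of the interval-based selection in claim 8 follows. It assumes that each key stores the step number at which it was last aggregated, so that the interval is the difference from the current step, and that keys whose interval exceeds the preset interval step threshold are treated as target keys. The function name and parameters are hypothetical.

```python
def select_stale_keys(last_step_by_key, current_step, interval_threshold):
    """Return keys not aggregated within the last interval_threshold steps."""
    return [
        key
        for key, last_step in last_step_by_key.items()
        if current_step - last_step > interval_threshold
    ]
```

For example, with a current step of 10 and a threshold of 4, a key last aggregated at step 9 stays local while a key last aggregated at step 2 is selected.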

9. The method according to claim 1, further comprising, after acquiring the plurality of data to be aggregated: determining whether the frequency of the first key exceeds a preset threshold; and if the frequency of the first key does not exceed the preset threshold, transmitting the plurality of data to be aggregated to the central node for aggregation to obtain global aggregated data; wherein the aggregating of the plurality of data to be aggregated to obtain local aggregated data and the updating of the heat weight of the first key comprise: if the frequency of the first key exceeds the preset threshold, aggregating the plurality of data to be aggregated to obtain the local aggregated data, and updating the heat weight of the first key.
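
The frequency pre-check of claim 9 could look like the sketch below. How the frequency of a key is measured is not fixed by the claim; the example assumes it is the number of batches seen for that key, and it again assumes sum aggregation. The function `route_batch` and its parameters are hypothetical.

```python
from collections import Counter


def route_batch(key, values, frequency, threshold, local_state, central_outbox):
    """Route one batch either to local aggregation or straight to the central node."""
    frequency[key] += 1  # assumed frequency measure: batches seen per key
    if frequency[key] > threshold:
        # Frequent key: aggregate locally (sum assumed).
        local_state[key] = local_state.get(key, 0) + sum(values)
        return "local"
    # Infrequent key: forward the raw data for global aggregation.
    central_outbox.append((key, values))
    return "central"
```

A `Counter` is a convenient choice for `frequency` because unseen keys start at zero.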

10. A data aggregation method applied to a central node of a distributed system, wherein the distributed system further includes a local node, the method comprising: receiving target data corresponding to a target key transmitted by the local node, wherein the target key is determined at the local node by: acquiring a plurality of data to be aggregated, aggregating the plurality of data to be aggregated to obtain local aggregated data, updating a heat weight of a first key, and determining the target key from at least one key according to the heat weight of the at least one key corresponding to the data in the local node, wherein any one of the data to be aggregated corresponds to the first key, and the at least one key includes the first key; and aggregating the target data to obtain global aggregated data.
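
For the central-node side recited in claim 10, a minimal sketch is given below. It assumes that each local node sends a mapping from target key to a partial sum and that global aggregation merges these partial sums; the class `CentralNode` and its `receive` method are hypothetical.

```python
class CentralNode:
    """Hypothetical central node that merges target data from local nodes."""

    def __init__(self):
        self.global_agg = {}  # key -> globally aggregated value (sum assumed)

    def receive(self, target_data):
        # Merge the partial results for each target key into the global result.
        for key, partial in target_data.items():
            self.global_agg[key] = self.global_agg.get(key, 0) + partial
```

Because sums are associative and commutative, the order in which local nodes deliver their target data does not affect the global result in this sketch.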

11. A distributed system, comprising a local node and a central node; the local node is configured to acquire a plurality of data to be aggregated, aggregate the plurality of data to be aggregated to obtain local aggregated data, update a heat weight of a first key, determine a target key from at least one key according to the heat weight of the at least one key corresponding to the data in the local node, and transmit target data corresponding to the target key to the central node, wherein any one of the data to be aggregated corresponds to the first key, and the at least one key includes the first key; and the central node is configured to receive the target data corresponding to the target key transmitted by the local node and aggregate the target data to obtain global aggregated data.

12. A computing device, comprising: a memory and a processor; the memory is configured to store a computer program / instructions, and the processor is configured to execute the computer program / instructions, which, when executed by the processor, perform the following operations: acquiring a plurality of data to be aggregated, wherein any one of the data to be aggregated corresponds to a first key; aggregating the plurality of data to be aggregated to obtain local aggregated data, and updating a heat weight of the first key; determining a target key from at least one key according to the heat weight of the at least one key corresponding to the data in a local node of a distributed system, wherein the at least one key includes the first key, and the distributed system further includes a central node; and transmitting target data corresponding to the target key to the central node for aggregation to obtain global aggregated data.

13. The computing device according to claim 12, wherein the local node includes a cache space; after acquiring the plurality of data to be aggregated, the operations further comprise: writing the plurality of data to be aggregated into the cache space; the determining of the target key from the at least one key according to the heat weight of the at least one key corresponding to the data in the local node comprises: when the size of the cache space reaches a preset threshold, determining the target key from the at least one key according to the heat weight of the at least one key corresponding to the data in the cache space; and the transmitting of the target data corresponding to the target key to the central node for aggregation to obtain global aggregated data comprises: retrieving the target data corresponding to the target key from the cache space and transmitting the target data to the central node for aggregation to obtain the global aggregated data.

14. The computing device according to claim 13, wherein the cache space is pre-divided into a plurality of cache subspaces, and each cache subspace has a corresponding space capacity; the writing of the plurality of data to be aggregated into the cache space comprises: writing the plurality of data to be aggregated into a target cache subspace where the first key is located; and when the target cache subspace reaches its corresponding space capacity, expanding the target cache subspace.

15. The computing device according to claim 12, wherein the heat weight of the first key includes a total number of aggregation steps and a number of interval steps since the last aggregation; the updating of the heat weight of the first key comprises: if the first key appears for the first time, initializing the total number of aggregation steps of the first key and the number of interval steps of the first key; and if the first key does not appear for the first time, incrementing the total number of aggregation steps of the first key by one and updating the number of interval steps of the first key to the current aggregation step number, wherein the current aggregation step number is incremented by one after the data to be aggregated corresponding to any one of the at least one key is aggregated.

16. The computing device according to any one of claims 13-14, wherein the heat weight includes a total number of aggregation steps; the determining, when the size of the cache space reaches the preset threshold, of the target key from the at least one key according to the heat weight of the at least one key corresponding to the data in the cache space comprises: when the size of the cache space reaches the preset threshold, determining a quantile of the total number of aggregation steps according to a preset aggregation degree; and traversing, based on the quantile, the total number of aggregation steps of the at least one key in the cache space to determine the target key from the at least one key.

17. The computing device according to claim 16, wherein the determining of the quantile of the total number of aggregation steps according to the preset aggregation degree comprises: dividing the numerical range of the total number of aggregation steps of the at least one key in the cache space to obtain a plurality of numerical subranges; counting the total numbers of aggregation steps that fall within each of the plurality of numerical subranges, and estimating, based on the counts and the preset aggregation degree, a target numerical subrange in which the quantile of the total number of aggregation steps is located; and determining the median of the target numerical subrange as the quantile of the total number of aggregation steps.

18. A computing device, comprising: a memory and a processor; the memory is configured to store a computer program / instructions, and the processor is configured to execute the computer program / instructions, which, when executed by the processor, perform the following operations: receiving target data corresponding to a target key transmitted by a local node in a distributed system, wherein the target key is determined at the local node by: acquiring a plurality of data to be aggregated, aggregating the plurality of data to be aggregated to obtain local aggregated data, updating a heat weight of a first key, and determining the target key from at least one key according to the heat weight of the at least one key corresponding to the data in the local node, wherein any one of the data to be aggregated corresponds to the first key, and the at least one key includes the first key; and aggregating the target data to obtain global aggregated data.

19. A computer-readable storage medium storing a computer program / instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.

20. A computer program product comprising a computer program / instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.