A data processing system, method and apparatus
By introducing an aggregation module into the data processing system to aggregate the intermediate results of map nodes, generate aggregation files, and assign reduce tasks, the problem of high communication pressure between upstream and downstream nodes is solved, and data processing efficiency and resource utilization are improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING KINGSOFT CLOUD NETWORK TECH CO LTD
- Filing Date
- 2022-01-06
- Publication Date
- 2026-06-26
AI Technical Summary
In existing technologies, the communication pressure between upstream and downstream nodes is relatively high, resulting in high communication resource consumption and low data processing efficiency.
The aggregation module aggregates the intermediate results of the map nodes to generate an aggregation file, and then distributes the reduce tasks to the reduce nodes, reducing the communication requirements of the reduce nodes.
It effectively saves communication resources of reduce nodes, improves data processing efficiency, and reduces communication burden.
Figure CN116450326B_ABST
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a data processing system, method and apparatus.
Background Technology
[0002] As human society fully enters the information age, data has become a strategic resource as important as water and oil. By mining big data, governments and enterprises can make decisions based on more scientific evidence, thereby improving decision-making efficiency, crisis response capabilities, and the level of public services.
[0003] Big data mining typically requires the cooperation of multiple nodes. Since different nodes have different functions, data transmission between nodes needs to meet both data transfer and data distribution requirements. For example, nodes a, b, and c are upstream nodes, and node A is a downstream node. Nodes a, b, and c all output data that needs to be processed by node A. This means that node A needs to communicate with nodes a, b, and c separately, resulting in significant communication pressure on node A.
Summary of the Invention
[0004] This application provides a data processing system, method, and apparatus to solve the problem of high communication pressure between upstream and downstream nodes in the prior art.
[0005] In a first aspect, this application provides a first data processing method applied to an aggregation module. The method includes: obtaining each first intermediate result of each map node, wherein each first intermediate result contains a first key, a first value, and a correspondence between the first key and the first value; generating an aggregation file, wherein the aggregation file contains several second intermediate results, each second intermediate result containing a second key, a second value, and a correspondence between the second key and the second value, wherein the second key is selected from the first keys, and any two second keys are different; the second value corresponding to the second key is: the aggregation result of the first values corresponding to the first keys that are the same as the second key in each first intermediate result; generating each reduce task based on each aggregation file; and allocating each reduce task to each reduce node.
[0006] In an optional embodiment of this specification, generating an aggregation file includes: using one of the first intermediate results for which no aggregation result has been determined as a reference first intermediate result; using the first key of the reference first intermediate result as a reference first key; finding a first value corresponding to the same first key as the reference first key from among the first intermediate results for which no aggregation result has been determined, and using it as a first value to be processed; aggregating each first value to be processed to obtain an aggregation result corresponding to the reference first key; using the reference first key as a second key, and using the aggregation result corresponding to the reference first key as a second value corresponding to the second key to obtain a second intermediate result; re-determining the reference first intermediate result; determining the second intermediate result based on the re-determined reference first intermediate result, until no reference first intermediate result can be determined from among the first intermediate results; and obtaining the aggregation file based on each second intermediate result.
[0007] In an optional embodiment of this specification, generating each reduce task based on each aggregate file includes: for each second intermediate result, if the second value of the second intermediate result is greater than a reference value, then determining the second intermediate result as a target intermediate result, wherein the reference value is obtained by multiplying a cumulative value by a specified ratio, and the cumulative value is positively correlated with the sum of each second value; splitting the target intermediate result into a first number of second intermediate sub-results, wherein the first number is not greater than the number of reduce nodes; and generating each reduce task based on each second intermediate sub-result.
[0008] In an optional embodiment of this specification, generating each reduce task based on each aggregate file includes: for each second intermediate result, if the second value of the second intermediate result is greater than a reference value, then determining the second intermediate result as a target intermediate result, wherein the reference value is obtained by multiplying a cumulative value by a specified ratio, and the cumulative value is positively correlated with the sum of each second value; allocating the target intermediate result to reduce nodes; receiving a task addition instruction from a reduce node, wherein the task addition instruction carries a first number of second intermediate sub-results, the second intermediate sub-results being obtained by splitting the target intermediate result, wherein the first number is not greater than the number of reduce nodes; and generating each reduce task based on each second intermediate sub-result.
[0009] In an optional embodiment of this specification, the first key represents the file content of the first intermediate result to which it belongs, and the first value represents the file size of the first intermediate result to which it belongs.
[0010] In a second aspect, this application provides a second data processing method applied to a reduce node. The method includes: receiving a reduce task from an aggregation module, wherein the reduce task is obtained from an aggregation file containing several second intermediate results, each second intermediate result contains a second key, a second value, and a correspondence between the second key and the second value, the second key is selected from the first keys, any two second keys are different, the second value corresponding to the second key is the aggregation result of the first values corresponding to the first keys that are the same as the second key in each first intermediate result, each first intermediate result contains a first key, a first value, and a correspondence between the first key and the first value, and each first intermediate result is output by a map node; and executing the reduce task to generate a reduce result.
[0011] In an optional embodiment of this specification, receiving a reduce task from an aggregation module includes: receiving a target intermediate result from the aggregation module, wherein the target intermediate result is a second intermediate result belonging to a second value that is greater than a reference value, the reference value being obtained by multiplying a cumulative value by a specified ratio, and the cumulative value being positively correlated with the sum of all second values; splitting the target intermediate result into a first number of second intermediate sub-results, wherein the first number is not greater than the number of reduce nodes; generating a task addition instruction, wherein the task addition instruction adds the first number of second intermediate sub-results; sending the task addition instruction to the aggregation module; and receiving a reduce task from the aggregation module, wherein the reduce task is generated according to the task addition instruction.
[0012] In a third aspect, this application provides a first data processing apparatus applied to an aggregation module. The apparatus includes: a first intermediate result acquisition module configured to acquire each first intermediate result of each map node, wherein each first intermediate result contains a first key, a first value, and a correspondence between the first key and the first value; an aggregation file generation module configured to generate an aggregation file, wherein the aggregation file contains several second intermediate results, each second intermediate result containing a second key, a second value, and a correspondence between the second key and the second value, wherein the second key is selected from the first keys, and any two second keys are different; the second value corresponding to the second key is the aggregation result of the first values corresponding to the first keys that are the same as the second key in each first intermediate result; a reduce task generation module configured to generate each reduce task based on each aggregation file; and an allocation module configured to allocate each reduce task to each reduce node.
[0013] In a fourth aspect, this application provides a second data processing apparatus applied to a reduce node. The apparatus includes: a reduce task receiving module configured to receive a reduce task from an aggregation module, wherein the reduce task is obtained from an aggregation file containing several second intermediate results, each second intermediate result contains a second key, a second value, and a correspondence between the second key and the second value, the second key is selected from the first keys, any two second keys are different, the second value corresponding to the second key is the aggregation result of the first values corresponding to the first keys that are the same as the second key in each first intermediate result, each first intermediate result contains a first key, a first value, and a correspondence between the first key and the first value, and each first intermediate result is output by a map node; and a reduce result generation module configured to execute the reduce task and generate a reduce result.
[0014] In a fifth aspect, this application provides a data processing system, including: a map node cluster, an aggregation module, and a reduce node cluster. The aggregation module is communicatively connected to each map node in the map node cluster, and also communicatively connected to each reduce node in the reduce node cluster. The map node cluster is configured to: process raw data to obtain first intermediate results; the aggregation module is configured to: obtain each first intermediate result from each map node, wherein each first intermediate result contains a first key, a first value, and a correspondence between the first key and the first value; generate an aggregation file, wherein the aggregation file contains several second intermediate results, each second intermediate result containing a second key, a second value, and a correspondence between the second key and the second value, wherein the second key is selected from the first keys, and any two second keys are different; the second value corresponding to the second key is: the aggregation result of the first values corresponding to the first keys that are the same as the second key in each first intermediate result; generate each reduce task according to each aggregation file; and allocate each reduce task to each reduce node; the reduce node cluster is configured to: receive the reduce tasks from the aggregation module, execute the reduce tasks, and generate reduce results.
[0015] In an optional embodiment of this specification, the map node cluster configuration is further as follows: the original data is serialized to obtain data to be processed; the data to be processed is divided into a second number of data blocks, wherein the second number is the number of map nodes in the map node cluster, and each map node in the map node cluster processes each data block to obtain each first intermediate result.
[0016] In a sixth aspect, this application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
[0017] the memory is configured to store a computer program; and
[0018] the processor is configured to, when executing the program stored in the memory, implement the steps of any data processing method in the first aspect or the steps of any data processing method in the second aspect.
[0019] In a seventh aspect, this application provides a computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of any data processing method in the first aspect, or the steps of any data processing method in the second aspect.
[0020] The technical solutions provided in this application have the following advantages compared with the prior art:
[0021] The data processing method and apparatus described in this specification involve an aggregation module that aggregates the first intermediate results output by each map node to obtain an aggregation file containing several second intermediate results. Then, reduce tasks are generated based on the second intermediate results in the aggregation file and assigned to reduce nodes. For a reduce node, obtaining a reduce task only requires one communication, effectively saving communication resources.
Attached Figure Description
[0022] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.
[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0024] Figure 1 is a schematic diagram illustrating the communication between upstream and downstream nodes in related technologies;
[0025] Figure 2 is a schematic diagram of a data processing system architecture provided in this application;
[0026] Figure 3 is a flowchart illustrating a data processing procedure provided in an embodiment of this application;
[0027] Figure 4 is a schematic diagram of a process for generating an aggregation file provided in this application;
[0028] Figure 5 is a schematic diagram of a first data processing apparatus provided in an embodiment of this application;
[0029] Figure 6 is a schematic diagram of a second data processing apparatus provided in an embodiment of this application;
[0030] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0031] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0032] In related technologies, big data mining typically requires cooperation between upstream and downstream nodes. However, when upstream and downstream nodes have different functions, a strict one-to-one correspondence may not exist between them. For example, in the scenario shown in Figure 1, nodes a, b, and c are upstream nodes, and nodes A and B are downstream nodes. Node a outputs intermediate results a1 and a2, node b outputs intermediate results b1 and b2, and node c outputs intermediate result c1. Intermediate results a1, b1, and c1 need to be processed by node A, while intermediate results a2 and b2 need to be processed by node B. Node A therefore needs to communicate three times (once each with nodes a, b, and c) to obtain intermediate results a1, b1, and c1; node B needs to communicate twice (once each with nodes a and b) to obtain intermediate results a2 and b2.
[0033] If there are many upstream nodes, the communication between downstream and upstream nodes will become more complex, specifically manifested in frequent communication between them. This frequent communication consumes node communication resources and also impacts data processing efficiency.
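As an illustration only, the communication counts in the Figure 1 scenario can be reproduced with a short Python sketch. The routing table `ROUTES` and the function `direct_fetches` are hypothetical names introduced here for illustration; they are not part of this application.

```python
from collections import defaultdict

# Hypothetical routing table for the Figure 1 scenario:
# (upstream node, intermediate result) -> downstream node that must process it.
ROUTES = {
    ("a", "a1"): "A",
    ("a", "a2"): "B",
    ("b", "b1"): "A",
    ("b", "b2"): "B",
    ("c", "c1"): "A",
}

def direct_fetches(routes):
    """Count how many distinct upstream nodes each downstream node must contact."""
    peers = defaultdict(set)
    for (upstream, _result), downstream in routes.items():
        peers[downstream].add(upstream)
    return {node: len(upstreams) for node, upstreams in peers.items()}

counts = direct_fetches(ROUTES)
print(counts)  # {'A': 3, 'B': 2}
```

Under this model, the count for each downstream node grows with the number of upstream nodes that produce data for it, which is the communication pressure described above.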
[0034] In view of this, this specification provides a data processing system and method to address, to some extent, the technical problem of high communication pressure between upstream and downstream nodes in related technologies.
[0035] The data processing system described in this specification includes: a map node cluster, an aggregation module, and a reduce node cluster. The aggregation module is communicatively connected to each map node in the map node cluster and to each reduce node in the reduce node cluster. As shown in Figure 2, map nodes 1 to k are connected to the aggregation module, and reduce nodes 1 to n are also connected to the aggregation module. This data processing system performs the following steps during data processing:
[0036] S300: The map node processes the raw data to obtain the first intermediate results.
[0037] The data processing system described in this specification may contain multiple map nodes. Each map node may execute this step to obtain the first intermediate result corresponding to that map node.
[0038] A map node may output one or more first intermediate results. Each first intermediate result contains a first key, a first value, and the correspondence between the first key and the first value.
[0039] For example, map node 1 outputs first intermediate result 11 and first intermediate result 21. First intermediate result 11 contains the first key 11, the first value 11, and the correspondence between the first key 11 and the first value 11. First intermediate result 21 contains the first key 21, the first value 21, and the correspondence between the first key 21 and the first value 21. Map node 2 outputs first intermediate result 12. First intermediate result 12 contains the first key 12, the first value 12, and the correspondence between the first key 12 and the first value 12.
[0040] The first keys of the first intermediate results output by the same map node are all different; that is, first key 11 and first key 21 are different. The first keys of the first intermediate results output by different map nodes may be the same. That is, first key 11 may be the same as first key 12. In the following description, we will take the case where first key 11 and first key 12 are the same as an example.
[0041] First intermediate results output by the same map node can be distinguished by their first key. First intermediate results output by different map nodes can be aggregated by their first key (e.g., aggregate first intermediate results with the same first key).
[0042] In an optional embodiment of this specification, the first key may be obtained from the content of the original data to identify the first intermediate result. The first value may be obtained from the size of the original data.
[0043] It should be noted that this specification does not impose a specific limit on the number of map nodes included in the data processing system. In actual data processing, the number of map nodes in the data processing system can be increased or decreased according to actual needs.
[0044] S302: The aggregation module obtains the first intermediate results of each map node.
[0045] In this specification, the aggregation module communicates with each map node separately, so the aggregation module can obtain each first intermediate result through the communication link with the map node.
[0046] S304: The aggregation module generates aggregation files.
[0047] The aggregation file in this specification contains several second intermediate results. Each second intermediate result contains a second key, a second value, and the correspondence between the second key and the second value. The second key is selected from the first key. Any two second keys are different. The second value corresponding to the second key is the aggregation result of the first value corresponding to the first key that is the same as the second key in each first intermediate result.
[0048] The second keys of different second intermediate results are all different, and the second key serves to distinguish each second intermediate result. Different second keys may correspond to the same second value.
[0049] In this specification, aggregation means combining the first values of all first intermediate results with the same first key. For example, the first key 11 of first intermediate result 11 and the first key 12 of first intermediate result 12 are the same, both being ABC. The first value 11 of first intermediate result 11 is 8, and the first value 12 of first intermediate result 12 is 10. Aggregation of the first values 11 and 12 can be achieved by summing them, resulting in the second value 18. Alternatively, aggregation can be performed in other ways, such as averaging, resulting in the second value 9. Or, the aggregation result can be an array of first values with the same first key, in which case the second value can be represented as (8, 10).
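The three aggregation options in the example above can be sketched in Python as follows. This is a minimal sketch that assumes first intermediate results are represented as `(first key, first value)` pairs; the function name `aggregate` and the variable names are illustrative assumptions, not terms of this application.

```python
from collections import defaultdict

# First intermediate results 11 and 12 from the example: same first key "ABC".
first_results = [("ABC", 8), ("ABC", 10)]

def aggregate(results, combine):
    """Group first values by first key, then combine each group into a second value."""
    grouped = defaultdict(list)
    for key, value in results:
        grouped[key].append(value)
    return {key: combine(values) for key, values in grouped.items()}

by_sum = aggregate(first_results, sum)                                  # second value 18
by_mean = aggregate(first_results, lambda vals: sum(vals) / len(vals))  # second value 9.0
by_array = aggregate(first_results, tuple)                              # second value (8, 10)
```

Each returned dictionary maps a second key to its second value, so no two second keys can coincide.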
[0050] In an optional embodiment of this specification, the aggregation file may be a dictionary file that records the various second intermediate results.
[0051] S306: The aggregation module generates each reduce task based on each aggregation file.
[0052] The second intermediate result obtained through the above steps is the smallest unit for generating reduce tasks. In an optional embodiment of this specification, the second intermediate result corresponds one-to-one with a reduce task. For example, a second intermediate result can be directly used as a reduce task; in this case, a reduce task is a second intermediate result that can be assigned to a reduce node.
[0053] Specifically, the process of the aggregation module generating reduce tasks can be as follows: read the aggregation file and identify each second intermediate result in the aggregation file. For each identified second intermediate result, generate the corresponding reduce task.
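A minimal Python sketch of this one-to-one generation is given below. It assumes, purely for illustration, that the dictionary file is stored as JSON text; the `ReduceTask` class and the `generate_reduce_tasks` function are hypothetical names, and this specification does not prescribe a file format.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ReduceTask:
    """Hypothetical task record: one reduce task per second intermediate result."""
    key: str
    value: object

def generate_reduce_tasks(aggregation_file_text):
    """Read the aggregation file and emit one reduce task per second intermediate result."""
    second_results = json.loads(aggregation_file_text)
    return [ReduceTask(key, value) for key, value in second_results.items()]

# Assumed on-disk form of the aggregation file: a JSON object of key -> value.
aggregation_file_text = json.dumps({"ABC": 18, "DEF": 5})
tasks = generate_reduce_tasks(aggregation_file_text)
```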
[0054] S308: The aggregation module distributes each reduce task to each reduce node.
[0055] After generating each reduce task, the aggregation module distributes the reduce tasks to each reduce node, so that each reduce node executes at least one reduce task.
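This specification does not fix a distribution policy. The sketch below assumes a simple round-robin assignment, which gives every reduce node at least one reduce task whenever there are at least as many tasks as nodes; the function name `distribute` and the node labels are hypothetical.

```python
def distribute(tasks, reduce_nodes):
    """Assign reduce tasks to reduce nodes in round-robin order (assumed policy)."""
    if not reduce_nodes:
        raise ValueError("at least one reduce node is required")
    assignment = {node: [] for node in reduce_nodes}
    for index, task in enumerate(tasks):
        node = reduce_nodes[index % len(reduce_nodes)]
        assignment[node].append(task)
    return assignment

assignment = distribute(["task1", "task2", "task3"], ["reduce1", "reduce2"])
```

Each reduce node then receives its list of tasks from the aggregation module only, regardless of how many map nodes contributed to them.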
[0056] As can be seen, in the data processing of this specification, for the reduce node, communication is required when the reduce node obtains a reduce task from the aggregation module. However, during the generation of the reduce task, the communication resources of the reduce node can be largely released. For example, if reduce node 1 only receives one reduce task 1, then reduce node 1 only needs to communicate once to obtain reduce task 1. Even if reduce task 1 is obtained based on the first value 11 and the first value 12, it is not necessary to communicate separately for the first value 11 and the first value 12 as in related technologies.
[0057] Furthermore, in the data processing of this specification, the aggregation of the first value 11 and the first value 12 is performed by the aggregation module, and the reduce node does not need to perform the aggregation step, which can effectively save the data processing resources of the reduce node.
[0058] S310: The reduce node receives reduce tasks from the aggregation module.
[0059] S312: The reduce node executes the reduce task and generates the reduce result.
[0060] As can be seen, the data processing method and system in this specification involve an aggregation module that aggregates the first intermediate results output by each map node to obtain an aggregation file containing several second intermediate results. Then, reduce tasks are generated based on the second intermediate results in the aggregation file and assigned to reduce nodes. For a reduce node, obtaining a reduce task only requires one communication, effectively saving communication resources.
[0061] As described above, in this specification, the aggregation operation is mainly performed by the aggregation module. The following describes how the aggregation module specifically performs aggregation in one embodiment. In this embodiment, the aggregation module performs the following steps:
[0062] S400: The aggregation module uses one of the first intermediate results for which no aggregation result has been determined as the reference first intermediate result. If no reference first intermediate result can be determined, then proceed to step S410.
[0063] In an optional embodiment of this specification, the aggregation module maintains an intermediate file containing several rows, each row consisting of three fields, one per column. Each row corresponds one-to-one with a first intermediate result. After the aggregation module receives a first intermediate result, it identifies a row that has not yet received a first intermediate result (e.g., if a row is empty, it is determined that the row has not received a first intermediate result) as the target row. The first key of the first intermediate result is added to the field in the first column of the target row, and the first value of the first intermediate result is added to the field in the second column of the target row. This process continues until all received first intermediate results have been added to the intermediate file. A first character is added to the field in the third column of each row. The first character indicates that no aggregation result has yet been determined for the corresponding first intermediate result.
[0064] During aggregation, the aggregation module reads each row of the intermediate file. If the field in the third column of a row holds the first character, the aggregation module determines that the first intermediate result recorded in that row is the reference first intermediate result. A second character is then written to the third-column field of the row of the reference first intermediate result. The second character indicates that the aggregation result of the corresponding first intermediate result has been determined.
[0065] In another optional embodiment of this specification, during the aggregation process, after each aggregation result corresponding to a reference first intermediate result is determined, the reference first intermediate result is deleted, and the remaining first intermediate results are all first intermediate results for which no aggregation result has been determined.
[0066] S402: The aggregation module uses the first key of the reference first intermediate result as the reference first key.
[0067] S404: The aggregation module finds, among the first intermediate results for which no aggregation result has been determined, each first value whose first key is the same as the reference first key, and uses it as a first value to be processed.
[0068] At this point, since no aggregation result has yet been determined for the reference first intermediate result, the first value of the reference first intermediate result is also a first value to be processed.
[0069] S406: The aggregation module aggregates each first value to be processed to obtain an aggregation result corresponding to the reference first key.
[0070] At this point, the aggregated result corresponding to the first intermediate result to which each first value to be processed belongs has been determined. In the aforementioned optional embodiment, the characters filled in the field corresponding to the third column of the intermediate file can be updated (specifically, the second character can be filled in the field corresponding to the third column of the first intermediate result to which each first value to be processed belongs); or, the first intermediate result to which each first value to be processed belongs can be deleted.
[0071] S408: The aggregation module uses the reference first key as the second key and the aggregation result corresponding to the reference first key as the second value corresponding to the second key, thus obtaining a second intermediate result. Step S400 is executed again.
[0072] S410: The aggregation module obtains the aggregation file based on each second intermediate result.
[0073] Specifically, each second key and the second value corresponding to it can be added to an aggregation file to be filled. After all the second intermediate results have been added to the aggregation file to be filled, the aggregation file is obtained.
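Steps S400 to S410 can be sketched in Python as follows. This is a sketch under stated assumptions: the intermediate file is modelled as an in-memory list of three-field rows, the flag characters `"0"` and `"1"` stand in for the unspecified first and second characters, and summation is used as the aggregation; none of these choices is mandated by this specification.

```python
PENDING = "0"  # assumed "first character": no aggregation result determined yet
DONE = "1"     # assumed "second character": aggregation result determined

def build_aggregation_file(first_results, combine=sum):
    """Run S400 to S410 over (first key, first value) pairs."""
    # Intermediate file: one row per first intermediate result, three columns.
    rows = [[key, value, PENDING] for key, value in first_results]
    aggregation_file = {}
    while True:
        # S400: pick a first intermediate result with no aggregation result yet.
        reference = next((row for row in rows if row[2] == PENDING), None)
        if reference is None:
            break  # no reference can be determined: proceed to S410
        # S402: its first key becomes the reference first key.
        reference_key = reference[0]
        # S404: collect pending rows whose first key equals the reference first key.
        to_process = [r for r in rows if r[2] == PENDING and r[0] == reference_key]
        # S406: aggregate the first values to be processed, then mark the rows.
        result = combine(r[1] for r in to_process)
        for r in to_process:
            r[2] = DONE
        # S408: reference first key -> second key, aggregation result -> second value.
        aggregation_file[reference_key] = result
    # S410: the collected second intermediate results form the aggregation file.
    return aggregation_file

aggregation_file = build_aggregation_file([("ABC", 8), ("XYZ", 3), ("ABC", 10)])
```

Because every pass marks at least the reference row as done, the loop terminates after at most one pass per distinct first key.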
[0074] In this specification, the second intermediate results used to generate the reduce task are divided using a second key as an identifier. It is possible that one second key corresponds to a smaller second value, while another second key corresponds to a larger second value. In actual data processing, this phenomenon may lead to data skew between reduce nodes.
[0075] To avoid this phenomenon, in an optional embodiment of this specification, the aggregation module splits the second intermediate result with the larger second value, and then obtains the reduce task based on the split result.
[0076] Specifically, for each second intermediate result, if the second value of the second intermediate result is greater than the reference value, then the aggregation module determines the second intermediate result as the target intermediate result. The target intermediate result is then divided into a first number of second intermediate sub-results. If the second value of the second intermediate result is not greater than the reference value, then a reduce task is directly generated based on the second intermediate result.
[0077] The reference value is obtained by multiplying the cumulative value by a specified ratio. The cumulative value is positively correlated with the sum of all second values. Optionally, the sum of all second values can be directly used as the cumulative value. The first number is not greater than the number of reduce nodes. Each reduce task is generated based on each second intermediate sub-result.
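The threshold test and the split can be sketched in Python as follows. The sketch assumes integer second values, uses the sum of all second values as the cumulative value (the optional choice mentioned above), and splits a target intermediate result into near-equal parts; the even split, the `#part` key suffix, and the function name `plan_reduce_tasks` are illustrative assumptions.

```python
def plan_reduce_tasks(aggregation_file, specified_ratio, first_number, reduce_node_count):
    """Split second intermediate results whose second value exceeds the reference value."""
    if not 1 <= first_number <= reduce_node_count:
        raise ValueError("first number must be between 1 and the number of reduce nodes")
    cumulative_value = sum(aggregation_file.values())
    reference_value = cumulative_value * specified_ratio
    tasks = []
    for key, value in aggregation_file.items():
        if value > reference_value:
            # Target intermediate result: split into first_number sub-results.
            share, remainder = divmod(value, first_number)
            for part in range(first_number):
                size = share + (1 if part < remainder else 0)
                tasks.append((f"{key}#{part}", size))
        else:
            # Not greater than the reference value: used directly as a reduce task.
            tasks.append((key, value))
    return tasks

tasks = plan_reduce_tasks(
    {"ABC": 18, "XYZ": 3, "DEF": 4},
    specified_ratio=0.5,
    first_number=3,
    reduce_node_count=4,
)
```

Here the cumulative value is 25 and the reference value is 12.5, so only the result with key `ABC` is split.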
[0078] In an optional embodiment of this specification, the specified ratio can be positively correlated with the number of currently idle reduce nodes. The specified ratio represents the disaster recovery capability of the data processing system for reduce tasks; the larger the specified ratio, the greater the disaster recovery capability.
[0079] In this specification, reduce nodes include reduce nodes in a working state and reduce nodes in an idle state. Optionally, the first number can be determined based on the number of reduce nodes in a working state, and the reduce tasks generated from the second intermediate sub-results obtained by the split are then assigned to the reduce nodes in a working state. Further optionally, the first number can be determined based on the number of reduce nodes in an idle state, and the reduce tasks generated from the second intermediate sub-results are then assigned to the reduce nodes in an idle state. Further optionally, if half of the second value of the second intermediate result is greater than the reference value, the first number is determined based on the total number of reduce nodes in a working state and reduce nodes in an idle state.
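One possible reading of these options is sketched below in Python. The sketch assumes that "determined based on" means "set equal to" the relevant node count and that the caller chooses between the working-state and idle-state options through a hypothetical `prefer` parameter; this specification leaves both points open.

```python
def choose_first_number(second_value, reference_value, working, idle, prefer="idle"):
    """Pick the first number from reduce node counts (assumed: equal to the count)."""
    if second_value / 2 > reference_value:
        # Half of the second value still exceeds the reference value:
        # use working and idle reduce nodes together.
        return working + idle
    chosen = idle if prefer == "idle" else working
    return max(1, chosen)  # always split into at least one sub-result

large = choose_first_number(30, 12.5, working=2, idle=3)
idle_only = choose_first_number(18, 12.5, working=2, idle=3)
working_only = choose_first_number(18, 12.5, working=2, idle=3, prefer="working")
```

In every branch the returned first number does not exceed the total number of reduce nodes, consistent with the constraint stated above.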
[0080] In this embodiment, the splitting of the target intermediate results is performed by the aggregation module, which avoids increasing the workload of the reduce nodes and avoids the splitting steps occupying the communication resources of the reduce nodes.
[0081] To avoid data skew, in another optional embodiment of this specification, a reduce node splits a second intermediate result whose second value is relatively large, and the aggregation module then obtains the reduce tasks based on the split result.
[0082] Specifically, for each second intermediate result, if the second value of the second intermediate result is greater than a reference value, then the aggregation module determines the second intermediate result as the target intermediate result. The reference value is obtained by multiplying a cumulative value by a specified ratio, and the cumulative value is positively correlated with the sum of all second values. Then, the aggregation module allocates the target intermediate result to the reduce nodes. If the second value of the second intermediate result is not greater than the reference value, then a reduce task is directly generated based on the second intermediate result.
[0083] Optionally, the aggregation module may assign the target intermediate result to one of the idle reduce nodes in each reduce node to avoid disturbing the reduce nodes that are in the working state; or, the aggregation module may assign the target intermediate result to the reduce node that is in the working state and has been in the working state for the shortest time to avoid wasting resources caused by switching the working state of the reduce node.
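The two alternative selection strategies in [0083] can be sketched as follows. The `ReduceNode` record and its `seconds_in_working_state` field are assumptions made for illustration; the specification does not say how working time is tracked.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReduceNode:
    name: str
    working: bool
    seconds_in_working_state: float = 0.0  # hypothetical bookkeeping field


def pick_idle_node(nodes: List[ReduceNode]) -> Optional[ReduceNode]:
    """First strategy: choose an idle node so that working nodes are not disturbed."""
    return next((n for n in nodes if not n.working), None)


def pick_shortest_working_node(nodes: List[ReduceNode]) -> Optional[ReduceNode]:
    """Second strategy: choose the working node that has been working for the shortest time."""
    working = [n for n in nodes if n.working]
    return min(working, key=lambda n: n.seconds_in_working_state, default=None)


nodes = [
    ReduceNode("r1", working=True, seconds_in_working_state=120.0),
    ReduceNode("r2", working=False),
    ReduceNode("r3", working=True, seconds_in_working_state=15.0),
]
idle_choice = pick_idle_node(nodes)
working_choice = pick_shortest_working_node(nodes)
```

Both functions return `None` when no node of the required kind exists, leaving the fallback policy to the caller.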
[0084] The reduce node receives the target intermediate result from the aggregation module. The reduce node splits the target intermediate result into a first number of second intermediate sub-results, where the first number is no greater than the number of reduce nodes. Then, the reduce node generates a task addition instruction, which includes the first number of second intermediate sub-results. The reduce node sends the task addition instruction to the aggregation module.
[0085] The aggregation module receives a task addition instruction from the reduce nodes. This instruction carries a first number of second intermediate sub-results, which are obtained by splitting the target intermediate result. The first number is no greater than the number of reduce nodes. Then, the aggregation module generates each reduce task based on each second intermediate sub-result.
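The exchange in [0084] and [0085] can be sketched as a round trip between a reduce node and the aggregation module. The sketch assumes that a target intermediate result is a `(key, value)` pair whose value is an integer data volume split into nearly equal parts. `TaskAdditionInstruction`, `reduce_node_split`, `aggregation_module_on_instruction` and the task dictionary layout are hypothetical names, not structures defined by the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

SubResult = Tuple[str, int]  # (sub-key, value); layout assumed for illustration


@dataclass
class TaskAdditionInstruction:
    sub_results: List[SubResult]


def reduce_node_split(target: Tuple[str, int], first_number: int,
                      reduce_node_count: int) -> TaskAdditionInstruction:
    """Reduce-node side: split the target and wrap the sub-results in an instruction."""
    if not 1 <= first_number <= reduce_node_count:
        raise ValueError("first number must be between 1 and the number of reduce nodes")
    key, value = target
    base, extra = divmod(value, first_number)
    subs = [(f"{key}#{i}", base + (1 if i < extra else 0)) for i in range(first_number)]
    return TaskAdditionInstruction(subs)


def aggregation_module_on_instruction(instruction: TaskAdditionInstruction) -> List[Dict]:
    """Aggregation-module side: generate one reduce task per second intermediate sub-result."""
    return [{"task_id": i, "input": sub} for i, sub in enumerate(instruction.sub_results)]


instruction = reduce_node_split(("k", 10), first_number=3, reduce_node_count=4)
new_tasks = aggregation_module_on_instruction(instruction)
```

Transport between the two sides (for example, a network call) is omitted; the sketch only shows what the instruction carries and how tasks are derived from it.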
[0086] In this embodiment, the splitting of the target intermediate result is performed by one of the reduce nodes. The communication burden and resource consumption caused by the splitting step are only borne by the reduce node that performs the splitting. This can avoid occupying the communication and data processing resources of other reduce nodes and can also reduce the data processing burden of the aggregation module to a certain extent.
[0087] The volume of each second intermediate sub-result obtained after splitting the target intermediate result is reduced compared to the original target intermediate result. Based on this, the data volume of the reduce task will not be too large, which can effectively avoid the phenomenon of data skew.
[0088] In further optional embodiments of this specification, consideration is given not only to reducing the communication burden on reduce nodes but also to avoiding excessive communication burden on the aggregation module. The aggregation module determines whether the number of reduce nodes it is connected to exceeds a threshold (preset value). If the determination result is yes, it indicates that the aggregation module is under significant communication pressure, and the aggregation module performs a splitting step to avoid further exacerbating the communication pressure on the aggregation module through communication with reduce nodes regarding the target intermediate results. If the determination result is no, the reduce nodes perform the splitting step.
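The decision rule in [0088] reduces to a single comparison, sketched below. The names `Splitter` and `choose_splitter` are hypothetical, and "exceeds" is read here as strictly greater than the threshold.

```python
from enum import Enum


class Splitter(Enum):
    AGGREGATION_MODULE = "aggregation_module"
    REDUCE_NODE = "reduce_node"


def choose_splitter(connected_reduce_nodes: int, threshold: int) -> Splitter:
    """Decide which side performs the splitting step."""
    if connected_reduce_nodes > threshold:
        # Many connections: the aggregation module splits locally and avoids
        # the extra exchange of target intermediate results with reduce nodes.
        return Splitter.AGGREGATION_MODULE
    return Splitter.REDUCE_NODE


busy = choose_splitter(connected_reduce_nodes=12, threshold=8)
at_threshold = choose_splitter(connected_reduce_nodes=8, threshold=8)
```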
[0089] To alleviate the storage space occupation of map nodes caused by data storage, in an optional embodiment of this specification, after obtaining the original data, the primary map node in the map node cluster serializes the original data to obtain data to be processed; the data to be processed is then divided into a second number of data blocks, where the second number is the number of map nodes (including the primary map node and non-primary map nodes) in the map node cluster. Then, the primary map node distributes the data blocks to each map node, and each map node in the map node cluster processes each data block to obtain a first intermediate result.
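The serialization and division steps in [0089] can be sketched as follows. The specification does not name a serialization format or a division rule; the sketch assumes JSON encoding of each record and contiguous blocks of nearly equal length. The function names `serialize` and `divide` are hypothetical.

```python
import json
from typing import Any, List


def serialize(records: List[Any]) -> List[bytes]:
    """Serialize each original record to bytes (JSON is an assumed format)."""
    return [json.dumps(r, sort_keys=True).encode("utf-8") for r in records]


def divide(data: List[bytes], second_number: int) -> List[List[bytes]]:
    """Divide the data to be processed into `second_number` contiguous blocks."""
    if second_number < 1:
        raise ValueError("second number must be at least 1")
    base, extra = divmod(len(data), second_number)
    blocks, start = [], 0
    for i in range(second_number):
        end = start + base + (1 if i < extra else 0)
        blocks.append(data[start:end])
        start = end
    return blocks


records = [{"id": i} for i in range(7)]
to_process = serialize(records)
# The second number is the number of map nodes in the cluster (3 in this example).
blocks = divide(to_process, second_number=3)
```

Each block would then be handed to one map node, including the primary map node itself.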
[0090] Following the same line of thought, this specification further provides a first data processing device, which is applied to the aggregation module. As shown in Figure 5, the first data processing device includes one or more of the following modules:
[0091] The first intermediate result acquisition module 500 is configured to acquire each first intermediate result of each map node, wherein each first intermediate result contains a first key, a first value, and the correspondence between the first key and the first value;
[0092] The aggregation file generation module 502 is configured to generate an aggregation file, wherein the aggregation file contains a plurality of second intermediate results, each second intermediate result contains a second key, a second value, and a correspondence between the second key and the second value, the second key is selected from the first key, any two second keys are different, and the second value corresponding to the second key is: the aggregation result of the first value corresponding to the first key that is the same as the second key in each first intermediate result;
[0093] The reduce task generation module 504 is configured to generate each reduce task based on each aggregate file.
[0094] The allocation module 506 is configured to allocate each reduce task to each reduce node.
[0095] In an optional embodiment of this specification, the aggregation file generation module 502 is specifically configured as follows: taking one of the first intermediate results for which no aggregation result has been determined as a reference first intermediate result; taking the first key of the reference first intermediate result as a reference first key; finding a first value corresponding to the same first key as the reference first key among the first intermediate results for which no aggregation result has been determined, and taking it as a first value to be processed; aggregating each first value to be processed to obtain an aggregation result corresponding to the reference first key; taking the reference first key as a second key, and taking the aggregation result corresponding to the reference first key as a second value corresponding to the second key to obtain a second intermediate result; re-determining the reference first intermediate result; determining the second intermediate result based on the re-determined reference first intermediate result, until no reference first intermediate result can be determined from the first intermediate results; and obtaining the aggregation file based on each second intermediate result.
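The iterative procedure in [0095] can be sketched as below. It assumes that each first intermediate result is a single `(first key, first value)` pair and that the aggregation is a sum; the specification does not fix the aggregation operation, so `aggregate` is left as a parameter. The function name `generate_aggregation_file` is hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, int]


def generate_aggregation_file(first_results: Iterable[Pair],
                              aggregate: Callable[[List[int]], int] = sum) -> List[Pair]:
    """Build second intermediate results, one per distinct first key."""
    pending = list(first_results)  # results with no aggregation result determined yet
    aggregation_file = []
    while pending:
        # Take one pending result as the reference and use its key as the reference first key.
        reference_key = pending[0][0]
        # Collect the first values to be processed: same key, still pending.
        values = [v for k, v in pending if k == reference_key]
        # The reference key becomes the second key, the aggregate the second value.
        aggregation_file.append((reference_key, aggregate(values)))
        # Re-determine the reference among the results that remain.
        pending = [(k, v) for k, v in pending if k != reference_key]
    return aggregation_file


first_results = [("a", 3), ("b", 5), ("a", 4), ("c", 1), ("b", 2)]
aggregation_file = generate_aggregation_file(first_results)
```

Because every key is removed from `pending` once it has been aggregated, any two second keys in the output are different, which matches the requirement on the aggregation file.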
[0096] In an optional embodiment of this specification, the reduce task generation module 504 is specifically configured as follows: for each second intermediate result, if the second value of the second intermediate result is greater than a reference value, then the second intermediate result is determined as a target intermediate result, wherein the reference value is obtained by multiplying a cumulative value by a specified ratio, and the cumulative value is positively correlated with the sum of all second values; the target intermediate result is split into a first number of second intermediate sub-results, wherein the first number is not greater than the number of reduce nodes; and each reduce task is generated based on each second intermediate sub-result.
[0097] In an optional embodiment of this specification, the reduce task generation module 504 is specifically configured as follows: for each second intermediate result, if the second value of the second intermediate result is greater than a reference value, then the second intermediate result is determined as a target intermediate result, wherein the reference value is obtained by multiplying a cumulative value by a specified ratio, and the cumulative value is positively correlated with the sum of all second values; the target intermediate result is allocated to a reduce node; a task addition instruction is received from the reduce node, wherein the task addition instruction carries a first number of second intermediate sub-results, the second intermediate sub-results being obtained by splitting the target intermediate result, wherein the first number is not greater than the number of reduce nodes; and each reduce task is generated based on each second intermediate sub-result.
[0098] In an optional embodiment of this specification, the first key represents the file content of the first intermediate result to which it belongs, and the first value represents the file size of the first intermediate result to which it belongs.
[0099] Following the same approach, this specification further provides a second data processing device, which is applied to the reduce node. As shown in Figure 6, the second data processing device includes one or more of the following modules:
[0100] The reduce task receiving module 600 is configured to receive reduce tasks from the aggregation module. The reduce task is obtained from an aggregation file, which contains several second intermediate results. Each second intermediate result contains a second key, a second value, and a correspondence between the second key and the second value. The second key is selected from the first key, and any two second keys are different. The second value corresponding to the second key is the aggregation result of the first value corresponding to the first key that is the same as the second key in each first intermediate result. The first intermediate result contains a first key, a first value, and a correspondence between the first key and the first value. The first intermediate result is output by the map node.
[0101] The reduce result generation module 602 is configured to execute the reduce task and generate reduce results.
[0102] In an optional embodiment of this specification, the reduce task receiving module 600 is specifically configured to: receive a target intermediate result from the aggregation module, wherein the target intermediate result is a second intermediate result whose second value is greater than a reference value, the reference value being obtained by multiplying a cumulative value by a specified ratio, the cumulative value being positively correlated with the sum of all second values; split the target intermediate result into a first number of second intermediate sub-results, wherein the first number is not greater than the number of reduce nodes; generate a task addition instruction, wherein the task addition instruction includes the first number of second intermediate sub-results; send the task addition instruction to the aggregation module; and receive a reduce task from the aggregation module, wherein the reduce task is generated according to the task addition instruction.
[0103] As shown in Figure 7, an embodiment of this application provides a data processing device, including a processor 111, a communication interface 112, a memory 113, and a communication bus 114, wherein the processor 111, the communication interface 112, and the memory 113 communicate with each other through the communication bus 114.
[0104] Memory 113 is used to store computer programs;
[0105] In one embodiment of this application, the processor 111, when executing a program stored in the memory 113, implements the data processing method provided in any of the foregoing method embodiments.
[0106] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the data processing method provided in any of the foregoing method embodiments.
[0107] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
[0108] The above are merely specific embodiments of the present invention, enabling those skilled in the art to understand or implement the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features claimed herein.
Claims
1. A data processing system, characterized in that the system comprises a map node cluster, an aggregation module, and a reduce node cluster; the aggregation module is communicatively connected to each map node in the map node cluster, and the aggregation module is communicatively connected to each reduce node in the reduce node cluster; the map node cluster is configured to process the original data to obtain each first intermediate result; the aggregation module is configured to: obtain each first intermediate result of each map node, wherein each first intermediate result contains a first key, a first value, and a correspondence between the first key and the first value; generate an aggregation file, wherein the aggregation file contains several second intermediate results, each second intermediate result contains a second key, a second value, and a correspondence between the second key and the second value, wherein the second key is selected from the first keys, and any two second keys are different, and the second value corresponding to the second key is: the aggregation result of the first values corresponding to the first keys that are the same as the second key in each first intermediate result; generate each reduce task according to each aggregation file; and allocate each reduce task to each reduce node; the reduce node cluster is configured to receive reduce tasks from the aggregation module, execute the reduce tasks, and generate reduce results.
2. The system according to claim 1, characterized in that the map node cluster is further configured to: serialize the original data to obtain data to be processed; divide the data to be processed into a second number of data blocks, wherein the second number is the number of map nodes in the map node cluster, and each map node in the map node cluster processes each data block to obtain each first intermediate result.
3. A data processing method, characterized in that, The method is applied to the aggregation module, and the method includes: Obtain each first intermediate result of each map node, wherein each first intermediate result contains a first key, a first value, and the correspondence between the first key and the first value; Generate an aggregation file, wherein the aggregation file contains a plurality of second intermediate results, each second intermediate result contains a second key, a second value, and a correspondence between the second key and the second value, the second key is selected from the first key, any two second keys are different, and the second value corresponding to the second key is: the aggregation result of the first value corresponding to the first key that is the same as the second key in each first intermediate result; Based on each aggregate file, generate each reduce task; Each reduce task is assigned to a reduce node.
4. The method according to claim 3, characterized in that generating the aggregation file includes: taking one of the first intermediate results for which no aggregation result has been determined as a reference first intermediate result; taking the first key of the reference first intermediate result as a reference first key; among the first intermediate results for which no aggregation result has been determined, finding the first value corresponding to a first key that is the same as the reference first key, and taking it as a first value to be processed; aggregating each first value to be processed to obtain an aggregation result corresponding to the reference first key; taking the reference first key as a second key, and taking the aggregation result corresponding to the reference first key as the second value corresponding to the second key, to obtain a second intermediate result; re-determining the reference first intermediate result; determining a second intermediate result based on the re-determined reference first intermediate result, until no reference first intermediate result can be determined from the first intermediate results; and obtaining the aggregation file based on each second intermediate result.
5. The method according to claim 3, characterized in that generating each reduce task based on each aggregation file includes: for each second intermediate result, if the second value of the second intermediate result is greater than a reference value, determining the second intermediate result as a target intermediate result, wherein the reference value is obtained by multiplying a cumulative value by a specified ratio, and the cumulative value is positively correlated with the sum of the second values; splitting the target intermediate result into a first number of second intermediate sub-results, wherein the first number is not greater than the number of reduce nodes; and generating each reduce task based on each second intermediate sub-result.
6. The method according to claim 3, characterized in that generating each reduce task based on each aggregation file includes: for each second intermediate result, if the second value of the second intermediate result is greater than a reference value, determining the second intermediate result as a target intermediate result, wherein the reference value is obtained by multiplying a cumulative value by a specified ratio, and the cumulative value is positively correlated with the sum of the second values; allocating the target intermediate result to a reduce node; receiving a task addition instruction from the reduce node, wherein the task addition instruction carries a first number of second intermediate sub-results, the second intermediate sub-results being obtained by splitting the target intermediate result, wherein the first number is not greater than the number of reduce nodes; and generating each reduce task based on each second intermediate sub-result.
7. The method according to claim 3, characterized in that, The first key represents the file content of the first intermediate result to which it belongs, and the first value represents the file size of the first intermediate result to which it belongs.
8. A data processing method, characterized in that, The method is applied to reduce nodes, and the method includes: The reduce task is received from the aggregation module. The reduce task is obtained from the aggregation file, which contains several second intermediate results. Each second intermediate result contains a second key, a second value, and a correspondence between the second key and the second value. The second key is selected from the first key. Any two second keys are different. The second value corresponding to the second key is the aggregation result of the first value corresponding to the first key that is the same as the second key in each first intermediate result. The first intermediate result contains a first key, a first value, and a correspondence between the first key and the first value. The first intermediate result is output by the map node. Execute the reduce task to generate the reduce result.
9. The method according to claim 8, characterized in that receiving reduce tasks from the aggregation module includes: receiving a target intermediate result from the aggregation module, wherein the target intermediate result is a second intermediate result whose second value is greater than a reference value, the reference value is obtained by multiplying a cumulative value by a specified ratio, and the cumulative value is positively correlated with the sum of the second values; splitting the target intermediate result into a first number of second intermediate sub-results, wherein the first number is not greater than the number of reduce nodes; generating a task addition instruction, wherein the task addition instruction includes the first number of second intermediate sub-results; sending the task addition instruction to the aggregation module; and receiving a reduce task from the aggregation module, wherein the reduce task is generated according to the task addition instruction.
10. A data processing apparatus, characterized in that, The device is used in the aggregation module, and the device includes: The first intermediate result acquisition module is configured to acquire each first intermediate result of each map node, wherein each first intermediate result contains a first key, a first value, and the correspondence between the first key and the first value; The aggregation file generation module is configured to generate an aggregation file, wherein the aggregation file contains a plurality of second intermediate results, each second intermediate result contains a second key, a second value, and a correspondence between the second key and the second value, the second key is selected from the first key, any two second keys are different, and the second value corresponding to the second key is: the aggregation result of the first value corresponding to the first key that is the same as the second key in each first intermediate result; The reduce task generation module is configured to generate each reduce task based on each aggregate file. The allocation module is configured to distribute each reduce task to each reduce node.
11. A data processing apparatus, characterized in that, The device is applied to a reduce node, and the device includes: The reduce task receiving module is configured to receive reduce tasks from the aggregation module. The reduce task is obtained from an aggregation file, which contains several second intermediate results. Each second intermediate result contains a second key, a second value, and a correspondence between the second key and the second value. The second key is selected from the first key, and any two second keys are different. The second value corresponding to the second key is the aggregation result of the first value corresponding to the first key that is the same as the second key in each first intermediate result. The first intermediate result contains a first key, a first value, and a correspondence between the first key and the first value. The first intermediate result is output by the map node. The reduce result generation module is configured to execute the reduce task and generate the reduce result.
12. An electronic device, characterized in that, It includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; Memory, used to store computer programs; When executing a program stored in memory, the processor implements the steps of the data processing method according to any one of claims 3-7, or the steps of the data processing method according to any one of claims 8 and 9.
13. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the steps of the data processing method according to any one of claims 3-7, or the steps of the data processing method according to any one of claims 8 and 9.