Processor and task execution control method
By adding a task scheduling module to the processor, software tasks are intelligently allocated based on topology information and running status, solving the problem of cross-node remote access caused by the randomness of task allocation in traditional RISC-V processors, and improving task execution efficiency and network performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANDONG YUNHAI GUOCHUANG CLOUD COMPUTING EQUIP IND INNOVATION CENT CO LTD
- Filing Date
- 2026-05-25
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional processors based on open instruction set architectures lack the ability to perceive the topology and operating status of on-chip interconnect networks in multi-node RISC-V systems. This leads to high randomness in software task allocation, resulting in a large number of cross-node remote accesses and low task execution efficiency.
A task scheduling module is added to the processor to intelligently allocate the process position of software tasks by sensing the topology information and operating status of the on-chip interconnect network, so as to minimize the cross-node communication cost. This includes node task allocation control, on-chip network performance scheduling, and node virtual reconfiguration control.
It effectively reduces the cross-node communication cost of software tasks, improves the efficiency of processors in executing software tasks, and optimizes the performance and flexibility of on-chip networks in multi-core RISC-V CPUs.
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a processor and a task execution control method.
Background Technology
[0002] With the growing use of the fifth-generation Reduced Instruction Set Computing (RISC-V) architecture in multi-node on-chip networks, multi-node RISC-V processors often employ a mesh-topology on-chip interconnect network to interconnect the computing nodes on the processor. However, in traditional solutions, software tasks are allocated among RISC-V nodes at random, which causes a large number of cross-node remote accesses during software task execution, excessive access latency, and low overall software task execution efficiency.
Summary of the Invention
[0003] This invention provides a processor and a task execution control method to at least solve the problem in related technologies where the random task allocation mechanism of processors based on open instruction set architecture leads to a large number of cross-node remote accesses, resulting in low task execution efficiency.
[0004] This invention provides a processor, including an on-chip interconnect network and a task scheduling module. The on-chip interconnect network includes multiple nodes, each node having a computing node and a routing node connected to the computing node, and the multiple routing nodes are interconnected. The task scheduling module is communicatively connected to the nodes and is used to determine the allocation positions of multiple processes of a software task in the on-chip interconnect network based on the topology information and the operating status of the on-chip interconnect network, so as to minimize the cross-node communication cost when the multiple processes are executed.
[0005] The present invention also provides a task execution control method, applied to the task scheduling module in the above-mentioned processor, comprising: determining the topology information of the processor's on-chip interconnect network and the operating status of the on-chip interconnect network; and determining, based on the topology information and the operating status of the on-chip interconnect network, the allocation positions of multiple processes of a software task in the on-chip interconnect network, so as to minimize the cross-node communication cost when the multiple processes are executed.
[0006] This invention adds a hardware module for task scheduling to the processor. This task scheduling module is communicatively connected to nodes in the processor's on-chip interconnect network (IPN). Based on the topology and operating status of the IPN, it determines the allocation positions of multiple processes of a software task within the IPN, minimizing the cross-node communication cost during the execution of the multiple processes. This solves the problem that traditional processors based on open instruction set architectures cannot perceive the topology and operating status of the IPN, and therefore fall back on random task allocation, which causes numerous cross-node remote accesses and low task execution efficiency. This invention reduces the cross-node communication cost between multiple processes of a software task and the time spent on cross-node communication, thereby improving the processor's efficiency in executing software tasks.
Attached Figure Description
[0007] To more clearly illustrate the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0008] Figure 1 is an architecture diagram of the on-chip interconnect network of a multi-node RISC-V CPU in a traditional solution; Figure 2 is a schematic diagram of the structure of a processor provided in an embodiment of the present invention; Figure 3 is a schematic diagram of the structure of a node task allocation control module provided in an embodiment of the present invention; Figure 4 is a schematic diagram of the structure of an on-chip network performance scheduling module provided in an embodiment of the present invention; Figure 5 is a schematic diagram of the structure of a node virtual reconfiguration control module provided in an embodiment of the present invention; Figure 6 is a flowchart of a task execution control method provided in an embodiment of the present invention.
Detailed Implementation
[0009] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of the present invention.
[0010] It should be noted that, in the description of this invention, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., used in this invention are used to distinguish similar objects and are not used to describe a specific order or sequence.
[0011] To enable those skilled in the art to better understand the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0012] Here, we will first explain some key terms used in the embodiments of the present invention.
[0013] The fifth-generation Reduced Instruction Set Computing (RISC-V) architecture is an open instruction set architecture based on the principles of reduced instruction set computing. It is modular and extensible, allowing users to customize instructions for their application scenarios.
[0014] Figure 1 is an architecture diagram of the on-chip interconnect network of a multi-node RISC-V CPU in a traditional solution.
[0015] As shown in Figure 1, taking a 4×4 mesh with 16 nodes as an example, traditional multi-node RISC-V CPUs use a mesh-structured on-chip network for interconnection. Each node consists of two parts: a compute node and a routing node. In Figure 1, the compute nodes are compute nodes 0 to 15, and the routing nodes are routing nodes 0 to 15. Each compute node (RV Node) contains one or more RISC-V cores that share an L2 cache and form a cluster. The function of a compute node is to execute software tasks. Each compute node is connected to a routing node. When a process on a compute node needs to access cached data in another node, the routing node performs forwarding and routing calculations for the relevant access protocol, thus enabling cross-node access. A common routing algorithm is the XY routing algorithm, which starts from the source node, first moves along the X direction to the column of the destination node, and then moves along the Y direction to the row of the destination node. For example, to access the destination node at coordinates (2,3) from the source node at coordinates (0,0) (i.e., node 0 accessing node 11, with coordinates given as (row, column)), the traditional access path first moves along the X direction to column 3 and then moves along the Y direction to the destination node: node 0 → node 1 → node 2 → node 3 → node 7 → node 11.
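The XY routing example above can be sketched in a few lines of Python. This is a minimal illustration, assuming row-major node numbering on a 4×4 mesh (node = row × 4 + column), which is consistent with the node 0 → node 11 path given in the text; the function name `xy_route` is hypothetical and not taken from the patent.

```python
def xy_route(src: int, dst: int, cols: int = 4) -> list[int]:
    """Return the node IDs visited by XY routing on a row-major mesh.

    Assumption: node ID = row * cols + column.
    """
    row, col = divmod(src, cols)
    dst_row, dst_col = divmod(dst, cols)
    path = [src]
    # Phase 1: move along the X direction until the destination column is reached.
    while col != dst_col:
        col += 1 if dst_col > col else -1
        path.append(row * cols + col)
    # Phase 2: move along the Y direction until the destination row is reached.
    while row != dst_row:
        row += 1 if dst_row > row else -1
        path.append(row * cols + col)
    return path


print(xy_route(0, 11))  # [0, 1, 2, 3, 7, 11]
```

The hop count of a path is the number of nodes visited minus one, so node 0 to node 11 takes 5 hops.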
[0016] However, in this traditional approach, software tasks are allocated to computing nodes in the on-chip network at random, which leads to numerous cross-node remote accesses during software task execution, excessive access latency, and low overall software task execution efficiency. For example, if the processes of one software task are distributed across node 0 and node 15, the two computing nodes must traverse a long path to access each other's data.
[0017] Analysis revealed that this drawback stems from the distributed structure of the on-chip interconnect network between computing nodes. The scheduling mechanism of the operating system running on the computing nodes lacks the ability to perceive the topology information and operating status of the on-chip interconnect network, resulting in the use of coarse-grained scheduling mechanisms such as random allocation. This leads to a large number of cross-node remote accesses, wasting a significant amount of time on cross-node communication and also wasting communication resources, ultimately resulting in low task execution efficiency.
[0018] Random task allocation in processors based on open instruction set architectures leads to numerous cross-node remote accesses and therefore low task execution efficiency. To address this problem, the processor and task execution control method provided in this invention add a hardware module for task scheduling to the processor. This task scheduling module is communicatively connected to nodes in the processor's on-chip interconnect network (IPN). Based on the topology and operating status of the IPN, it determines the allocation positions of multiple processes of a software task within the IPN, minimizing the cross-node communication cost during the execution of the multiple processes. This solves the problem that traditional processors based on open instruction set architectures cannot perceive the topology and operating status of the IPN. It reduces the cross-node communication cost between multiple processes of a software task and the time spent on cross-node communication, and thus improves the processor's efficiency in executing software tasks.
[0019] Figure 2 is a schematic diagram of the structure of a processor provided in an embodiment of the present invention.
[0020] As shown in Figure 2, the processor provided in this embodiment of the invention may include an on-chip interconnect network and a task scheduling module. The on-chip interconnect network includes multiple nodes, each node having a computing node and a routing node connected to the computing node, and the multiple routing nodes are interconnected. The task scheduling module is communicatively connected to the nodes and is used to determine the allocation positions of multiple processes of a software task in the on-chip interconnect network according to the topology information and the operating status of the on-chip interconnect network, so as to minimize the cross-node communication cost when the multiple processes are executed.
[0021] In practical implementation, for a description of the on-chip interconnect network, refer to the description of Figure 1 above. The routing nodes may be connected in the mesh structure shown in Figure 2 or in other structures.
[0022] In this embodiment of the invention, the cost of cross-node communication can be determined based on the number of hops and the amount of data accessed in the access path between different processes of the software task.
[0023] In this embodiment of the invention, the task scheduling module may include a node task allocation control module; the node task allocation control module is used to determine the target node to which the software task to be allocated is assigned based on the amount of access data between multiple processes of the software task to be allocated on the on-chip interconnect network.
[0024] In other words, the node task allocation control module calculates the cross-node communication cost when assigning different processes to different candidate nodes based on the amount of data accessed between the multiple processes of the software task to be allocated and the communication distance between the candidate nodes. It then selects the allocation scheme that minimizes the cross-node communication cost as the target allocation scheme. After determining the allocation scheme, the node task allocation control module can send the target allocation scheme to the computing node that generated the task to be allocated, so that the operating system scheduling mechanism can execute the target allocation scheme.
[0025] The processor provided in this embodiment of the invention may further include an on-chip network performance scheduling module; the on-chip network performance scheduling module is used to configure the performance parameters of the virtual sub-network composed of the target nodes according to the performance requirements of the software tasks to be assigned before assigning the software tasks to be assigned to the target nodes.
[0026] In other words, the on-chip network performance scheduling module adjusts the operating status of each routing node involved in the virtual subnetwork, such as the clock frequency, according to the performance requirements of the software tasks to be assigned in terms of bandwidth and latency, so that the virtual subnetwork as a whole operates in a power consumption state that matches the performance requirements.
[0027] In this embodiment of the invention, the task scheduling module may include a node virtual reconfiguration control module. The node virtual reconfiguration control module is used to monitor the amount of access data between each pair of nodes. When the amount of access data between two nodes reaches a first threshold, the software task of at least one of the two nodes is migrated to shorten the access path between them.
[0028] In other words, the node virtual reconfiguration control module monitors the communication between nodes in real time. When it detects that the amount of access data between two nodes reaches the first threshold, it determines that the communication cost under the current allocation scheme is too high and triggers the task migration process: it selects, from idle or low-load nodes, a node that is closer to the peer node as the migration target, and migrates the software task of one of the two nodes to that migration target, thereby shortening the effective communication distance between the tasks and reducing the communication cost of subsequent cross-node accesses.
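The migration trigger described above might be sketched as follows. This is only an illustration under stated assumptions: the traffic counters are held in a dictionary keyed by node pair, distance is the Manhattan hop count on a 4×4 row-major mesh, and the task on the first node of the pair is the one moved. The names `plan_migration`, `traffic`, and `idle_nodes` are hypothetical; the patent text does not specify which of the two nodes is migrated or how the target is chosen among closer nodes.

```python
def hops(a: int, b: int, cols: int = 4) -> int:
    """Manhattan hop count between two nodes on a row-major mesh (assumed layout)."""
    row_a, col_a = divmod(a, cols)
    row_b, col_b = divmod(b, cols)
    return abs(row_a - row_b) + abs(col_a - col_b)


def plan_migration(traffic, idle_nodes, threshold, cols=4):
    """Return (node_to_vacate, migration_target), or None if nothing qualifies.

    traffic: {(node_a, node_b): access data volume observed between the pair}
    idle_nodes: nodes assumed to be idle or lightly loaded
    """
    for (node_a, node_b), volume in traffic.items():
        if volume < threshold:
            continue  # the pair has not reached the first threshold
        # Keep only idle nodes that are strictly closer to node_b than node_a is.
        closer = [n for n in idle_nodes
                  if hops(n, node_b, cols) < hops(node_a, node_b, cols)]
        if closer:
            # Assumption: prefer the idle node nearest to the peer.
            return node_a, min(closer, key=lambda n: hops(n, node_b, cols))
    return None


print(plan_migration({(0, 15): 500, (1, 2): 10}, [3, 11, 5], threshold=100))  # (0, 11)
```

In this made-up scenario the task on node 0 would move to node 11, cutting the distance to node 15 from 6 hops to 1.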
[0029] In this embodiment of the invention, the task scheduling module may include at least one of a node task allocation control module and a node virtual reconfiguration control module.
[0030] The node task allocation control module intelligently allocates different software tasks to different nodes based on the real-time requirements of the nodes in the on-chip network, minimizing the sum of path length × data volume over the nodes running the same task. Meanwhile, the on-chip network performance scheduling module dynamically adjusts the characteristics of the on-chip network according to the software's requirements, improving the network's performance utilization and reducing power consumption while meeting software performance demands. The node virtual reconfiguration control module implements a dynamic virtual reconfiguration mechanism that adjusts task allocation on nodes without software awareness, improving the performance and flexibility of the on-chip interconnect of multi-core RISC-V CPUs.
[0031] Based on the above embodiments, this embodiment of the invention further introduces the node task allocation control module.
[0032] Figure 3 is a schematic diagram of the structure of a node task allocation control module provided in an embodiment of the present invention.
[0033] As shown in Figure 3, in this embodiment of the invention, the node task allocation control module may include an inter-process access volume estimation submodule, an available node estimation submodule, and a node dynamic allocation submodule. The inter-process access volume estimation submodule is used to determine the predicted data access volume between multiple processes of the software task to be allocated. The available node estimation submodule is used to determine candidate nodes from the on-chip interconnect network based on the task execution status of the nodes in the on-chip interconnect network. The node dynamic allocation submodule is used to determine, as the target nodes, the candidate nodes that minimize the cross-node communication cost between multiple processes of the software task to be allocated.
[0034] In practical implementation, the node task allocation control module selects available nodes from the multi-node network based on the inter-process access volume of the software task to be executed, and assigns different processes of the software task to different nodes so that the sum, over all inter-process accesses of the software task, of hop count × data volume is minimized. This constructs the best local sub-network for inter-node access of the current software task from the outset and improves the overall performance of the multi-node RISC-V CPU.
[0035] In this embodiment of the invention, the cross-node communication cost of the software task to be assigned can be the sum of the cross-node communication costs of multiple processes; the cross-node communication cost of a process is the sum of the cross-node access costs of the process accessing other processes of the software task to be assigned; and the cross-node communication cost of a process accessing another process is the product of the amount of cross-node access data from the process to the other process and the number of node interval hops.
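The cost definition above can be written compactly. The symbols below are notation introduced here for illustration and do not appear in the source: $N$ is the number of processes, $D_{i,j}$ is the amount of cross-node data that process $i$ accesses from process $j$, and $H_{i,j}$ is the number of hops between the nodes hosting processes $i$ and $j$.

```latex
C_{\text{task}} = \sum_{i=0}^{N-1} C_i ,
\qquad
C_i = \sum_{\substack{j=0 \\ j \neq i}}^{N-1} D_{i,j} \, H_{i,j}
```

Note that $D_{i,j}$ and $D_{j,i}$ are counted separately, since each process may read a different amount of data from the other.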
[0036] In this embodiment of the invention, determining the predicted value of the access data volume among multiple processes of a software task to be assigned may include: determining the first data volume of each process accessing each other process in the multiple processes of the software task to be assigned based on the data volume of memory access instructions whose memory access addresses are not in the address range of the process; and using the first data volume as the predicted value of the access data volume.
[0037] Specifically, before the current task is executed, the inter-process access volume estimation submodule estimates the amount of data accessed between the multiple processes into which the operating system has divided the task (each process will be assigned to a node). The method is to count the memory access instructions (store instructions and load instructions) whose memory access addresses are not within the current process's address range, load_store_other_num, and to determine the amount of data accessed by each such instruction, load_store_bytes (the amount of data for a single memory access instruction varies with the instruction type; for example, a load instruction can load 1 to 8 bytes). A table of estimated cross-node access data for each process is built at the same time. The estimated cross-node access data for each process is calculated as load_store_bytes_0 + load_store_bytes_1 + ... (load_store_other_num terms in total), where load_store_bytes_0 is the data size of the first cross-process memory access instruction of the current process, load_store_bytes_1 is the data size of the second cross-process memory access instruction of the current process, and so on. The resulting total cross-process (cross-node) access data of the current process is denoted cross_process_data_num. At the same time, the amount of data accessed from each of the other processes is counted separately.
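The estimation described above can be sketched as a simple accumulation. This is a minimal sketch, assuming each memory access instruction is available as an (address, size in bytes) pair and each process owns one contiguous, half-open address range; the function name, the argument layout, and the sample addresses are hypothetical. Only `load_store_other_num`, `load_store_bytes`, and `cross_process_data_num` are names from the text.

```python
from collections import defaultdict


def estimate_cross_process_data(accesses, own_range, other_ranges):
    """Estimate cross-process access volume for one process.

    accesses: iterable of (address, load_store_bytes) per load/store instruction
    own_range: (start, end) address range of the current process, end exclusive
    other_ranges: {process_id: (start, end)} for the other processes
    Returns (load_store_other_num, cross_process_data_num, bytes_per_process).
    """
    load_store_other_num = 0
    cross_process_data_num = 0
    bytes_per_process = defaultdict(int)
    for address, load_store_bytes in accesses:
        if own_range[0] <= address < own_range[1]:
            continue  # access stays inside the current process's address range
        load_store_other_num += 1
        cross_process_data_num += load_store_bytes
        # Attribute the bytes to the process that owns the target address.
        for process_id, (start, end) in other_ranges.items():
            if start <= address < end:
                bytes_per_process[process_id] += load_store_bytes
                break
    return load_store_other_num, cross_process_data_num, dict(bytes_per_process)


accesses = [(0x1000, 8), (0x2004, 4), (0x2010, 8), (0x3000, 1)]
print(estimate_cross_process_data(
    accesses, (0x1000, 0x2000), {1: (0x2000, 0x3000), 2: (0x3000, 0x4000)}))
# (3, 13, {1: 12, 2: 1})
```

The per-process dictionary corresponds to one row of a table such as Table 1.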
[0038] Take a software task to be assigned that comprises 4 processes as an example. Table 1 shows the estimated cross-node access data volume for each process.
[0039] Table 1
[0040] The available node estimation submodule uses the number of processes required by the current software task, obtained from the preceding submodule, together with the current running status of each node in the multi-node RISC-V CPU, to select nodes whose currently running tasks are about to finish; the number of nodes selected is twice the number of required processes. Using Table 1 as an example, if the upcoming software task requires 4 processes, this submodule selects 8 nodes from the 16 nodes shown in Figure 2. In practical applications, the total number of nodes is not limited to 16; common totals include 32, 64, and 128. Nodes can be selected according to whether the software task currently running on a node is about to finish (e.g., its remaining execution time is less than a first preset time), so that node resources are freed up to run new software tasks. One method is to estimate, for every node, the total time required to complete its current task, sort the nodes in ascending order of that time, and select the first 8 nodes (the time estimation method is not mandatory; for example, it can be based on the number of instructions remaining in the process).
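One way to express the selection rule above is to sort nodes by estimated remaining time and keep twice as many as there are processes. This is a sketch only: the function name `select_candidates` is hypothetical, the remaining-time values are made up for illustration, and ties are broken by node ID as an arbitrary assumption.

```python
def select_candidates(remaining_time, num_processes, factor=2):
    """Pick factor * num_processes nodes whose current tasks finish soonest.

    remaining_time: {node_id: estimated time until the node's current task ends}
    """
    wanted = factor * num_processes
    # Sort by remaining time, then by node ID to make ties deterministic.
    ranked = sorted(remaining_time, key=lambda node: (remaining_time[node], node))
    return sorted(ranked[:wanted])


# Made-up remaining times for a 16-node mesh (arbitrary units).
remaining = {node: (node * 7) % 16 for node in range(16)}
print(select_candidates(remaining, num_processes=4))
# [0, 1, 3, 5, 7, 10, 12, 14]
```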
[0041] In some optional embodiments of the present invention, the candidate nodes are all single-core computing nodes. Therefore, determining candidate nodes from the on-chip interconnect network (IPN) based on the task execution status of the nodes in the IPN can include: determining nodes from the IPN whose number is greater than the number of processes required for the software task to be assigned, based on the task execution status of the nodes in the IPN. Determining the candidate node that minimizes the cross-node communication cost among the multiple processes of the software task to be assigned, as the target node, can include: determining the candidate node combination corresponding to allocating the multiple processes of the software task to be assigned one by one to the candidate nodes; calculating the cross-node communication cost corresponding to allocating the multiple processes of the software task to be assigned to the candidate node combination; and determining the candidate node in the candidate node combination with the minimum cross-node communication cost as the target node.
[0042] The node dynamic allocation submodule takes the available nodes estimated above, whose number is twice the required number of processes, and finds the most suitable nodes for the upcoming software task according to Table 1, achieving the best matching of processes to nodes. "Most suitable" means that, after the processes are allocated to the corresponding nodes, the sum of hop count × data volume over the access paths between them is minimized. To find the best matching, this submodule enumerates the possible combinations. In this example, the number of candidate node combinations is C(8,4) (i.e., the number of ways of selecting 4 elements from 8), which is 70.
[0043] For ease of illustration, assume that the 8 nodes obtained by the available node estimation submodule are node 0, node 3, node 5, node 6, node 8, node 10, node 13, and node 15.
[0044] The total data volume × hop count for all processes under a given node allocation is calculated by summing, over every access from a source node to a destination node, the amount of data accessed × the number of hops between the two nodes.
[0045] The analysis of the above 70 candidate node combinations is as follows.
[0046] Candidate node combination 0 is set as follows: process 0 is assigned to node 0, process 1 to node 3, process 2 to node 5, and process 3 to node 6. According to Table 1, the amount of data process 0 needs to access from process 1 is 20, and the hop count between node 0 and node 3 is 3; the amount of data to be accessed from process 2 is 10, and the hop count between node 0 and node 5 is 2; the amount of data to be accessed from process 3 is 20, and the hop count between node 0 and node 6 is 3. Therefore, for process 0, the total data volume × hop count = 20 × 3 + 10 × 2 + 20 × 3 = 140.
[0047] Based on this calculation, for process 1: the amount of data to be accessed from process 0 is 50, and the hop count between node 3 and node 0 is 3; the amount of data to be accessed from process 2 is 40, and the hop count between node 3 and node 5 is 3; the amount of data to be accessed from process 3 is 10, and the hop count between node 3 and node 6 is 2. Therefore, for process 1, the total amount of data × hop count = 50 × 3 + 40 × 3 + 10 × 2 = 290.
[0048] For process 2: the amount of data to be accessed from process 0 is 100, and the hop count between node 5 and node 0 is 2; the amount of data to be accessed from process 1 is 50, and the hop count between node 5 and node 3 is 3; the amount of data to be accessed from process 3 is 50, and the hop count between node 5 and node 6 is 1. Therefore, for process 2, the total amount of data × hop count = 100 × 2 + 50 × 3 + 50 × 1 = 400.
[0049] For process 3: the amount of data to be accessed from process 0 is 10, and the hop count between node 6 and node 0 is 3; the amount of data to be accessed from process 1 is 5, and the hop count between node 6 and node 3 is 2; the amount of data to be accessed from process 2 is 5, and the hop count between node 6 and node 5 is 1. Therefore, for process 3, the total amount of data × hop count = 10 × 3 + 5 × 2 + 5 × 1 = 45.
[0050] Therefore, for candidate node combination 0, the total amount of data × number of hops for all processes under the current node allocation is 140 + 290 + 400 + 45 = 875.
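The arithmetic for candidate node combination 0 can be checked with a short script. The access volumes below are the ones quoted in the worked example above; the hop count is taken as the Manhattan distance on a 4×4 row-major mesh, which is an assumption that reproduces the hop counts stated in the text. The names `ACCESS`, `hops`, `process_cost`, and `total_cost` are hypothetical.

```python
# ACCESS[i][j]: amount of data process i accesses from process j (from the example).
ACCESS = {
    0: {1: 20, 2: 10, 3: 20},
    1: {0: 50, 2: 40, 3: 10},
    2: {0: 100, 1: 50, 3: 50},
    3: {0: 10, 1: 5, 2: 5},
}


def hops(a: int, b: int, cols: int = 4) -> int:
    """Manhattan hop count between two nodes on a row-major mesh (assumed layout)."""
    row_a, col_a = divmod(a, cols)
    row_b, col_b = divmod(b, cols)
    return abs(row_a - row_b) + abs(col_a - col_b)


def process_cost(process: int, placement: dict) -> int:
    """Sum of data volume x hop count for one process's accesses to its peers."""
    return sum(volume * hops(placement[process], placement[peer])
               for peer, volume in ACCESS[process].items())


def total_cost(placement: dict) -> int:
    """Cross-node communication cost of the whole task under a placement."""
    return sum(process_cost(process, placement) for process in ACCESS)


placement = {0: 0, 1: 3, 2: 5, 3: 6}  # process -> node, candidate combination 0
print([process_cost(p, placement) for p in sorted(ACCESS)])  # [140, 290, 400, 45]
print(total_cost(placement))  # 875
```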
[0051] Repeating the above calculation for every node allocation mode yields the minimum sum of cross-node data access volume × hop count over all processes. The allocation mode that attains this minimum is the best node allocation for the processes of the current software task.
[0052] The analysis of the candidate node combinations can be carried out in parallel, so that the computationally intensive cost calculations complete simultaneously before the results are compared. Note that the preceding submodule selects twice as many nodes as there are processes so that both the sum of data volume × hop count and the completion time of the tasks currently running on each node are taken into account when choosing the node allocation.
[0053] Furthermore, calculating the cross-node communication cost corresponding to allocating multiple processes of the software task to be assigned to the candidate node combination may include: determining the allocation mode corresponding to allocating multiple processes of the software task to be assigned to the candidate nodes in the candidate node combination one by one; calculating the cross-node communication cost corresponding to each allocation mode; determining the allocation mode with the minimum cross-node communication cost as the candidate allocation mode corresponding to the candidate node combination, and using the corresponding cross-node communication cost as the cross-node communication cost corresponding to the candidate node combination.
[0054] In other words, for the same group of candidate nodes (i.e. which nodes are selected to participate in the allocation), different cross-node communication costs will be generated due to the different correspondence between processes and nodes. Therefore, it is necessary to traverse all allocation patterns under the group of candidate nodes and select the allocation pattern with the minimum cross-node communication cost as the optimal allocation scheme for the combination of candidate nodes.
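The two-level search described above (choose a group of candidate nodes, then try every process-to-node correspondence within that group) can be sketched as a brute-force enumeration. This is an illustrative sketch rather than the patented hardware implementation: it reuses the access volumes and the 8 candidate nodes from the worked example, assumes Manhattan hop counts on a 4×4 row-major mesh, and runs sequentially although the text notes the combinations can be evaluated in parallel. All function names are hypothetical.

```python
from itertools import combinations, permutations

ACCESS = {  # ACCESS[i][j]: data process i accesses from process j (worked example)
    0: {1: 20, 2: 10, 3: 20},
    1: {0: 50, 2: 40, 3: 10},
    2: {0: 100, 1: 50, 3: 50},
    3: {0: 10, 1: 5, 2: 5},
}
CANDIDATES = [0, 3, 5, 6, 8, 10, 13, 15]


def hops(a, b, cols=4):
    """Manhattan hop count on a row-major mesh (assumed layout)."""
    row_a, col_a = divmod(a, cols)
    row_b, col_b = divmod(b, cols)
    return abs(row_a - row_b) + abs(col_a - col_b)


def total_cost(placement, access):
    """Sum of data volume x hop count over all inter-process accesses."""
    return sum(volume * hops(placement[src], placement[dst])
               for src, peers in access.items()
               for dst, volume in peers.items())


def best_allocation(candidates, access):
    """Return (minimum cost, placement) over all combinations and allocation modes."""
    processes = sorted(access)
    best = None
    for combo in combinations(candidates, len(processes)):  # 70 groups for C(8,4)
        for mode in permutations(combo):  # every process-to-node correspondence
            placement = dict(zip(processes, mode))
            cost = total_cost(placement, access)
            if best is None or cost < best[0]:
                best = (cost, placement)
    return best


cost, placement = best_allocation(CANDIDATES, ACCESS)
print(cost, placement)
```

For 4 processes and 8 candidates this evaluates 70 × 24 = 1680 placements, which is small enough for exhaustive search; larger tasks would likely need a heuristic.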
[0055] In some optional embodiments of the present invention, at least one of the candidate nodes is a multi-core computing node. Determining candidate nodes from the on-chip interconnect network (IPN) based on the task execution status of the nodes in the IPN may include: determining candidate nodes from the IPN based on the task execution status of the nodes in the IPN, such that the number of candidate computing cores provided by the candidate nodes is greater than the number of processes required by the software task to be assigned. Determining the candidate node that minimizes the cross-node communication cost among the multiple processes of the software task to be assigned as the target node may include: determining the candidate computing core combination corresponding to the candidate computing cores that allocate the multiple processes of the software task to be assigned one by one; calculating the cross-node communication cost corresponding to allocating the multiple processes of the software task to be assigned to the candidate computing core combination; and determining the candidate node in the candidate computing core combination with the minimum cross-node communication cost as the target node.
[0056] In other words, when candidate nodes include multi-core computing nodes, each multi-core computing node contains multiple computing cores, which can run different processes independently, resulting in significantly lower communication latency compared to cross-node communication. When selecting candidate resources, it is necessary to refine the selection to the computing core granularity, ensuring that the total number of candidate computing cores exceeds the required number of processes, thereby providing sufficient candidate units for subsequent optimal allocation.
[0057] Furthermore, calculating the cross-node communication cost corresponding to allocating multiple processes of the software task to be assigned to the candidate computing core combination may include: determining the allocation mode corresponding to allocating the multiple processes of the software task to be assigned to the candidate computing cores in the candidate computing core combination one by one; calculating the cross-node communication cost corresponding to each allocation mode; determining the allocation mode with the minimum cross-node communication cost as the candidate allocation mode corresponding to the candidate computing core combination, and using the corresponding cross-node communication cost as the cross-node communication cost corresponding to the candidate computing core combination.
[0058] Similar to the single-core computing node scenario, at the candidate computing core granularity, it is also necessary to traverse different allocation patterns between processes and computing cores, calculate the cross-node communication cost for each pattern, and select the minimum value as the optimal allocation scheme for that candidate computing core combination. It should be noted that when two processes are allocated to different computing cores within the same node, their communication distance metric (such as hop count) can be set to 0 or a preset value much smaller than the cross-node communication cost.
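The core-granularity traversal described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it assumes a 4x4 mesh with row-major node IDs, identifies each core by a hypothetical (node_id, core_index) pair, uses hop count as the distance metric, and sets the intra-node distance to 0. The names core_distance and best_core_allocation are introduced only for this example.

```python
from itertools import permutations

MESH_WIDTH = 4  # assumption: 4x4 mesh, node IDs numbered row by row


def core_distance(core_a, core_b, intra_node_hops=0):
    """Hop count between two cores, each given as (node_id, core_index)."""
    node_a, node_b = core_a[0], core_b[0]
    if node_a == node_b:
        return intra_node_hops  # same node: 0 or a small preset value
    row_a, col_a = divmod(node_a, MESH_WIDTH)
    row_b, col_b = divmod(node_b, MESH_WIDTH)
    return abs(row_a - row_b) + abs(col_a - col_b)


def best_core_allocation(processes, cores, volume):
    """Try every one-to-one process-to-core mapping and keep the cheapest."""
    best_cost, best_map = None, None
    for chosen in permutations(cores, len(processes)):
        placement = dict(zip(processes, chosen))
        cost = sum(
            data * core_distance(placement[src], placement[dst])
            for (src, dst), data in volume.items()
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, placement
    return best_cost, best_map


processes = ["p0", "p1", "p2"]
cores = [(0, 0), (0, 1), (3, 0), (5, 0)]  # node 0 is assumed dual-core
volume = {("p0", "p1"): 100, ("p1", "p2"): 10}  # predicted access data volume
cost, placement = best_core_allocation(processes, cores, volume)
print(cost, placement)
```

In this example the two heavily communicating processes land on the two cores of node 0, so only the light p1-to-p2 traffic crosses nodes. Exhaustive enumeration is used here for clarity; it grows factorially with the number of cores.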
[0059] Based on the above embodiments, the embodiments of the present invention further describe the on-chip network performance scheduling module.
[0060] Figure 4 is a schematic diagram of the structure of an on-chip network performance scheduling module provided in an embodiment of the present invention.
[0061] In this embodiment of the invention, the on-chip network performance scheduling module takes the output of the node task allocation control module and, according to the performance requirements of the software task, dynamically adjusts the characteristics of the virtual subnetwork formed by the nodes to which the processes of the software task to be run have been allocated. The aim is to reduce the power consumption of the virtual subnetwork and improve its performance utilization while still meeting the performance requirements of the software task.
[0062] For example, if the processes of a software task are assigned to node 0, node 2, node 4, and node 5 by the node task allocation control module, then these four nodes and the on-chip network connections between them form the virtual subnetwork corresponding to the software task.
[0063] As shown in Figure 4, the on-chip network performance scheduling module may include a software task performance requirement prediction submodule and a virtual subnetwork characteristic adjustment submodule. The software task performance requirement prediction submodule determines the target execution time of a single cross-node memory access instruction within a software task to be assigned, based on the condition that the total execution time of the software task to be assigned is less than or equal to the allowed execution time of the software task to be assigned. The virtual subnetwork characteristic adjustment submodule configures the clock frequency of the routing nodes in the virtual subnetwork according to the target execution time of the single cross-node memory access instruction.
[0064] The software task performance requirement prediction submodule predicts the on-chip network performance requirements of the software task, so that the most suitable clock frequency of the routing nodes in the virtual subnetwork can be set according to the overall completion-time requirement of the software task.
[0065] In this embodiment of the invention, the total execution time is the sum of the execution time of non-memory access instructions, the execution time of non-cross-node memory access instructions, and the execution time of cross-node memory access instructions for the software task to be assigned. Specifically, the execution time of non-memory access instructions is determined based on the execution time of a single non-memory access instruction in the software task to be assigned and the number of non-memory access instructions in the software task to be assigned; the execution time of non-cross-node memory access instructions is determined based on the execution time of a single non-cross-node memory access instruction in the software task to be assigned and the number of non-cross-node memory access instructions in the software task to be assigned; and the execution time of cross-node memory access instructions is determined based on the execution time of a single cross-node memory access instruction in the software task to be assigned and the number of cross-node memory access instructions in the software task to be assigned.
[0066] In other words, the decision-making process of the software task performance requirement prediction submodule can be based on the following formula: T0 × NUM0 + T1 × NUM1 + T2 × NUM2 ≤ T_ALL, where T0 is the execution time of a single non-memory access instruction, NUM0 is the number of non-memory access instructions, T1 is the execution time of a single non-cross-node memory access instruction, NUM1 is the number of non-cross-node memory access instructions, T2 is the execution time of a single cross-node memory access instruction, NUM2 is the number of cross-node memory access instructions, and T_ALL is the total time required to complete the task.
[0067] Regarding the variables mentioned above: T0 can be obtained by averaging the execution time of non-memory access instructions in previously completed tasks; NUM0 can be obtained by analyzing the instructions in the cache; T1 can be obtained by averaging the execution time of non-cross-node memory access instructions in previously completed tasks; NUM1 can be obtained by analyzing the instructions in the cache; and NUM2 can be obtained by analyzing the instructions in the cache. Only T2 requires analysis based on the specific virtual subnetwork. Therefore, the function of this submodule is to obtain the maximum allowable value of T2 that satisfies the above formula. T_ALL is the time limit requirement issued by the software task.
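Solving the formula above for T2 gives the largest per-instruction budget that still meets the deadline. The sketch below works under stated assumptions: all times share one unit (nanoseconds in the example), the function name max_cross_node_time is hypothetical, and the numeric values are invented purely for illustration.

```python
def max_cross_node_time(t0, num0, t1, num1, num2, t_all):
    """Largest T2 satisfying T0*NUM0 + T1*NUM1 + T2*NUM2 <= T_ALL."""
    if num2 <= 0:
        raise ValueError("the task has no cross-node memory access instructions")
    # Time left for cross-node accesses after the other two instruction classes.
    budget = t_all - t0 * num0 - t1 * num1
    if budget <= 0:
        raise ValueError("the deadline cannot be met even with free cross-node accesses")
    return budget / num2


# Illustrative values only (times in nanoseconds).
t2_max = max_cross_node_time(t0=1, num0=1000, t1=4, num1=200, num2=50, t_all=3000)
print(t2_max)
```

With these inputs, 1000 ns go to non-memory instructions and 800 ns to local memory accesses, leaving 1200 ns to be shared by the 50 cross-node accesses.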
[0068] The virtual subnetwork characteristic adjustment submodule dynamically adjusts the clock frequency of the routing nodes in the virtual subnetwork according to the T2 obtained by the preceding software task performance requirement prediction submodule.
[0069] In this embodiment of the invention, configuring the clock frequency of the routing nodes in the virtual subnetwork based on the execution time of a single cross-node memory access instruction may include: configuring the clock frequency of the routing nodes in the virtual subnetwork with the objective that the sum of the routing delay time of the software task to be assigned and the execution time of a single non-cross-node memory access instruction is less than the target execution time of a single cross-node memory access instruction; wherein, the routing delay time is obtained by dividing the ratio of the total number of cross-node access hops of the software task to be assigned to the number of cross-node access instructions of the software task to be assigned by the clock frequency of the routing nodes in the virtual subnetwork.
[0070] In other words, the decision-making process of the virtual subnetwork characteristic adjustment submodule can be implemented based on the following formula: (Total number of hops for task cross-node access / Number of instructions for task cross-node access) × (1 / Clock frequency of routing node) + T1 ≤ T2. The total number of hops for task cross-node access (JUMP_NUM) can be calculated using Table 1.
[0071] The number of instructions (instruction_cross_NUM) that a task accesses across nodes can also be calculated using Table 1.
[0073] The above formula can be explained as follows: (1) Dividing the total number of hops for cross-node access by the number of instructions for cross-node access gives the average number of hops for each cross-node access instruction in the virtual subnet. (2) Multiplying the average number of hops by (1 / clock frequency of the routing node) gives the routing delay time for each cross-node access instruction in the virtual subnet. (3) Adding T1 (running time of a single non-cross-node memory access instruction) to the routing delay time gives the average running time of each cross-node access instruction.
[0074] The clock frequency of the routing node can be calculated using the above formula.
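Rearranging the formula gives a lower bound on the routing node clock frequency: frequency ≥ (JUMP_NUM / instruction_cross_NUM) / (T2 − T1). The sketch below computes that bound and then picks the lowest frequency from a list of supported levels. The discrete frequency levels, the function names, and the numeric inputs are assumptions made for illustration; the source does not state how the final frequency value is chosen beyond satisfying the formula.

```python
def min_router_frequency(jump_num, cross_num, t1, t2):
    """Lowest clock frequency (Hz) satisfying avg_hops / f + T1 <= T2."""
    if cross_num <= 0:
        raise ValueError("no cross-node access instructions")
    slack = t2 - t1  # time (seconds) left for routing per cross-node access
    if slack <= 0:
        raise ValueError("T2 must exceed T1 for any frequency to work")
    avg_hops = jump_num / cross_num  # average hops per cross-node access
    return avg_hops / slack


def pick_frequency(levels, f_min):
    """Lowest supported level meeting f_min, or None if none is fast enough."""
    feasible = [f for f in levels if f >= f_min]
    return min(feasible) if feasible else None


# Illustrative values only: 150 hops over 50 accesses, T1 = 4 ns, T2 = 24 ns.
f_min = min_router_frequency(jump_num=150, cross_num=50, t1=4e-9, t2=24e-9)
chosen = pick_frequency([100e6, 200e6, 400e6], f_min)
print(f_min, chosen)
```

Here the average access takes 3 hops and has 20 ns of slack, so the bound is about 150 MHz and the 200 MHz level is the lowest one that satisfies it. Choosing the lowest feasible level matches the stated aim of minimizing power consumption while meeting the performance requirement.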
[0075] The calculated clock frequency is then applied to each routing node in the virtual subnetwork.
[0076] In this way, the on-chip network performance scheduling module can ensure that the power consumption of the virtual sub-network is kept to a minimum while meeting the performance requirements of the software task, thereby improving the performance utilization of the virtual sub-network.
[0077] It should be noted that different software tasks correspond to different virtual subnetworks, and the clock frequencies set for them also differ. Therefore, in this embodiment of the invention, the clock frequency of the routing nodes of the multi-node RISC-V CPU can differ between time periods and, within the same time period, between the virtual subnetworks formed for different tasks. This realizes dynamic adjustment and improves the performance utilization of the on-chip network.
[0078] Based on the above embodiments, the present invention further describes the node virtual reconfiguration control module.
[0079] Figure 5 is a schematic diagram of the structure of a node virtual reconfiguration control module provided in an embodiment of the present invention.
[0080] In this embodiment of the invention, the node virtual reconfiguration control module monitors the frequency of memory access requests received on each node, as well as the frequency of memory access requests issued by each node. When the frequency of requests initiated and received between two nodes exceeds a certain threshold, the structure of the on-chip network is virtually reconfigured. Virtual reconfiguration refers to moving the process context and cached content of one node to a node physically closer to another node without the software's awareness, thereby reducing access latency for subsequent cross-node accesses between the two nodes.
[0081] For example, if real-time monitoring reveals that the frequency of mutual access between node 0 and node 3 exceeds a threshold, the context and cached content of the software task running on node 3 are swapped with the context and cached content of the process running on node 1 or node 4, thereby realizing the virtual reconfiguration of the on-chip network.
[0082] As shown in Figure 5, the node virtual reconfiguration control module may include a node access monitoring submodule and a node data adjustment submodule. The node access monitoring submodule monitors the amount of data accessed between any two nodes in the on-chip interconnect network. The node data adjustment submodule, when two nodes have access data volumes reaching a first threshold, migrates the software tasks of at least one of the nodes to shorten the corresponding inter-node access path.
[0083] In practical implementation, the node access monitoring submodule monitors the frequency of cross-node access requests received (freq_receive) by every node in the on-chip network (taking the 16 nodes shown in Figure 2 as an example). The number of receiving frequencies for each node is the total number of nodes minus 1, which is 15 in Figure 2. For example, node 0 has 15 receiving frequencies (freq_receive_1, freq_receive_2, freq_receive_3...freq_receive_15), corresponding to the cross-node access request frequencies (the number of requests within a specific time period) from the other 15 nodes. At the same time, the frequency of cross-node access requests sent (freq_send) by every node is monitored. For example, node 0 has 15 sending frequencies (freq_send_1, freq_send_2, freq_send_3...freq_send_15). Based on these values, the two nodes with the highest mutual access frequency are identified, such as node 0 and node 3. In addition, the mutual access frequency between these two nodes must exceed the configured threshold frequency (freq_cfg); otherwise, the function of this module is not activated. This ensures that the module acts only when the mutual access frequency between two nodes is very high, balancing performance against complexity.
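The monitoring step can be sketched in software as follows. The source does not specify how the sending and receiving frequencies of a pair are combined into a single mutual access frequency; this example assumes they are simply summed over one monitoring window. The function name hottest_pair and the request trace are hypothetical.

```python
from collections import Counter


def hottest_pair(requests, freq_cfg):
    """Return ((node_a, node_b), frequency) for the busiest pair, or None.

    requests: iterable of (source_node, destination_node) observed in one
    monitoring window. A pair is reported only if its mutual access
    frequency exceeds freq_cfg.
    """
    mutual = Counter()
    for src, dst in requests:
        if src != dst:  # only cross-node requests are counted
            mutual[frozenset((src, dst))] += 1  # direction-independent key
    if not mutual:
        return None
    pair, freq = max(mutual.items(), key=lambda item: item[1])
    if freq <= freq_cfg:
        return None  # below threshold: reconfiguration stays inactive
    return tuple(sorted(pair)), freq


# Illustrative trace: nodes 0 and 3 exchange 65 requests in the window.
trace = [(0, 3)] * 40 + [(3, 0)] * 25 + [(1, 2)] * 10 + [(5, 4)] * 3
print(hottest_pair(trace, freq_cfg=50))
print(hottest_pair(trace, freq_cfg=100))
```

With a threshold of 50 the pair (0, 3) is reported; with a threshold of 100 nothing is reported and no migration would be triggered.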
[0084] In this embodiment of the invention, when there are two nodes whose access data volume reaches a first threshold, migrating the software tasks of at least one of the nodes to shorten the access path between the corresponding nodes may include: determining a first node and a second node among the two nodes whose access data volume reaches the first threshold; determining a node to be replaced from the on-chip interconnect network, wherein the number of access hops between the node to be replaced and the first node is less than the number of access hops between the second node and the first node; and migrating the software tasks of the second node to the node to be replaced.
[0085] In other words, when the amount of data accessed between two nodes is large and the access path between the nodes is long, the software task data of only one node can be migrated to shorten the access path between the two nodes. For example, in the example above, node 0 is selected as the node not to be adjusted (the first node), and node 3 is selected as the node to be adjusted (the second node).
[0086] In this embodiment of the invention, migrating the software task of a node may include: determining the node to be replaced corresponding to the node; if there is a running process on the node to be replaced, writing the process context content and cached data of the node to be replaced to the first storage space to vacate the node to be replaced; migrating the process context content and cached data of the node to the vacated node to be replaced, thereby vacating the node; and migrating the process context content and cached data that originally belonged to the node to be replaced from the first storage space to the vacated node.
[0087] In other words, a node is selected from the surrounding nodes of the node that is not being adjusted (the first node) as the replacement node. The selection of the replacement node follows two principles: first, the hop count between this node and the node not being adjusted is 1, meaning they are physically adjacent; second, the process running on this node is about to terminate. After selecting the replacement node, the context and cached data of the currently running process in the replacement node are first written to the first storage space of the virtual reconfiguration control module; then, the context and cached data of the adjusted node (the second node) are written to the vacated replacement node; finally, the context and cached content of the replacement node cached in the virtual reconfiguration control module are written to the now vacated adjusted node.
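The selection and three-step swap described above can be sketched as follows. The sketch assumes a 4x4 mesh with row-major node IDs, represents each node's process context and cached data as a single opaque value, and treats "about to terminate" as "smallest estimated remaining run time". All names (neighbors, select_replacement, migrate, remaining) are hypothetical.

```python
def neighbors(node, width, height):
    """Physically adjacent nodes (hop count 1) in a row-major 2D mesh."""
    row, col = divmod(node, width)
    result = []
    if row > 0:
        result.append(node - width)
    if row < height - 1:
        result.append(node + width)
    if col > 0:
        result.append(node - 1)
    if col < width - 1:
        result.append(node + 1)
    return result


def select_replacement(first, second, remaining, width, height):
    """Pick the neighbor of `first` whose process is closest to finishing.

    remaining: estimated remaining run time per node (assumed available);
    nodes missing from the mapping are treated as idle.
    """
    candidates = [n for n in neighbors(first, width, height) if n != second]
    if not candidates:
        return None
    return min(candidates, key=lambda n: remaining.get(n, 0))


def migrate(state, second, replacement):
    """Swap the contents of `second` and `replacement` via a staging buffer."""
    staging = state[replacement]        # step 1: save to the first storage space
    state[replacement] = state[second]  # step 2: move the adjusted node's content
    state[second] = staging             # step 3: restore the saved content


state = {0: "ctx_A", 1: "ctx_D", 3: "ctx_B", 4: "ctx_C"}
remaining = {1: 500, 3: 900, 4: 20}
target = select_replacement(first=0, second=3, remaining=remaining, width=4, height=4)
migrate(state, second=3, replacement=target)
print(target, state)
```

After the swap, the content that ran on node 3 sits on node 4, one hop from node 0 instead of three.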
[0088] Meanwhile, the node IDs of the nodes to be replaced and the nodes to be adjusted remain unchanged, so the upper-layer software tasks do not need to be aware of the changes in the nodes, thus achieving seamless software migration.
[0089] In this way, this module reduces the access path between the two nodes with the highest mutual access frequency to a minimum, thereby realizing the virtual reconfiguration of the on-chip network.
[0090] The embodiments of the present invention provide a task execution control method. The method is described in detail below in conjunction with the execution flow of the task execution control method.
[0091] Figure 6 is a flowchart of a task execution control method provided in an embodiment of the present invention.
[0092] The task execution control method provided in this embodiment of the invention can be implemented based on the processor provided in any of the above embodiments. As shown in Figure 6, the task execution control method provided in this embodiment of the invention may include: S601: determining the topology information of the processor's on-chip interconnect network and the operating status of the on-chip interconnect network.
[0093] S602: Based on the topology information and operating status of the on-chip interconnect network, determine the allocation position of multiple processes of the software task in the on-chip interconnect network in order to minimize the cross-node communication cost when multiple processes are executed.
[0094] In a specific implementation, for S601, the task scheduling module deployed in the processor provided in this embodiment of the invention can be used to perceive the topology information and operating status of the on-chip interconnect network. The topology information may include the connection relationships of each node and the number of routing hops. The operating status may include node load status, link occupancy rate, and other operating status information.
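As an illustration of the topology information mentioned above, the sketch below builds a table of routing hop counts for a mesh network. It assumes a 2D mesh with row-major node IDs and dimension-ordered routing, so the hop count equals the Manhattan distance between two nodes. These assumptions and the names mesh_hops and hop_table are introduced for the example only.

```python
from itertools import product


def mesh_hops(node_a, node_b, width):
    """Hop count between two nodes of a row-major 2D mesh."""
    row_a, col_a = divmod(node_a, width)
    row_b, col_b = divmod(node_b, width)
    return abs(row_a - row_b) + abs(col_a - col_b)


def hop_table(width, height):
    """Hop count for every ordered pair of nodes in the mesh."""
    node_count = width * height
    return {
        (a, b): mesh_hops(a, b, width)
        for a, b in product(range(node_count), repeat=2)
    }


table = hop_table(4, 4)  # 16-node example
print(table[(0, 3)], table[(0, 15)])
```

A table of this kind can be precomputed once, since the physical topology does not change while tasks run; only the operating status needs periodic refresh.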
[0095] In S602, based on the topology information and operating status of the on-chip interconnect network, the allocation positions of multiple processes of the software task in the on-chip interconnect network are determined to minimize the cross-node communication cost when multiple processes are executed. This includes: in response to information about a software task to be allocated in the on-chip interconnect network, determining the target node to which the software task to be allocated is assigned based on the amount of access data between the multiple processes of the software task to be allocated, with the goal of minimizing the cross-node communication cost when multiple processes are executed.
[0096] The task execution control method provided in this invention adds a hardware module for task scheduling to the processor. This task scheduling module is communicatively connected to nodes in the processor's on-chip interconnect network (IPN). Based on the topology and operating status of the IPN, it determines the allocation positions of multiple processes of a software task within the IPN, minimizing the cross-node communication cost during execution. This solves the problem that traditional processors based on open instruction set architectures lack the ability to perceive the topology and operating status of the IPN, resulting in low task execution efficiency due to the reliance on random task allocation mechanisms and numerous cross-node remote accesses. The method effectively reduces the cross-node communication cost between multiple processes of a software task and decreases the time spent on cross-node communication, thereby improving the processor's efficiency in executing software tasks.
[0097] Based on the above embodiments, in this embodiment of the invention, the target node to which the software task to be assigned is determined according to the amount of access data between multiple processes of the software task to be assigned, with the goal of minimizing the cross-node communication cost during the execution of multiple processes. This includes: determining a predicted value of the amount of access data between multiple processes of the software task to be assigned; determining candidate nodes from the on-chip interconnect network according to the task execution status of nodes in the on-chip interconnect network; and determining the candidate node that minimizes the cross-node communication cost between multiple processes of the software task to be assigned as the target node from the candidate nodes.
[0098] In other words, the amount of data access between processes of the software task to be assigned can be estimated, and then the available candidate nodes can be selected based on the current task execution status of the nodes. Finally, the node that minimizes the cross-node communication cost can be selected from the candidate nodes as the target node.
[0099] In this embodiment of the invention, the cross-node communication cost of the software task to be assigned is the sum of the cross-node communication costs of multiple processes; the cross-node communication cost of a process is the sum of the cross-node access costs of the process accessing other processes of the software task to be assigned; and the cross-node communication cost of a process accessing another process is the product of the amount of cross-node access data from the process to the other process and the number of hops between nodes.
[0100] In other words, we can first calculate the sum of the cross-node access costs of a single process accessing each of the other processes, and then sum them to obtain the communication cost of that process. Finally, we can add up the communication costs of all processes to obtain the cross-node communication cost of the entire software task.
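The cost definition above reduces to a sum, over every ordered pair of processes, of access data volume multiplied by hop count. A minimal sketch follows, assuming a 4x4 row-major mesh where hop count is the Manhattan distance; the function names, process labels, and data volumes are illustrative.

```python
def mesh_hops(node_a, node_b, width=4):
    """Hop count between two nodes of a row-major 2D mesh (assumed topology)."""
    row_a, col_a = divmod(node_a, width)
    row_b, col_b = divmod(node_b, width)
    return abs(row_a - row_b) + abs(col_a - col_b)


def task_cost(volume, placement):
    """Cross-node communication cost of a whole software task.

    volume: {(src_process, dst_process): access data volume}
    placement: {process: node_id}
    """
    total = 0
    for (src, dst), data in volume.items():
        # Cost of one process accessing another: data volume times hops.
        total += data * mesh_hops(placement[src], placement[dst])
    return total


volume = {("p0", "p1"): 100, ("p1", "p0"): 40, ("p0", "p2"): 10}
placement = {"p0": 0, "p1": 1, "p2": 5}
print(task_cost(volume, placement))
```

In this example p0 and p1 are one hop apart and p0 and p2 are two hops apart, so the cost is 100 × 1 + 40 × 1 + 10 × 2 = 160.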
[0101] In this embodiment of the invention, determining the predicted value of the access data volume among multiple processes of a software task to be assigned may include: determining the first data volume of each process accessing each other process in the multiple processes of the software task to be assigned based on the data volume of memory access instructions whose memory access addresses are not in the address range of the process; and using the first data volume as the predicted value of the access data volume.
[0102] In other words, by analyzing memory access operations across address ranges in the instructions of each process, the amount of data accessed by that process to other processes can be estimated.
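One way to picture this estimate is sketched below. It assumes each process owns one contiguous, half-open address range and that an out-of-range access is attributed to the process whose range contains the target address. The source states only that accesses outside the process's own range are counted, so the attribution rule, the function name, and the data layout are assumptions.

```python
def predict_access_volume(ranges, accesses):
    """Estimate data volume each process accesses in every other process.

    ranges: {process: (start, end)} with half-open address ranges
    accesses: {process: [(address, size_in_bytes), ...]}
    Returns {(src_process, dst_process): bytes}.
    """
    volume = {}
    for proc, operations in accesses.items():
        own_start, own_end = ranges[proc]
        for address, size in operations:
            if own_start <= address < own_end:
                continue  # access stays inside the process's own range
            for other, (start, end) in ranges.items():
                if other != proc and start <= address < end:
                    key = (proc, other)
                    volume[key] = volume.get(key, 0) + size
                    break
    return volume


ranges = {"p0": (0x0000, 0x1000), "p1": (0x1000, 0x2000)}
accesses = {
    "p0": [(0x0010, 8), (0x1004, 8), (0x1FF8, 4)],
    "p1": [(0x0800, 16)],
}
predicted = predict_access_volume(ranges, accesses)
print(predicted)
```

Here p0 makes two accesses into p1's range (12 bytes in total) and p1 makes one access into p0's range (16 bytes); the local access at 0x0010 is ignored.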
[0103] In some optional embodiments of the present invention, the computing nodes of the candidate nodes are all single-core computing nodes. Determining candidate nodes from the on-chip interconnect network (IPN) based on the task execution status of the nodes in the IPN may include: determining nodes from the IPN whose number is greater than the number of processes required for the software task to be assigned, based on the task execution status of the nodes in the IPN. Determining the candidate node that minimizes the cross-node communication cost among the multiple processes of the software task to be assigned, as the target node, may include: determining the candidate node combination corresponding to allocating the multiple processes of the software task to be assigned one by one to the candidate nodes; calculating the cross-node communication cost corresponding to allocating the multiple processes of the software task to be assigned to the candidate node combination; and determining the candidate node in the candidate node combination with the minimum cross-node communication cost as the target node.
[0104] In other words, when all nodes are single-core, first select candidate nodes that are more numerous than the required processes, then iterate through all possible node combinations, calculate the cross-node communication cost for each combination, and select the node in the combination with the lowest cost as the target node.
[0105] Furthermore, the calculation of the cross-node communication cost corresponding to allocating multiple processes of the software task to be assigned to the candidate node combination includes: determining the allocation mode corresponding to allocating multiple processes of the software task to be assigned to the candidate nodes in the candidate node combination one by one; calculating the cross-node communication cost corresponding to each allocation mode; determining the allocation mode with the minimum cross-node communication cost as the candidate allocation mode corresponding to the candidate node combination, and using the corresponding cross-node communication cost as the cross-node communication cost corresponding to the candidate node combination.
[0106] In other words, for the same group of candidate nodes, different correspondences between processes and nodes will generate different communication costs. Therefore, it is necessary to traverse all allocation patterns under the group and select the one with the lowest cost as the optimal allocation scheme for the candidate node combination.
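The two-level traversal for single-core nodes (candidate node combinations, then allocation patterns within each combination) can be sketched as an exhaustive search. The example assumes a 4x4 row-major mesh with hop count as the distance metric; the function names and inputs are illustrative, and a practical scheduler would likely need pruning or a heuristic for larger candidate sets.

```python
from itertools import combinations, permutations


def mesh_hops(node_a, node_b, width=4):
    """Hop count between two nodes of a row-major 2D mesh (assumed topology)."""
    row_a, col_a = divmod(node_a, width)
    row_b, col_b = divmod(node_b, width)
    return abs(row_a - row_b) + abs(col_a - col_b)


def best_allocation(processes, candidates, volume):
    """Search every node combination and every pattern within it."""
    best_cost, best_map = None, None
    for combo in combinations(candidates, len(processes)):  # node combinations
        for pattern in permutations(combo):                 # allocation patterns
            placement = dict(zip(processes, pattern))
            cost = sum(
                data * mesh_hops(placement[src], placement[dst])
                for (src, dst), data in volume.items()
            )
            if best_cost is None or cost < best_cost:
                best_cost, best_map = cost, placement
    return best_cost, best_map


processes = ["p0", "p1", "p2"]
candidates = [0, 3, 4, 5]  # four idle single-core nodes for three processes
volume = {("p0", "p1"): 100, ("p1", "p2"): 10}
cost, placement = best_allocation(processes, candidates, volume)
print(cost, placement)
```

In this example p1 communicates with both other processes, so the search places it on node 4, which is adjacent to both node 0 and node 5, and node 3 is left unused.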
[0107] In some optional embodiments of the present invention, at least one of the candidate nodes is a multi-core computing node. Determining candidate nodes from the on-chip interconnect network (IPN) based on the task execution status of the nodes in the IPN may include: determining candidate nodes from the IPN based on the task execution status of the nodes in the IPN, such that the number of candidate computing cores provided by the candidate nodes is greater than the number of processes required by the software task to be assigned. Determining the candidate node that minimizes the cross-node communication cost among the multiple processes of the software task to be assigned as the target node may include: determining the candidate computing core combination corresponding to the candidate computing cores to which the multiple processes of the software task to be assigned are assigned one by one; calculating the cross-node communication cost corresponding to the allocation of the multiple processes of the software task to be assigned to the candidate computing core combination; and determining the candidate node in the candidate computing core combination with the minimum cross-node communication cost as the target node.
[0108] In other words, when there are multi-core nodes, resources are selected and allocated down to the computing core. Multiple cores within the same node can run different processes, and the communication cost between cores is much lower than that of cross-node communication.
[0109] Furthermore, calculating the cross-node communication cost corresponding to allocating multiple processes of the software task to be assigned to the candidate computing core combination may include: determining the allocation mode corresponding to allocating the multiple processes of the software task to be assigned to the candidate computing cores in the candidate computing core combination one by one; calculating the cross-node communication cost corresponding to each allocation mode; determining the allocation mode with the minimum cross-node communication cost as the candidate allocation mode corresponding to the candidate computing core combination, and using the corresponding cross-node communication cost as the cross-node communication cost corresponding to the candidate computing core combination.
[0110] In other words, at the core level, it is also necessary to traverse different allocation patterns between processes and cores, calculate the cross-node communication cost for each pattern, and set the communication distance between cores within the same node to 0 or to a preset value much smaller than the cross-node communication cost, so as to prioritize the allocation of frequently communicating processes to different cores within the same node.
[0111] Based on the above embodiments, the task execution control method provided by the present invention may further include: before assigning the software task to be assigned to the target node, configuring the performance parameters of the virtual sub-network composed of the target node according to the performance requirements of the software task to be assigned.
[0112] In other words, after the target nodes are determined and before the tasks are formally assigned, the virtual subnetwork consisting of these target nodes and the network connections between them is configured according to the performance requirements of the software task in terms of bandwidth, latency, etc., so that it can meet the performance requirements of the task and avoid unnecessary power consumption.
[0113] In this embodiment of the invention, configuring the performance parameters of the virtual subnetwork composed of target nodes according to the performance requirements of the software task to be assigned may include: determining the target execution time of a single cross-node memory access instruction in the software task to be assigned, based on the condition that the total execution time of the software task to be assigned is less than or equal to the allowed execution time of the software task to be assigned; and configuring the clock frequency of the routing nodes in the virtual subnetwork according to the execution time of the single cross-node memory access instruction.
[0114] In other words, we can first deduce the maximum allowed running time for each cross-node memory access instruction based on the overall time constraint of the task, and then set the clock frequency of the routing node with this as the target, so as to select the lowest possible frequency to reduce power consumption while meeting performance requirements.
[0115] In this embodiment of the invention, the total execution time can be the sum of the execution time of non-memory access instructions, the execution time of non-cross-node memory access instructions, and the execution time of cross-node memory access instructions for the software task to be assigned; wherein, the execution time of non-memory access instructions is determined based on the execution time of a single non-memory access instruction of the software task to be assigned and the number of non-memory access instructions in the software task to be assigned; the execution time of non-cross-node memory access instructions is determined based on the execution time of a single non-cross-node memory access instruction of the software task to be assigned and the number of non-cross-node memory access instructions in the software task to be assigned; and the execution time of cross-node memory access instructions is determined based on the execution time of a single cross-node memory access instruction of the software task to be assigned and the number of cross-node memory access instructions in the software task to be assigned.
[0116] In other words, the total execution time can be composed of the execution times of three types of instructions: non-memory access instructions, non-cross-node memory access instructions, and cross-node memory access instructions. The execution times of the first two types of instructions can be obtained through historical statistics or instruction analysis, while the execution time of cross-node memory access instructions is related to the clock frequency of the routing nodes and is a target variable that needs to be configured.
[0117] In this embodiment of the invention, configuring the clock frequency of the routing nodes in the virtual subnetwork based on the execution time of a single cross-node memory access instruction includes: configuring the clock frequency of the routing nodes in the virtual subnetwork with the objective that the sum of the routing delay time of the software task to be assigned and the execution time of a single non-cross-node memory access instruction is less than the target execution time of a single cross-node memory access instruction; wherein, the routing delay time is obtained by dividing the ratio of the total number of cross-node access hops of the software task to be assigned to the number of cross-node access instructions of the software task to be assigned by the clock frequency of the routing nodes in the virtual subnetwork.
[0118] In other words, the actual execution time of cross-node memory access instructions includes two parts: routing latency and execution time of non-cross-node memory access instructions. Routing latency equals the average hop count divided by the clock frequency of the routing node, where the average hop count is obtained by dividing the total number of hops the task accesses across nodes by the number of cross-node access instructions. By adjusting the clock frequency, the sum of these values is kept within the previously determined target execution time, thus selecting the lowest clock frequency while meeting performance requirements.
[0119] Based on the above embodiments, in this embodiment of the invention, the allocation position of multiple processes of a software task in the on-chip interconnect network is determined according to the topology information and operating status of the on-chip interconnect network, so as to minimize the cross-node communication cost when multiple processes are executed. This may include: monitoring the amount of access data between each pair of nodes in the on-chip interconnect network; when there are two nodes whose access data amount reaches a first threshold, migrating the software task of at least one of the nodes to shorten the corresponding access path between nodes.
[0120] In other words, while software tasks are executing, the amount of data accessed between each pair of nodes is monitored in real time. When the amount of data accessed between two nodes reaches the preset first threshold, communication between those two nodes is considered too frequent. The software task on one of the nodes is then migrated so that the two communicating tasks sit physically closer together, which shortens subsequent access paths and reduces communication latency.
[0121] In this embodiment of the invention, when there are two nodes whose access data volume reaches a first threshold, migrating the software tasks of at least one of the nodes to shorten the access path between the corresponding nodes may include: determining a first node and a second node among the two nodes whose access data volume reaches the first threshold; determining a node to be replaced from the on-chip interconnect network, wherein the number of access hops between the node to be replaced and the first node is less than the number of access hops between the second node and the first node; and migrating the software tasks of the second node to the node to be replaced.
[0122] In other words, one of the two nodes that frequently access each other is selected as the non-adjusted node (the first node), and the other is selected as the adjusted node (the second node). Then, a node to be replaced is selected from the physical neighbors of the non-adjusted node. The software tasks on the adjusted node are migrated to the node to be replaced, so that the two nodes are closer in physical location after adjustment (with fewer access hops), thereby reducing the latency of subsequent mutual access.
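The monitoring and selection steps above can be sketched as follows. This is an illustration under stated assumptions rather than the method of the embodiment: nodes are identified by (x, y) coordinates on a mesh, the hop count is taken to be the Manhattan distance, the first element of a hot pair is treated as the non-adjusted node, and the closest qualifying node is preferred. All function names are hypothetical.

```python
def hops(a, b):
    """Access hops between two mesh coordinates (Manhattan distance, an assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def find_hot_pair(traffic, threshold):
    """Return the pair with the largest access volume that reaches the threshold, or None."""
    hot = {pair: volume for pair, volume in traffic.items() if volume >= threshold}
    if not hot:
        return None
    return max(hot, key=hot.get)


def pick_replacement(first, second, nodes):
    """Pick a node that is strictly closer to `first` than `second` is, preferring the closest."""
    current = hops(first, second)
    closer = [n for n in nodes
              if n not in (first, second) and hops(first, n) < current]
    if not closer:
        return None
    return min(closer, key=lambda n: hops(first, n))


# Hypothetical 3x3 mesh with one heavily communicating pair.
nodes = [(x, y) for x in range(3) for y in range(3)]
traffic = {((0, 0), (2, 2)): 500, ((0, 1), (1, 1)): 40}

first, second = find_hot_pair(traffic, threshold=100)
replacement = pick_replacement(first, second, nodes)
print(first, second, replacement)  # prints (0, 0) (2, 2) (0, 1)
```

The task on `second` would then be migrated to `replacement`, reducing the distance to `first` from 4 hops to 1.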
[0123] In this embodiment of the invention, migrating the software task of a node may include: determining the node to be replaced corresponding to the node; if there is a running process on the node to be replaced, writing the process context content and cached data of the node to be replaced to a first storage space to vacate the node to be replaced; migrating the process context content and cached data of the node to the vacated node to be replaced, thereby vacating the node; and migrating the process context content and cached data that originally belonged to the node to be replaced from the first storage space to the vacated node.
[0124] In other words, the migration process can include: first, temporarily storing the original process context and cached data of the node to be replaced in external storage space, making the node to be replaced idle; then, migrating the process context and cached data of the node to be migrated (the adjustment node) to the vacated node to be replaced, making the original adjustment node idle; finally, migrating the process context and cached data of the original node to be replaced from the temporarily stored external storage space to the newly vacated adjustment node. In this way, the exchange of software tasks between the two nodes is achieved, and since the node ID remains unchanged, the upper-layer software tasks do not need to be aware of the changes in the underlying nodes.
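The three-step exchange above can be sketched as follows. This is a simplified model, not the hardware procedure: each node's process context and cached data are represented by one Python object, the first storage space is a local variable, and the names `swap_tasks`, `adjusted_id`, and `replacement_id` are hypothetical.

```python
def swap_tasks(nodes, adjusted_id, replacement_id):
    """Move the task state of the adjusted node onto the node to be replaced.

    `nodes` maps a fixed node ID to its task state (process context and
    cached data), or to None when the node is idle.
    """
    staging = None  # stands in for the first storage space

    # Step 1: if the node to be replaced is busy, spill its state to vacate it.
    if nodes[replacement_id] is not None:
        staging = nodes[replacement_id]
        nodes[replacement_id] = None

    # Step 2: move the adjusted node's state to the vacated node to be replaced.
    nodes[replacement_id] = nodes[adjusted_id]
    nodes[adjusted_id] = None

    # Step 3: restore the spilled state onto the now idle adjusted node.
    if staging is not None:
        nodes[adjusted_id] = staging
    return nodes


# Hypothetical state: n2 runs the task to be moved, n1 is the busy node to be replaced.
nodes = {
    "n1": {"context": "task C", "cache": [7, 8]},
    "n2": {"context": "task B", "cache": [1, 2]},
}
swap_tasks(nodes, adjusted_id="n2", replacement_id="n1")
print(nodes["n1"]["context"], nodes["n2"]["context"])  # prints task B task C
```

When the node to be replaced is idle, step 1 and step 3 are skipped and the exchange reduces to a plain move.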
[0125] From the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by software together with a necessary general-purpose hardware platform. They can of course also be implemented in hardware, but in many cases the former is the better implementation. Since the task execution control method provided in the embodiments of the present invention corresponds to the processor provided in the embodiments of the present invention, for parts not described in the method embodiments, refer to the description of the processor embodiments.
[0126] Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above-described task execution control method embodiments when it is run.
[0127] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.
[0128] Embodiments of the present invention also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above-described task execution control method embodiments.
[0129] Embodiments of the present invention also provide another computer program product, including a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above-described task execution control method embodiments.
[0130] Any of the components, modules, units, parts, methods, and operations described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or any combination thereof. Alternatively or additionally, any functionality described herein can be performed at least in part by one or more hardware logic components, such as, but not limited to, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system-on-a-chip (SoC), a complex programmable logic device (CPLD), a microcontroller unit (MCU), etc. The terms "system," "computing device," or "apparatus" as used herein encompass various means, devices, and machines for processing data, including, for example, one or more programmable processors, computers, SoCs, or combinations thereof. The apparatus may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or one or more combinations thereof. The aforementioned computer program (also known as a program, software, software application, app, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for a computing environment.
[0131] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0132] The processor and task execution control method provided by the present invention have been described in detail above. Specific examples have been used to illustrate the principles and implementation of the invention. The descriptions of these embodiments are intended only to aid in understanding the method and core ideas of the present invention. It should be noted that those skilled in the art can make various improvements and modifications to the present invention without departing from its principles, and these improvements and modifications also fall within the protection scope of the present invention.
Claims
1. A processor, characterized in that it comprises: an on-chip interconnect network and a task scheduling module; the on-chip interconnect network includes multiple nodes, each node having a computing node and a routing node connected to the computing node, and the multiple routing nodes are interconnected; the task scheduling module is communicatively connected to the nodes and is used to determine the allocation positions of multiple processes of a software task in the on-chip interconnect network based on the topology information and the operating status of the on-chip interconnect network, so as to minimize the cross-node communication cost when the multiple processes are executed.
2. The processor according to claim 1, characterized in that the task scheduling module includes at least a node task allocation control module; the node task allocation control module is used to determine the target node to which a software task to be assigned is allocated, based on the amount of access data between multiple processes of the software task to be assigned in the on-chip interconnect network.
3. The processor according to claim 2, characterized in that it further includes an on-chip network performance scheduling module; the on-chip network performance scheduling module is used to configure, before the software task to be assigned is assigned to the target nodes, the performance parameters of the virtual sub-network formed by the target nodes according to the performance requirements of the software task to be assigned.
4. The processor according to claim 1, characterized in that the task scheduling module includes at least a node virtual reconstruction control module; the node virtual reconstruction control module is used to monitor the amount of access data between each pair of nodes and, when there are two nodes whose access data amount reaches a first threshold, to migrate the software task of at least one of the two nodes so as to shorten the access path between the corresponding nodes.
5. A task execution control method, characterized in that it is applied to the task scheduling module in the processor according to any one of claims 1 to 4, and comprises: determining the topology information of the on-chip interconnect network of the processor and the operating status of the on-chip interconnect network; and determining, based on the topology information and operating status of the on-chip interconnect network, the allocation positions of multiple processes of a software task in the on-chip interconnect network, so as to minimize the cross-node communication cost when the multiple processes are executed.
6. The task execution control method according to claim 5, characterized in that determining, based on the topology information and operating status of the on-chip interconnect network, the allocation positions of multiple processes of the software task in the on-chip interconnect network so as to minimize the cross-node communication cost when the multiple processes are executed includes: in response to information about a software task to be assigned on the on-chip interconnect network, determining the target node to which the software task to be assigned is allocated, based on the amount of access data between multiple processes of the software task to be assigned and with the goal of minimizing the cross-node communication cost when the multiple processes are executed.
7. The task execution control method according to claim 6, characterized in that determining the target node to which the software task to be assigned is allocated, based on the amount of access data between multiple processes of the software task to be assigned and with the goal of minimizing the cross-node communication cost when the multiple processes are executed, includes: determining a predicted value of the amount of access data between the multiple processes of the software task to be assigned; determining candidate nodes from the on-chip interconnect network based on the task execution status of the nodes in the on-chip interconnect network; and determining, from the candidate nodes, the candidate node that minimizes the cross-node communication cost between the multiple processes of the software task to be assigned as the target node.
8. The task execution control method according to claim 7, characterized in that the cross-node communication cost of the software task to be assigned is the sum of the cross-node communication costs of its multiple processes; the cross-node communication cost corresponding to a process is the sum of the cross-node access costs of the process accessing the other processes of the software task to be assigned; and the cross-node access cost of a process accessing another process is the product of the amount of cross-node access data from the process to the other process and the number of hops between nodes.
9. The task execution control method according to claim 7, characterized in that determining the predicted value of the amount of access data between the multiple processes of the software task to be assigned includes: determining, based on the amount of data in the memory access instructions whose memory access addresses are not within the address range of the process in which the software task to be assigned is located, a first data volume that each of the multiple processes of the software task to be assigned accesses from each other process; and using the first data volume as the predicted value of the amount of access data.
10. The task execution control method according to claim 7, characterized in that the computing nodes of the candidate nodes are all single-core computing nodes; determining candidate nodes from the on-chip interconnect network based on the task execution status of the nodes in the on-chip interconnect network includes: determining, based on the task execution status of the nodes in the on-chip interconnect network, a number of nodes in the on-chip interconnect network greater than the number of processes required by the software task to be assigned as the candidate nodes; determining, from the candidate nodes, the candidate node that minimizes the cross-node communication cost between the multiple processes of the software task to be assigned as the target node includes: determining candidate node combinations, drawn from the candidate nodes, to which the multiple processes of the software task to be assigned are assigned one to one; calculating the cross-node communication cost corresponding to allocating the multiple processes of the software task to be assigned to each candidate node combination; and determining the candidate nodes in the candidate node combination with the lowest corresponding cross-node communication cost as the target nodes.
11. The task execution control method according to claim 10, characterized in that calculating the cross-node communication cost corresponding to allocating the multiple processes of the software task to be assigned to the candidate node combination includes: determining the allocation modes for assigning the multiple processes of the software task to be assigned to the candidate nodes in the candidate node combination; calculating the cross-node communication cost corresponding to each of the allocation modes; and determining the allocation mode with the lowest corresponding cross-node communication cost as the candidate allocation mode corresponding to the candidate node combination, and taking its cross-node communication cost as the cross-node communication cost corresponding to the candidate node combination.
12. The task execution control method according to claim 7, characterized in that at least one of the candidate nodes has a multi-core computing node; determining candidate nodes from the on-chip interconnect network based on the task execution status of the nodes in the on-chip interconnect network includes: determining candidate nodes from the on-chip interconnect network, based on the task execution status of the nodes in the on-chip interconnect network, such that the number of candidate computing cores that the candidate nodes can provide is greater than the number of processes required by the software task to be assigned; determining, from the candidate nodes, the candidate node that minimizes the cross-node communication cost between the multiple processes of the software task to be assigned as the target node includes: determining candidate computing core combinations, drawn from the candidate computing cores, to which the multiple processes of the software task to be assigned are assigned one to one; calculating the cross-node communication cost corresponding to allocating the multiple processes of the software task to be assigned to each candidate computing core combination; and determining the candidate nodes in the candidate computing core combination with the lowest corresponding cross-node communication cost as the target nodes.
13. The task execution control method according to claim 12, characterized in that calculating the cross-node communication cost corresponding to allocating the multiple processes of the software task to be assigned to the candidate computing core combination includes: determining the allocation modes for assigning the multiple processes of the software task to be assigned to the candidate computing cores in the candidate computing core combination; calculating the cross-node communication cost corresponding to each of the allocation modes; and determining the allocation mode with the lowest corresponding cross-node communication cost as the candidate allocation mode corresponding to the candidate computing core combination, and taking its cross-node communication cost as the cross-node communication cost corresponding to the candidate computing core combination.
14. The task execution control method according to claim 6, characterized in that it further includes: before assigning the software task to be assigned to the target node, configuring the performance parameters of the virtual sub-network formed by the target node according to the performance requirements of the software task to be assigned.
15. The task execution control method according to claim 14, characterized in that configuring the performance parameters of the virtual sub-network formed by the target node according to the performance requirements of the software task to be assigned includes: determining the target execution time of a single cross-node memory access instruction in the software task to be assigned, based on the condition that the total execution time of the software task to be assigned is less than or equal to the allowed execution time of the software task to be assigned; and configuring the clock frequency of the routing nodes in the virtual sub-network based on the execution time of the single cross-node memory access instruction.
16. The task execution control method according to claim 15, characterized in that the total execution time is the sum of the execution times of the non-memory access instructions, the non-cross-node memory access instructions, and the cross-node memory access instructions of the software task to be assigned; the execution time of the non-memory access instructions is determined based on the execution time of a single non-memory access instruction of the software task to be assigned and the number of non-memory access instructions of the software task to be assigned; the execution time of the non-cross-node memory access instructions is determined based on the execution time of a single non-cross-node memory access instruction of the software task to be assigned and the number of non-cross-node memory access instructions of the software task to be assigned; and the execution time of the cross-node memory access instructions is determined based on the execution time of a single cross-node memory access instruction of the software task to be assigned and the number of cross-node memory access instructions of the software task to be assigned.
17. The task execution control method according to claim 16, characterized in that configuring the clock frequency of the routing nodes in the virtual sub-network according to the execution time of the single cross-node memory access instruction includes: configuring the clock frequency of the routing nodes in the virtual sub-network so that the sum of the routing delay time of the software task to be assigned and the execution time of the single non-cross-node memory access instruction is less than the target execution time of the single cross-node memory access instruction; wherein the routing delay time is obtained by dividing the ratio of the total number of cross-node accesses of the software task to be assigned to the number of instructions of the software task to be assigned by the clock frequency of the routing nodes in the virtual sub-network.
18. The task execution control method according to claim 5, characterized in that determining, based on the topology information and operating status of the on-chip interconnect network, the allocation positions of multiple processes of the software task in the on-chip interconnect network so as to minimize the cross-node communication cost when the multiple processes are executed includes: monitoring the amount of access data between each pair of nodes in the on-chip interconnect network; and when there are two nodes whose access data amount reaches a first threshold, migrating the software task of at least one of the two nodes to shorten the access path between the corresponding nodes.
19. The task execution control method according to claim 18, characterized in that, when there are two nodes whose access data amount reaches a first threshold, migrating the software task of at least one of the two nodes to shorten the access path between the corresponding nodes includes: determining a first node and a second node among the two nodes whose access data amount reaches the first threshold; determining a node to be replaced from the on-chip interconnect network, wherein the number of access hops between the node to be replaced and the first node is less than the number of access hops between the second node and the first node; and migrating the software task of the second node to the node to be replaced.
20. The task execution control method according to claim 18, characterized in that migrating the software task of the node includes: determining the node to be replaced corresponding to the node; if the node to be replaced has a running process, writing the process context content and cached data of the node to be replaced into a first storage space to vacate the node to be replaced; migrating the process context content and cached data of the node to the vacated node to be replaced, thereby vacating the node; and migrating the process context content and cached data that originally belonged to the node to be replaced from the first storage space to the vacated node.