Computing unit, instruction execution method, processor, device, medium and program

By introducing a connection network within the computing unit to connect multiple computing modules, instruction fusion and parallel execution are achieved, solving the problem of low instruction scheduling efficiency in existing technologies and improving computing performance and applicability.

CN122308919A · Pending · Publication Date: 2026-06-30 · KUNLUNXIN TECHNOLOGY (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
KUNLUNXIN TECHNOLOGY (BEIJING) CO LTD
Filing Date
2026-03-30
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, instruction scheduling in the computing units used for LLM workloads is inefficient, especially in multi-instruction scenarios. Frequent register access and limited instruction issue bandwidth make the CUDA cores a bottleneck in computing power and make it difficult to execute multiple instructions in parallel.

Method used

By introducing a connection network within the computing unit to connect multiple computing modules, instruction fusion and parallel execution are achieved, the target computing module is selected and the data flow is determined, and the instruction scheduling process is optimized.

Benefits of technology

It improves instruction scheduling performance within the computing unit, enhances overall computing performance, reduces hardware complexity and power consumption, and strengthens the applicability and scalability of the computing unit.

✦ Generated by Eureka AI based on patent content.


Abstract

This disclosure provides a computing unit, instruction execution method, processor, device, medium, and program, relating to the field of computer technology, specifically information processing, deep learning, artificial intelligence, and chip technology. The computing unit is integrated within a processor and includes multiple computing modules connected via a network. The computing unit is used to: receive control signals and vector data generated by decoding a target instruction; select a target computing module from among the computing modules based on the control signals, and determine the data flow direction of the vector data in the target computing module based on the network connection; and sequentially execute the fused instruction calculation operation of the target instruction through each target computing module according to the control signals and the data flow direction of the vector data. Embodiments of this disclosure can improve the performance of instruction scheduling within the computing unit, thereby improving the overall computing performance of the computing unit.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, specifically to information processing, deep learning, artificial intelligence, and chip technology.

Background Technology

[0002] As LLMs (Large Language Models) and other artificial intelligence technologies begin to demonstrate their capabilities in the multimodal field, LLMs are being widely applied in information processing scenarios such as healthcare, education, and intelligent communication, both to solve related tasks and to make business processes faster and more intelligent. For all types of artificial intelligence models, processor computing performance is crucial and directly affects the performance of LLMs.

Summary of the Invention

[0003] This disclosure provides a computing unit, instruction execution method, processor, device, medium, and program that can improve the performance of instruction scheduling within the computing unit, thereby improving the overall computing performance of the computing unit.

[0004] In a first aspect, embodiments of this disclosure provide a computing unit integrated in a processor, comprising multiple computing modules connected via a connection network, wherein the computing unit is used to: receive control signals and vector data generated by decoding a target instruction; select target computing modules from the computing modules according to the control signals, and determine the data flow direction of the vector data in the target computing modules according to the connection network; and, according to the control signals and the data flow direction of the vector data in the target computing modules, sequentially execute the fused instruction calculation operation of the target instruction through each of the target computing modules.

[0005] In a second aspect, embodiments of this disclosure provide an instruction execution method applied to the ALU within a processor, comprising: receiving control signals and vector data generated by decoding a target instruction; selecting target computing modules from the computing modules according to the control signals, and determining the data flow direction of the vector data in the target computing modules according to the connection network; and, according to the control signals and the data flow direction of the vector data in the target computing modules, sequentially executing the fused instruction calculation operation of the target instruction through each of the target computing modules.

[0006] In a third aspect, embodiments of this disclosure provide a processor including at least one computing unit as described in the first aspect.

[0007] In a fourth aspect, embodiments of this disclosure provide an electronic device, including at least one processor as described in the third aspect, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor which, when executed by the at least one processor, enable the at least one processor to perform the instruction execution method provided in the second aspect.

[0008] In a fifth aspect, embodiments of this disclosure also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the instruction execution method provided in the second aspect.

[0009] In a sixth aspect, embodiments of this disclosure also provide a computer program product, including a computer program that, when executed by a processor, implements the instruction execution method provided in the second aspect embodiment.

[0010] Embodiments of this disclosure construct a computing unit from multiple computing modules connected by a connection network. The computing unit receives control signals and vector data generated by decoding a target instruction. Based on the control signals, it selects target computing modules from among the computing modules, and based on the connection network, it determines the data flow direction of the vector data within the target computing modules. The computing unit can then sequentially execute the fused instruction calculation operation of the target instruction through each target computing module according to the control signals and the data flow direction of the vector data. The computing unit provided by the embodiments of this disclosure therefore integrates multiple computing modules and consolidates hardware computing resources, which improves the performance of instruction scheduling within the computing unit and thus the overall computing performance of the computing unit.

[0011] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Attached Figure Description

[0012] The accompanying drawings are provided for a better understanding of this solution and do not constitute a limitation of this disclosure. In the drawings:
Figure 1 is a schematic diagram of a typical pipeline design for a GPGPU architecture in existing technology;
Figure 2 is a schematic diagram of typical computation instructions and their access relationships to the register file in existing technology;
Figure 3 is a schematic diagram of the effect of executing the instructions in parallel under ideal conditions, provided in an embodiment of this disclosure;
Figure 4 is a schematic diagram of the structure of a computing unit provided in an embodiment of this disclosure;
Figure 5 is a schematic diagram of another computing unit provided in an embodiment of this disclosure;
Figure 6 is a schematic diagram of the effect of reconfigurable fusion of computing units in a softmax algorithm implementation, provided in an embodiment of this disclosure;
Figure 7 is a schematic diagram of another computing unit provided in an embodiment of this disclosure;
Figure 8 is a schematic diagram of the structure of a fusion computing unit provided in an embodiment of this disclosure;
Figure 9 is a schematic diagram of the execution of target instructions based on a fusion computing unit, provided in an embodiment of this disclosure;
Figure 10 is a schematic diagram of the execution of target instructions based on a fusion computing unit, provided in an embodiment of this disclosure;
Figure 11 is a schematic diagram of the execution of target instructions based on a fusion computing unit, provided in an embodiment of this disclosure;
Figure 12 is a schematic diagram of the execution of target instructions based on a computing unit, provided in an embodiment of this disclosure;
Figure 13 is a flowchart of an instruction execution method provided in an embodiment of this disclosure;
Figure 14 is a structural diagram of a processor provided in an embodiment of this disclosure;
Figure 15 is a schematic diagram of the structure of an electronic device used to implement the instruction execution method of the embodiments of this disclosure.

Detailed Implementation

[0013] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0014] LLM relies on massive computing power, so improving its computational speed is crucial. Most popular LLM models are based on transformer architectures. While matrix operations in these models consume a large portion of the computational power, the introduction of tensor cores (dedicated processing cores designed to accelerate matrix operations in artificial intelligence, deep learning, and high-performance computing) in GPUs (Graphics Processing Units) often makes CUDA (Compute Unified Device Architecture) cores the bottleneck in chip computation.

[0015] Figure 1 is a schematic diagram of a typical pipeline design for a GPGPU (General-Purpose Graphics Processing Unit) architecture in existing technology. As shown in Figure 1, each warp (thread bundle) in the GPGPU architecture goes through instruction fetch, decode, issue, execute, and write-back in a pipelined manner, similar to a classic CPU architecture. The whole process, however, runs in a SIMT (Single Instruction, Multiple Threads, a parallel computing architecture) manner, and hazards such as RAW (Read After Write) are hidden by switching between different warps. The ALU (Arithmetic Logic Unit), as the actual computing unit of the processor, often includes many execution units, such as the VINTU (vector integer computation unit), VFPU (vector floating-point unit), and SFU (special function computation unit). These vector units reach full utilization of their computing power only when they run in parallel as much as possible, which has long been the direction of ALU performance optimization.

[0016] In the field of deep learning today, computation is complex, and CUDA cores may need to perform many kinds of data operations, including addition, multiplication, exponential function (exp) operations, and comparisons. Different instructions call different computing resources. In existing technologies, the computing unit (ALU) executes instructions inefficiently in scenarios with many computation instructions, owing to instruction scheduling, frequent register access, limited register file access bandwidth, and limited instruction issue bandwidth. It is also difficult to overlap ALU operations with load / store operations. This performance degradation is particularly pronounced in the typical scenario of softmax (the normalized exponential function), where the CUDA cores become the bottleneck in computing power.

[0017] Taking a softmax implementation as an example, part of the algorithm is mapped to computation instructions in the CUDA core. Each thread needs to complete the load, sub (subtraction), exp, add (addition), and store steps. The implementation also intersperses some address calculations and branch instructions, but these can be ignored here.
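The per-thread sequence above can be sketched in plain Python as follows. This is only an illustrative model, not the implementation described in this disclosure: the function name `softmax_unfused`, the subtraction of the row maximum, and the final division are assumptions taken from the standard numerically stable softmax, and a Python list stands in for the register file.

```python
import math

def softmax_unfused(row):
    """Softmax written as the separate steps load, sub, exp, add, store."""
    row_max = max(row)                   # assumed to be computed beforehand
    regs = list(row)                     # load: memory -> register stand-in
    regs = [x - row_max for x in regs]   # sub: one instruction, one register round trip
    regs = [math.exp(x) for x in regs]   # exp: another round trip
    total = 0.0
    for x in regs:                       # add: accumulate the denominator
        total += x
    return [x / total for x in regs]     # normalize (assumed) and store

print(softmax_unfused([1.0, 2.0, 3.0]))
```

Each commented step corresponds to one issued instruction whose result goes back to the register stand-in before the next step reads it.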

[0018] Figure 2 is a schematic diagram of the access relationship between typical computation instructions and the register file in existing technology. As shown in Figure 2, executing the sub, exp, and add instructions in parallel would require very high access bandwidth to the VRF (Vector Register File). Because VRF bandwidth is limited, the current architecture cannot execute these instructions in parallel.

[0019] Figure 3 is a schematic diagram of the effect of executing the instructions in parallel under ideal conditions, provided in an embodiment of this disclosure. As shown in Figure 3, the five instructions load, fma (fused multiply-accumulate), exp, add, and store must execute in parallel to avoid a performance loss. The instruction issue module would therefore need to issue all five instructions simultaneously. Given multiple warps, hardware complexity, and area constraints, this issue bandwidth is virtually impossible to achieve in hardware. In addition, because instructions are scheduled across multiple warps, scheduling these five instructions simultaneously across different warps is also difficult. Parallel execution of multiple instructions is therefore hard to achieve in the existing ALU architecture.

[0020] In one example, Figure 4 is a schematic diagram of the structure of a computing unit provided in an embodiment of this disclosure. This computing unit can be integrated into a processor. As shown in Figure 4, the computing unit includes multiple computing modules 410, which are connected to one another through a connection network 420.

[0021] The computing unit can be used to: receive control signals and vector data generated by decoding the target instruction; select a target computing module from each computing module 410 according to the control signal, and determine the data flow direction of the vector data in the target computing module according to the connection network 420; and execute the fusion instruction calculation operation of the target instruction sequentially through each target computing module according to the control signal and the data flow direction of the vector data in the target computing module.

[0022] The calculation module 410 can be a module within the calculation unit that provides calculation functions. Examples include, but are not limited to, conversion modules, bypass modules, multiplexing modules, addition modules, subtraction modules, multiplication modules, fused multiplication-addition modules, branch judgment modules, logical operation modules, and exponential operation modules, as long as they can provide corresponding data processing and / or calculation functions. This disclosure does not limit the specific type of the calculation module. The connection network 420 can be used to connect the various calculation modules 410 within and between calculation units. The target instruction can be the instruction that the calculation unit currently needs to process. Optionally, the target instruction can be a single type of instruction, or it can include instructions with multiple different functional operation requirements. This disclosure does not limit the instruction type and function of the target instruction. The target calculation module can be selected from the various calculation modules of the calculation unit based on the decoding result of the target instruction, and can currently be used to process some or all of the target instructions. Fusion instruction calculation operation refers to merging multiple independent instructions or calculation steps into a more efficient single calculation operation.
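The select-then-route behaviour described in the two paragraphs above can be modelled in software as a small table of modules plus a table of permitted links. The sketch below is hypothetical throughout: the module names, the `CONNECTIONS` table, and the `(module, constant)` encoding of the control signals are all assumptions made for illustration, since the disclosure does not fix any of them.

```python
import math

# Hypothetical computing modules; each takes a lane value and an optional constant.
MODULES = {
    "sub": lambda x, c: x - c,
    "mul": lambda x, c: x * c,
    "exp": lambda x, c: math.exp(x),
    "bypass": lambda x, c: x,
}

# Hypothetical connection network: the outputs of a key may feed the listed modules.
CONNECTIONS = {
    "sub": {"exp", "mul", "bypass"},
    "mul": {"exp", "bypass"},
    "exp": {"mul", "bypass"},
    "bypass": set(),
}

def execute_fused(control, vector):
    """Run one fused operation.

    control: list of (module_name, constant) pairs in data-flow order,
             standing in for the decoded control signals.
    vector:  list of floats standing in for the vector data.
    """
    names = [name for name, _ in control]
    # The data flow may only use links that exist in the connection network.
    for src, dst in zip(names, names[1:]):
        if dst not in CONNECTIONS[src]:
            raise ValueError(f"no connection from {src} to {dst}")
    data = list(vector)
    for name, const in control:          # selected modules run in sequence
        op = MODULES[name]
        data = [op(x, const) for x in data]
    return data

# One "instruction" that subtracts 1.0 and then exponentiates every lane.
print(execute_fused([("sub", 1.0), ("exp", None)], [1.0, 2.0]))
```

Changing only the `control` list selects a different set of modules and a different route through the same network, which is the sense in which one unit serves several fused operations.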

[0023] For example, as shown in Figure 4, a computing unit composed of computing modules 410 connected via the connection network 420 can complete the fusion instruction calculation operation on its own. It should be noted that in Figure 4, computing module 1, computing module n, and computing module m merely represent the individual computing modules within a computing unit. That is, computing module 1, computing module n, and computing module m can be completely different computing modules, or they can be partially or completely identical. Furthermore, the connection network 420 in Figure 4 only schematically illustrates how the computing modules are connected; the actual connections can be configured as needed. For example, computing module 1 may have 3 outputs, computing module n may have 2 inputs, and computing module 1 and computing module m may be connected in parallel. That is, the embodiments of this disclosure do not limit the number of computing modules included in the computing unit, the type of computing modules, or the specific connection method between the computing modules.

[0024] In one example, Figure 5 is a schematic diagram of another computing unit provided in an embodiment of this disclosure. As shown in Figure 5, different computing units can also be connected via the connection network 420 to jointly complete instruction fusion and computation operations. It should be noted that in Figure 5, computing module 1, computing module n, and computing module m merely represent the computing modules within computing unit 1, and computing module n1, computing module nn, and computing module nm merely represent the computing modules within computing unit n. That is, the computing modules included in each computing unit can be distinct, partially identical, or entirely identical. Furthermore, the connection network 420 in Figure 5 only schematically illustrates how the computing modules and computing units are connected. The connections between computing modules within each computing unit can be configured as needed. For example, computing module 1 within a computing unit may have 3 outputs, computing module n may have 2 inputs, and computing module 1 and computing module m may be connected in parallel. The computing modules included in different computing units may be partially the same or completely different. Similarly, the connection networks configured for the internal computing modules of each computing unit may be the same, partially the same, or different. When at least two computing units are used to process the target instruction, the connections between different computing units can also be configured as needed. For example, computing unit 1 and computing unit 2 may be connected through one or more computing modules, and a computing unit may be connected to one or more other computing units. That is, when multiple computing units are used to process the target instruction, the embodiments of this disclosure do not limit the number of computing modules included in each computing unit, the type of computing modules, or the specific connection methods between computing modules, nor do they limit the connection methods between computing units.

[0025] Correspondingly, after receiving the target instruction, the processor can perform decoding processing through the instruction decoder within the controller. Specifically, the instruction decoder analyzes and decodes the opcode of the target instruction to determine the specific operation type of the target instruction, and then generates the corresponding control signals and vector data to coordinate components such as the ALU and registers to complete the instruction execution. The controller can send the control signals and vector data generated by the target instruction decoding process to the corresponding computing units. The control signals for the target instruction can select the ALU and vector data required by each computing unit, and determine the combination of ALUs selected by each computing unit. The vector data is the computational data required for the target instruction's operation.

[0026] Optionally, if there is only one computing unit, this unit can directly select the required computing module as the target computing module from among the computing modules according to the control signal, and determine the specific data flow direction of the vector data corresponding to the target instruction in each target computing module according to the connection network between the computing modules. Furthermore, the computing unit can control the computing order of each target computing module in the entire computing process and the data flow direction of the vector data in each target computing module according to the control signal, thereby realizing the sequential execution of the fused instruction computing operation of the target instruction through each target computing module, completing the single-instruction multi-functional computing process of the target instruction.

[0027] Optionally, if there are multiple computing units, each computing unit can determine its processing order for relevant vector data based on control signals. Accordingly, after determining the processing order, each computing unit (which may have one or more units) can select a desired computing module from its internal modules as the target computing module based on control signals, and determine the specific data flow of the vector data corresponding to the target instruction within each target computing module based on the connection network between the modules. If there are multiple computing units currently performing the calculation, i.e., multiple computing units are performing parallel calculations, each computing unit can execute in parallel. Furthermore, the computing unit currently performing the calculation can control the calculation order of each target computing module within it and the data flow of vector data within each target computing module based on control signals, thereby enabling the sequential execution of partial fusion instruction calculation operations of the target instruction through each internal target computing module until all computing units jointly complete the single-instruction, multi-functional calculation process of the target instruction.

[0028] Figure 6 is a schematic diagram of the effect of reconfigurable fusion of computing units in a softmax algorithm implementation, provided in an embodiment of this disclosure. In a specific example, as shown in Figure 6 and taking the execution of softmax as an example, the FUSED_VFPU (Fused Vector Floating-Point Unit) module can optionally fuse different ALUs, such as fma, cvt (data type conversion), exp, and cmp (compare), in a reconfigurable way within a single computing unit according to the requirements of the application scenario. These ALUs are connected via a CONNECT NETWORK to achieve single-instruction multi-functionality, meaning that one piece of input data can pass through multiple different computational functions. Specifically, the VEC_DISPATCH (vector interrupt dispatch component) transmits control signals and vector data to the FUSED_VFPU module based on the decoding result of the target instruction. The vector data of the target instruction can be loaded from memory into the VRF. Based on the control signals, the FUSED_VFPU module determines which ALUs within the computing unit are selected and the overall data flow direction through the connection network, and finally outputs the computation result of the target instruction to the VRF. As shown in Figure 6, through ALU instruction fusion, the five instructions originally involved in executing softmax (load, fma, exp, add, and store) can be reduced to three instructions: load, sub_exp_add, and store. Here, sub_exp_add fuses the original sub, exp, and add instructions into a single instruction.
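To make the fused sub_exp_add step concrete, the sketch below compares three separate vector operations with one fused call. The exact semantics of sub_exp_add are not spelled out in the text, so the version here (per lane, compute exp(x - m) and add it to a running accumulator) is an assumption, as are both function names and their signatures.

```python
import math

def three_separate_instructions(x, m, acc):
    """sub, exp, add as separate steps, each producing a full intermediate vector."""
    t = [xi - m for xi in x]                      # sub
    e = [math.exp(ti) for ti in t]                # exp
    return e, [a + ei for a, ei in zip(acc, e)]   # add

def sub_exp_add(x, m, acc):
    """Assumed fused form: the intermediate (x - m) never leaves the unit."""
    e = [math.exp(xi - m) for xi in x]
    return e, [a + ei for a, ei in zip(acc, e)]

x, m, acc = [1.0, 2.0, 3.0], 3.0, [0.0, 0.0, 0.0]
print(three_separate_instructions(x, m, acc) == sub_exp_add(x, m, acc))
```

Both paths perform the same arithmetic; the difference the disclosure targets is how many instructions are issued and how often intermediate values travel through the VRF.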

[0029] It is understandable that different target instructions will result in different control signals and vector data generated through decoding. Consequently, the target computation modules selected based on the control signals of the target instructions may also differ, leading to different data flows of vector data within the target computation modules. For example, target instruction A might select a fusion multiply-accumulate module and an exponentiation module in computation unit 1, while in computation unit 2 it might select a fusion multiply-accumulate module. Similarly, target instruction B might select a fusion multiply-accumulate module in computation unit 1, while in computation unit 2 it might select a fusion multiply-accumulate module and a multiplexing module. Based on the target computation modules, and with the constraint control of the control signals, the target computation modules can also implement various different computational functions. For instance, the fusion multiply-accumulate module can perform fusion multiply-accumulate calculations on three types of input data, or addition and subtraction operations on two types of input data. In other words, according to the computational requirements of the instructions, the computational modules within the computation unit can be reconfigurably fused to meet the needs of different instruction scheduling.
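As a rough illustration of one module serving several functions under control-signal constraints, the sketch below models a fused multiply-accumulate module that can also act as an adder or subtractor. The `mode` argument and the trick of fixing the multiplier to 1 are assumptions; the disclosure does not say how the two-input operations are realized, and plain Python arithmetic does not reproduce the single rounding of a hardware FMA.

```python
def fma_module(mode, a, b, c=None):
    """Hypothetical reconfigurable FMA module operating lane by lane.

    mode "fma": three operands, a * b + c
    mode "add": two operands,   a * 1 + b
    mode "sub": two operands,   a * 1 + (-b)
    """
    if mode == "fma":
        return [x * y + z for x, y, z in zip(a, b, c)]
    if mode == "add":
        return [x * 1.0 + y for x, y in zip(a, b)]
    if mode == "sub":
        return [x * 1.0 + (-y) for x, y in zip(a, b)]
    raise ValueError(f"unknown mode: {mode}")

print(fma_module("fma", [2.0], [3.0], [1.0]))
print(fma_module("add", [2.0], [3.0]))
print(fma_module("sub", [2.0], [3.0]))
```

The same multiply-add datapath is reused in all three modes, which is one way a single module could cover the functions listed in the paragraph above.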

[0030] The above technical solution therefore improves the internal structure of the computing unit and fuses ALUs in a rational way, so that all computing resources can be reused to implement a variety of functions. This gives the highest ALU reuse rate with minimal area cost, allows fine-grained control over the issue of complex fused instructions, and keeps instructions fully pipelined, which improves the computing unit's performance, reduces power consumption, and reduces memory accesses. During computation, the computing unit selects target computing modules from the available modules as needed based on the control signals, determines the data flow direction of the vector data within the target computing modules based on the connection network, and then sequentially executes the fused instruction calculation operation of the target instruction through each target computing module according to the control signals and the data flow direction of the vector data. A single computing unit can thus meet the computational needs of multiple different types of fused instructions, which improves instruction scheduling within the computing unit in several respects and ultimately enhances the applicability and overall computing performance of the computing unit. The fused instructions can be generated at the compiler level of the processor, so the scheme is naturally compatible with CUDA and transparent to software. Instruction fusion can therefore reduce the number of instructions at minimal hardware cost, which in turn reduces the instruction issue bandwidth, the VRF access bandwidth, and the performance loss caused by multi-instruction scheduling.
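The claimed reduction in issued instructions and VRF traffic can be illustrated with a toy accounting model. Everything in the sketch is an assumption made for illustration: the register names (`v0`, `vmax`, `vacc`, and so on), the operand lists of each instruction, and the rule that values produced and consumed inside a fused chain never touch the VRF. Only the instruction names and the reduction from five instructions to three come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    srcs: tuple   # VRF registers read
    dsts: tuple   # VRF registers written

def vrf_accesses(program):
    """Total VRF reads plus writes for a list of instructions."""
    return sum(len(i.srcs) + len(i.dsts) for i in program)

def fuse(chain, name, live_out):
    """Merge a dependent chain into one instruction.

    Values produced and consumed inside the chain are assumed to stay in
    the connection network, so they are neither read from nor written to
    the VRF; only registers listed in live_out are written back.
    """
    produced, srcs = set(), []
    for ins in chain:
        for reg in ins.srcs:
            if reg not in produced and reg not in srcs:
                srcs.append(reg)
        produced.update(ins.dsts)
    dsts = tuple(reg for reg in live_out if reg in produced)
    return Instr(name, tuple(srcs), dsts)

unfused = [
    Instr("load", (), ("v0",)),
    Instr("sub", ("v0", "vmax"), ("v1",)),
    Instr("exp", ("v1",), ("v2",)),
    Instr("add", ("v2", "vacc"), ("vacc",)),
    Instr("store", ("v2",), ()),
]
fused = [
    unfused[0],
    fuse(unfused[1:4], "sub_exp_add", live_out=("v2", "vacc")),
    unfused[4],
]

print(len(unfused), vrf_accesses(unfused))
print(len(fused), vrf_accesses(fused))
```

Under these assumed operand lists the fused program issues fewer instructions and touches the VRF less often; the actual savings depend on the real instruction encodings, which the text does not give.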

[0031] Embodiments of this disclosure construct a computing unit from multiple computing modules connected by a connection network. The computing unit receives control signals and vector data generated by decoding a target instruction. Based on the control signals, it selects target computing modules from among the computing modules, and based on the connection network, it determines the data flow direction of the vector data within the target computing modules. The computing unit can then sequentially execute the fused instruction calculation operation of the target instruction through each target computing module according to the control signals and the data flow direction of the vector data. The computing unit provided by the embodiments of this disclosure therefore integrates multiple computing modules and consolidates hardware computing resources, which improves the performance of instruction scheduling within the computing unit and thus the overall computing performance of the computing unit.

[0032] In one example, Figure 7 is a schematic diagram of another computing unit provided in an embodiment of this disclosure. As shown in Figure 7, the computing unit may include a first computing unit and a second computing unit, which work together to complete the fusion processing flow of the target instruction.

[0033] Specifically, the first computing unit is used to: according to the data flow direction of the vector data in the target computing module of the first computing unit, sequentially execute the fusion instruction calculation operation of the target instruction through the target computing module of the first computing unit to obtain the intermediate instruction fusion calculation result; the second computing unit is used to: according to the data flow direction of the vector data in the target computing module of the second computing unit and the intermediate instruction fusion calculation result, sequentially execute the fusion instruction calculation operation of the target instruction through the target computing module of the second computing unit to obtain the target instruction fusion calculation result.

[0034] The intermediate instruction fusion calculation result can be an intermediate result calculated and output by the first calculation unit through the selected target calculation modules. The target instruction fusion calculation result can be the final calculation result of the target instructions calculated and output by the second calculation unit through the selected target calculation modules.

[0035] In this embodiment, optionally, two computing units can be used to jointly process the fusion instruction calculation operation corresponding to the target instruction. The first computing unit processes a portion of the fusion instruction calculation operation of the target instruction and outputs the calculation result corresponding to that portion as the intermediate instruction fusion calculation result. The second computing unit processes the remaining portion of the fusion instruction calculation operation of the target instruction and outputs the calculation result corresponding to that portion as the target instruction fusion calculation result. The first and second computing units can be connected via a network and certain computing modules within the computing units to enable data exchange between them.

[0036] Specifically, the first computing unit can, based on the data flow direction of the vector data in its target computing modules, sequentially execute the fusion instruction calculation operation of the target instruction through those target computing modules, thereby obtaining the intermediate instruction fusion calculation result. The intermediate instruction fusion calculation result can be output separately, for example to check the accuracy of the calculation performed by the first computing unit. At the same time, the first computing unit can also send the intermediate instruction fusion calculation result to a target computing module of the second computing unit. Correspondingly, the second computing unit can, based on the data flow direction of the vector data in its target computing modules and the intermediate instruction fusion calculation result received from the first computing unit, sequentially execute the fusion instruction calculation operation of the target instruction through its target computing modules, thereby obtaining the target instruction fusion calculation result. It is understood that different target instructions lead to different intermediate instruction fusion calculation results output by the first computing unit and to different target instruction fusion calculation results output by the second computing unit.

[0037] Since each computing unit can flexibly select computing modules to participate in the computation as needed, the computation process of the target instruction can be completed by using two computing units in conjunction, which can provide a richer and more reconfigurable data computation method. This further improves the scalability, applicability and reusability of the computing unit, and further optimizes the instruction scheduling performance of the computing unit.
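
The cooperation between the two computing units described above can be pictured with a small software model. The following Python sketch is illustrative only: the module names (`scale`, `exp`, `add_bias`), the `run_unit` helper, and the use of an ordered list of names as the control signal are assumptions made for this example and are not structures defined in this disclosure.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Module = Callable[[Vector], Vector]


def run_unit(modules: Dict[str, Module], selected: Sequence[str], data: Vector) -> Vector:
    """Pass vector data through the selected modules in order; the others stay idle."""
    for name in selected:
        data = modules[name](data)
    return data


# Hypothetical module sets; a real unit would hold FMA, EXP, MUX and CVT/BYPASS blocks.
FIRST_UNIT: Dict[str, Module] = {
    "scale": lambda v: [2.0 * x for x in v],
    "exp": lambda v: [math.exp(x) for x in v],
}
SECOND_UNIT: Dict[str, Module] = {
    "add_bias": lambda v: [x + 1.0 for x in v],
    "scale": lambda v: [0.5 * x for x in v],
}


def run_pair(ctrl_first: Sequence[str], ctrl_second: Sequence[str],
             data: Vector) -> Tuple[Vector, Vector]:
    """The first unit yields the intermediate result; the second unit consumes it."""
    intermediate = run_unit(FIRST_UNIT, ctrl_first, data)
    target = run_unit(SECOND_UNIT, ctrl_second, intermediate)
    return intermediate, target


intermediate, target = run_pair(["scale", "exp"], ["add_bias"], [0.0, 0.5])
```

In this model, changing only the two control lists reconfigures which modules take part, which mirrors the reconfigurable selection of target computing modules described above.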

[0038] In one example, Figure 8 is a schematic diagram of the structure of a fusion computing unit provided in an embodiment of this disclosure. As shown in Figure 8, the first computing unit may further include a first conversion bypass module, a second conversion bypass module, a first multiplexing module and a second multiplexing module, and the computing modules of the first computing unit may include a first fused multiply-accumulate module and an exponentiation module.

[0039] The first and second conversion bypass modules can be circuit modules that provide both data conversion (CVT) and bypass (BYPASS) functions, that is, CVT/BYPASS modules. The first and second multiplexing modules can be modules that provide a multiplexing function, such as a multiplexer (MUX). The first fused multiply-accumulate module can be a module that performs fused multiply-accumulate operations, abbreviated as FMA. The exponentiation module can be a module that evaluates the exponential function, abbreviated as EXP.

[0040] As shown in Figure 8, the input terminal of the first conversion bypass module can receive multiple input signals, for example three input signals. The output terminal of the first conversion bypass module is connected to the input terminal of the first fused multiply-accumulate module and also to the input terminal of the first multiplexing module. The first conversion bypass module is used to select a first target input signal from the multiple input signals according to a control signal and to send the first target input signal to the first fused multiply-accumulate module. The first target input signal can be an input signal that needs to be processed by the first fused multiply-accumulate module, and can be part or all of the input signals received by the first conversion bypass module. The output terminal of the first fused multiply-accumulate module is connected to the input terminals of the first and second multiplexing modules. The first fused multiply-accumulate module is used to perform fused multiply-accumulate calculation on the first target input signal according to a control signal, obtain a first fused multiply-accumulate calculation result, and send that result to the first or second multiplexing module according to a control signal. It should be noted that, under the control of the control signal, the first fused multiply-accumulate module can perform several kinds of calculation on the first target input signal, such as addition or subtraction of two numbers and fused multiply-accumulate of three numbers.
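
One way a single a×b+c datapath could cover addition, subtraction, and three-operand multiply-accumulate is by fixing one operand to 1 and negating another. The following Python sketch illustrates this idea; the helper names are hypothetical, and the plain expression `a * b + c` rounds twice, whereas a hardware FMA rounds once.

```python
def fma(a: float, b: float, c: float) -> float:
    """Software stand-in for a*b + c (two roundings, unlike a hardware FMA)."""
    return a * b + c


def add(x: float, y: float) -> float:
    """Two-number addition: a = 1, b = x, c = y."""
    return fma(1.0, x, y)


def sub(x: float, y: float) -> float:
    """Two-number subtraction: a = 1, b = x, c = -y."""
    return fma(1.0, x, -y)


def mul_add(x: float, y: float, z: float) -> float:
    """Three-number fused multiply-accumulate: a = x, b = y, c = z."""
    return fma(x, y, z)


print(add(3.0, 4.0), sub(3.0, 4.0), mul_add(2.0, 3.0, 4.0))  # 7.0 -1.0 10.0
```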

[0041] As shown in Figure 8, the output terminal of the first multiplexing module is connected to the input terminal of the exponentiation module, and the first multiplexing module is used to send either the output signal of the first conversion bypass module or the first fused multiply-accumulate calculation result to the exponentiation module. Optionally, if the first multiplexing module is selected as one of the target computing modules, it can send the received signal to the exponentiation module. If the first multiplexing module is not selected as one of the target computing modules, it has no output during the execution of the target instruction. The output terminal of the exponentiation module is connected to the input terminal of the second multiplexing module. When selected as a target computing module, the exponentiation module is used to perform an exponentiation operation on the received signal, obtain the exponentiation result, and send the exponentiation result to the second multiplexing module. The input terminal of the second multiplexing module is also connected to the output terminal of the first conversion bypass module that carries the first input signal (r0). The output terminal of the second multiplexing module is connected to the input terminal of the second conversion bypass module and to the input terminal of the third multiplexing module of the second computing unit. This allows the received signal to be sent to the second conversion bypass module and to the third multiplexing module of the second computing unit, enabling both the output of the intermediate instruction fusion calculation result and data exchange between the first and second computing units. Specifically, the second conversion bypass module generates the intermediate instruction fusion calculation result based on its input signal. For example, the second conversion bypass module can perform data precision conversion on the input signal, or output it directly, to generate the intermediate instruction fusion calculation result.

[0042] The above technical solution connects modules with conversion and bypass functions, a fused multiply-accumulate module, an exponentiation module, and multiplexing modules through the network to form the first computing unit. This unit can realize multiple computing functions such as fused multiply-accumulate, subtraction, exponential function operations, and addition, thereby improving the computing performance of the first computing unit.
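
The datapath of the first computing unit can be approximated in software as follows. This Python sketch rests on several assumptions that are not stated in this disclosure: the control signal is modeled as a dictionary with hypothetical keys (`operands`, `mux1`, `mux2`, `cvt2`), the CVT step is modeled as rounding to fp32, and the bypass path into the first multiplexing module is assumed to carry r0.

```python
import math
import struct
from typing import Dict, Optional, Tuple


def cvt_fp32(x: float) -> float:
    """Emulate a CVT step by rounding a Python float (fp64) to fp32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def first_unit(r: Tuple[float, float, float], ctrl: Dict) -> Tuple[float, float]:
    """Return (intermediate result, value forwarded to the second unit)."""
    a, b, c = ctrl["operands"](r)      # first CVT/BYPASS picks the FMA operands
    fma_out = a * b + c                # first FMA module
    exp_out: Optional[float] = None
    if ctrl["mux1"] == "fma":          # first MUX feeds EXP from the FMA output
        exp_out = math.exp(fma_out)
    elif ctrl["mux1"] == "bypass":     # or from the bypassed input (assumed to be r0)
        exp_out = math.exp(r[0])
    # When ctrl["mux1"] is None, the first MUX and EXP are not selected and stay idle.
    mux2_out = {"fma": fma_out, "exp": exp_out, "r0": r[0]}[ctrl["mux2"]]
    out = cvt_fp32(mux2_out) if ctrl["cvt2"] == "cvt" else mux2_out
    return out, mux2_out


# exp(r0 - r1): FMA configured as 1*r0 + (-r1), routed through EXP.
ctrl_exp_sub = {
    "operands": lambda r: (1.0, r[0], -r[1]),
    "mux1": "fma", "mux2": "exp", "cvt2": "bypass",
}
intermediate, forwarded = first_unit((2.0, 1.0, 0.0), ctrl_exp_sub)

# Plain pass-through of r0 with precision conversion on the output side.
ctrl_cvt = {
    "operands": lambda r: (1.0, r[0], 0.0),
    "mux1": None, "mux2": "fma", "cvt2": "cvt",
}
converted, raw = first_unit((0.1, 0.0, 0.0), ctrl_cvt)
```

The second configuration shows why the intermediate output and the forwarded value may differ in precision: only the path through the second conversion bypass module is converted.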

[0043] In an optional embodiment of this disclosure, as shown in Figure 8, the second computing unit may further include a third conversion bypass module, a fourth conversion bypass module, a third multiplexing module, and a fourth multiplexing module. The computing modules of the second computing unit include a second fused multiply-accumulate module.

[0044] The third and fourth conversion bypass modules can be circuit modules that provide both data conversion (CVT) and bypass (BYPASS) functions, that is, CVT/BYPASS modules. The third and fourth multiplexing modules can be modules that provide a multiplexing function, such as a MUX. The second fused multiply-accumulate module can be a module that performs fused multiply-accumulate operations, abbreviated as FMA.

[0045] As shown in Figure 8, the input terminal of the third conversion bypass module is used to receive multiple input signals, for example three input signals. The output terminal of the third conversion bypass module is connected to the input terminal of the second fused multiply-accumulate module and also to the input terminal of the third multiplexing module. The third conversion bypass module is used to select a second target input signal from the multiple input signals according to a control signal and to send the second target input signal to the second fused multiply-accumulate module. The second target input signal can be an input signal that needs to be processed by the second fused multiply-accumulate module, and can be part or all of the input signals received by the third conversion bypass module. The input terminal of the third multiplexing module is connected to the output terminal of the third conversion bypass module and to the output terminal of the second multiplexing module of the first computing unit. The output terminal of the third multiplexing module is connected to the input terminal of the second fused multiply-accumulate module, and the third multiplexing module is used to send the received signal to the second fused multiply-accumulate module. The output terminal of the second fused multiply-accumulate module is connected to the input terminal of the fourth multiplexing module. The second fused multiply-accumulate module performs fused multiply-accumulate calculation on the second target input signal according to the control signal, obtains the second fused multiply-accumulate calculation result, and sends that result to the fourth multiplexing module. It should be noted that, under the control of the control signal, the second fused multiply-accumulate module can perform several kinds of calculation on the second target input signal, such as addition or subtraction of two numbers and fused multiply-accumulate of three numbers.

[0046] As shown in Figure 8, the input terminal of the fourth multiplexing module is also connected to the output terminal of the third conversion bypass module, and the output terminal of the fourth multiplexing module is connected to the input terminal of the fourth conversion bypass module, so that the received signal can be sent to the fourth conversion bypass module. The fourth conversion bypass module can then generate the target instruction fusion calculation result based on its input signal. For example, the fourth conversion bypass module can perform data precision conversion on the input signal, or output it directly, to generate the target instruction fusion calculation result.

[0047] In the above scheme, the conversion bypass module and multiplexing module of each computing unit can serve as auxiliary functional modules. However, since they also possess data processing functions, such as data conversion and bypass operations, they can also serve as special computing modules within the computing unit. This disclosure does not impose any limitations on this. Each module within the first and second computing units can be partially or fully selected as target computing modules to participate in the fusion computing process of target instructions. The target computing modules selected by the first and second computing units can be reconfigured and combined as needed to complete instruction scheduling for different functions.

[0048] The above technical solution connects modules with conversion and bypass functions, a fused multiply-accumulate module, and multiplexing modules through the network to form the second computing unit. Through interaction with the first computing unit, it realizes multiple computing functions such as fused multiply-accumulate, subtraction, exponential function operations, and addition. This improves the computing performance of the second computing unit and enhances the overall joint computing performance of the first and second computing units.
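
The routing inside the second computing unit can be sketched in the same spirit. In the following Python model, the third conversion bypass module and the third multiplexing module are merged into a single lookup of named operand sources; the source names (`fwd`, `one`, `zero`), the control dictionary keys, and the BYPASS-only fourth conversion bypass module are assumptions made for illustration.

```python
from typing import Dict, Tuple


def second_unit(r: Tuple[float, float, float], forwarded: float, ctrl: Dict) -> float:
    """r holds (r3, r4, r5); forwarded is the first unit's second-MUX output."""
    sources = {
        "r3": r[0], "r4": r[1], "r5": r[2],  # inputs passed by the third CVT/BYPASS
        "fwd": forwarded,                    # value arriving through the third MUX
        "one": 1.0, "zero": 0.0,             # assumed constants for add/sub modes
    }
    a, b, c = (sources[name] for name in ctrl["operands"])
    fma_out = a * b + c                      # second FMA module
    # Fourth MUX: take the FMA result, or bypass a signal from the third CVT/BYPASS.
    mux4_out = fma_out if ctrl["mux4"] == "fma" else sources[ctrl["mux4"]]
    return mux4_out                          # fourth CVT/BYPASS in BYPASS mode


r345 = (3.0, 4.0, 5.0)
mul_acc = second_unit(r345, 2.0, {"operands": ("r3", "r4", "fwd"), "mux4": "fma"})
plain_add = second_unit(r345, 2.0, {"operands": ("one", "fwd", "r5"), "mux4": "fma"})
bypassed = second_unit(r345, 2.0, {"operands": ("one", "fwd", "r5"), "mux4": "r3"})
print(mul_acc, plain_add, bypassed)  # 14.0 7.0 3.0
```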

[0049] Figure 9 is a schematic diagram illustrating the execution of a target instruction based on a fusion computing unit, as provided in an embodiment of this disclosure. In an optional embodiment of this disclosure, as shown in Figure 9, the first computing unit can also be used to: control the first conversion bypass module to select a first input signal and a second input signal according to the control signal, and send the first input signal and the second input signal to the first fused multiply-accumulate module; control the first fused multiply-accumulate module to perform a subtraction on the first input signal and the second input signal according to the control signal to obtain the first fused multiply-accumulate calculation result; send the first fused multiply-accumulate calculation result to the first multiplexing module according to the control signal, and control the first multiplexing module to send the first fused multiply-accumulate calculation result to the exponentiation module; and perform an exponentiation operation on the first fused multiply-accumulate calculation result through the exponentiation module to obtain the exponentiation result. The second computing unit is further configured to: control the third conversion bypass module to select a sixth input signal according to the control signal, and send the sixth input signal and the exponentiation result to the second fused multiply-accumulate module; calculate the sum of the sixth input signal and the exponentiation result through the second fused multiply-accumulate module to obtain the second fused multiply-accumulate calculation result; and send the second fused multiply-accumulate calculation result to the fourth conversion bypass module according to the control signal, so that the fourth conversion bypass module can output the target instruction fusion calculation result based on the second fused multiply-accumulate calculation result.

[0050] As shown in Figure 9, assume the target instruction is exp(A-B)+C. When the first and second computing units jointly process this type of target instruction, the first and second input signals can be two input signals received by the first conversion bypass module, such as input signals r0 and r1. Correspondingly, the first conversion bypass module uses the BYPASS function to send the first input signal r0 and the second input signal r1 to the first fused multiply-accumulate module. Further, the first computing unit can control the first fused multiply-accumulate module to perform a subtraction on the first input signal r0 and the second input signal r1 according to the control signal, obtaining the first fused multiply-accumulate calculation result. Since the calculation principle of the first fused multiply-accumulate module is a×b+c, the first computing unit can, according to the control signal, set the value of "a" to 1, the value of "b" to r0, and the value of "c" to -r1. Therefore, the first fused multiply-accumulate calculation result can be "r0-r1". Further, the first computing unit sends the first fused multiply-accumulate calculation result "r0-r1" to the first multiplexing module according to the control signal, and controls the first multiplexing module to send "r0-r1" to the exponentiation module. The first computing unit can perform an exponentiation operation on "r0-r1" through the exponentiation module to obtain the exponentiation result "exp(r0-r1)". Further, the first computing unit sends the exponentiation result "exp(r0-r1)" through the second multiplexing module to the second conversion bypass module for output. For example, the exponentiation result "exp(r0-r1)" can undergo data conversion (CVT) processing to achieve precision conversion, thereby obtaining the final intermediate instruction fusion calculation result "exp(r0-r1)". The precision of the intermediate instruction fusion calculation result and that of the first fused multiply-accumulate calculation result can be different.

[0051] As shown in Figure 9, the second computing unit can control the third conversion bypass module to select the sixth input signal r5 via BYPASS according to the control signal, and send the sixth input signal r5 and the exponentiation result "exp(r0-r1)" to the second fused multiply-accumulate module. Correspondingly, the second computing unit can calculate the sum of the sixth input signal r5 and the exponentiation result "exp(r0-r1)" through the second fused multiply-accumulate module to obtain the second fused multiply-accumulate calculation result. Since the calculation principle of the second fused multiply-accumulate module is also a×b+c, the second computing unit can, according to the control signal, set the value of "a" to 1, the value of "b" to exp(r0-r1), and the value of "c" to r5. Therefore, the second fused multiply-accumulate calculation result can be "exp(r0-r1)+r5". The second computing unit can send the second fused multiply-accumulate calculation result "exp(r0-r1)+r5" to the fourth conversion bypass module according to the control signal. The fourth conversion bypass module can then output the target instruction fusion calculation result based on "exp(r0-r1)+r5". For example, the fourth conversion bypass module can use the BYPASS function to directly output the second fused multiply-accumulate calculation result "exp(r0-r1)+r5" as the target instruction fusion calculation result.

[0052] As shown in Figure 9, in the first computing unit, the first conversion bypass module, the first fused multiply-accumulate module, the first multiplexing module, the exponentiation module, the second multiplexing module, and the second conversion bypass module are the target computing modules selected based on control signals. In the second computing unit, the third conversion bypass module, the second fused multiply-accumulate module, the third multiplexing module, the fourth multiplexing module, and the fourth conversion bypass module are the target computing modules selected based on control signals.

[0053] When the above technical solution processes and executes a target instruction of type exp(A-B)+C in the processor, the first and second computing units can select the target computing modules from the computing modules according to the control signal of the target instruction exp(A-B)+C, and determine the data flow direction of the vector data in the target computing modules according to the connection network, so that the fusion instruction calculation operation of the target instruction is executed sequentially through the target computing modules of each computing unit. The intermediate instruction fusion calculation result exp(A-B) and the target instruction fusion calculation result exp(A-B)+C can be output simultaneously, thereby improving the reuse rate and computing performance of the computing units.
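
As a worked illustration of the exp(A-B)+C flow, the following Python sketch applies the operand settings described above element by element to small vectors. The function name `exp_sub_add` and the sample values are assumptions for this example; the register roles (r0 = A, r1 = B, r5 = C) follow the description.

```python
import math
from typing import List, Tuple


def exp_sub_add(r0: List[float], r1: List[float],
                r5: List[float]) -> Tuple[List[float], List[float]]:
    """Return (intermediate exp(r0 - r1), target exp(r0 - r1) + r5) per element."""
    # First FMA: a = 1, b = r0, c = -r1, giving r0 - r1.
    fma1 = [1.0 * x + (-y) for x, y in zip(r0, r1)]
    # EXP module: exponentiation result, also the intermediate output.
    exp_out = [math.exp(v) for v in fma1]
    # Second FMA: a = 1, b = exp(r0 - r1), c = r5.
    fma2 = [1.0 * e + z for e, z in zip(exp_out, r5)]
    return exp_out, fma2


inter, target = exp_sub_add([1.0, 2.0], [1.0, 0.0], [0.5, 0.5])
```

Both results come out of a single pass, so the value of exp(r0-r1) does not need to be written back and re-read between the subtraction, the exponentiation, and the final addition.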

[0054] Figure 10 is a schematic diagram illustrating the execution of a target instruction based on a fusion computing unit, as provided in an embodiment of this disclosure. In an optional embodiment of this disclosure, as shown in Figure 10, the first computing unit can also be used to: control the first conversion bypass module to select a first input signal, a second input signal, and a third input signal according to the control signal, and send the first input signal, the second input signal, and the third input signal to the first fused multiply-accumulate module; control the first fused multiply-accumulate module to multiply the first input signal by the second input signal according to the control signal, and calculate the sum of the multiplication result and the third input signal to obtain the first fused multiply-accumulate calculation result; send the first fused multiply-accumulate calculation result to the second multiplexing module according to the control signal, and control the second multiplexing module to send the first fused multiply-accumulate calculation result to the second conversion bypass module and the third conversion bypass module, so that the second conversion bypass module outputs the intermediate instruction fusion calculation result based on the first fused multiply-accumulate calculation result. The second computing unit is further configured to: control the third conversion bypass module to select a fourth input signal and a fifth input signal according to the control signal, and send the fourth input signal, the fifth input signal and the first fused multiply-accumulate calculation result to the second fused multiply-accumulate module; calculate the first product value of the fourth input signal and the fifth input signal through the second fused multiply-accumulate module, and calculate the sum of the first product value and the first fused multiply-accumulate calculation result to obtain the second fused multiply-accumulate calculation result; and send the second fused multiply-accumulate calculation result to the fourth conversion bypass module according to the control signal, so that the fourth conversion bypass module outputs the target instruction fusion calculation result based on the second fused multiply-accumulate calculation result.

[0055] As shown in Figure 10, assume the target instruction is A×X+B×Y+C. When the first and second computing units jointly process this type of target instruction, the first, second, and third input signals can be three input signals received by the first conversion bypass module, such as input signals r0, r1, and r2. Correspondingly, the first conversion bypass module uses the BYPASS function to send the first input signal r0, the second input signal r1, and the third input signal r2 to the first fused multiply-accumulate module. Further, the first computing unit can control the first fused multiply-accumulate module to perform fused multiply-accumulate calculation on r0, r1, and r2 according to the control signal, obtaining the first fused multiply-accumulate calculation result. Since the calculation principle of the first fused multiply-accumulate module is a×b+c, the first computing unit can, according to the control signal, set the value of "a" to r0, the value of "b" to r1, and the value of "c" to r2. Therefore, the first fused multiply-accumulate calculation result can be "r0×r1+r2". Further, the first computing unit sends the first fused multiply-accumulate calculation result "r0×r1+r2" to the second multiplexing module according to the control signal, and controls the second multiplexing module to send "r0×r1+r2" to the second and third conversion bypass modules, so that the second conversion bypass module can output the intermediate instruction fusion calculation result based on the first fused multiply-accumulate calculation result. For example, the second conversion bypass module first performs data conversion (CVT) processing on "r0×r1+r2" to achieve precision conversion, thereby obtaining the final intermediate instruction fusion calculation result "r0×r1+r2". The precision of the intermediate instruction fusion calculation result and that of the first fused multiply-accumulate calculation result can be different.

[0056] As shown in Figure 10, the second computing unit can control the third conversion bypass module to select the fourth input signal r3 and the fifth input signal r4 via BYPASS according to the control signal, and send the fourth input signal r3, the fifth input signal r4, and the first fused multiply-accumulate calculation result "r0×r1+r2" to the second fused multiply-accumulate module. Correspondingly, the second computing unit can perform fused multiply-accumulate calculation on r3, r4, and "r0×r1+r2" through the second fused multiply-accumulate module to obtain the second fused multiply-accumulate calculation result. Since the calculation principle of the second fused multiply-accumulate module is also a×b+c, the second computing unit can, according to the control signal, set the value of "a" to r3, the value of "b" to r4, and the value of "c" to r0×r1+r2. The second fused multiply-accumulate calculation result can then be "r3×r4+r0×r1+r2". The second computing unit can send the second fused multiply-accumulate calculation result "r3×r4+r0×r1+r2" to the fourth conversion bypass module according to the control signal, so that the fourth conversion bypass module can output the target instruction fusion calculation result based on it. For example, the fourth conversion bypass module can use the BYPASS function to directly output the second fused multiply-accumulate calculation result "r3×r4+r0×r1+r2" as the target instruction fusion calculation result.

[0057] As shown in Figure 10, in the first computing unit, the first conversion bypass module, the first fused multiply-accumulate module, the second multiplexing module, and the second conversion bypass module are the target computing modules selected based on control signals. In the second computing unit, the third conversion bypass module, the second fused multiply-accumulate module, the third multiplexing module, the fourth multiplexing module, and the fourth conversion bypass module are the target computing modules selected based on control signals.

[0058] When the above technical solution processes and executes a target instruction of type A×X+B×Y+C in the processor, the first and second computing units can select the target computing modules from the computing modules according to the control signal of the target instruction A×X+B×Y+C, and determine the data flow direction of the vector data in the target computing modules according to the connection network, so that the fusion instruction calculation operation of the target instruction is executed sequentially through the target computing modules of each computing unit. The intermediate instruction fusion calculation result B×Y+C and the target instruction fusion calculation result A×X+B×Y+C can be output simultaneously, thereby improving the reuse rate and computing performance of the computing units.
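
The chaining of the two fused multiply-accumulate modules for A×X+B×Y+C can be traced with a short Python sketch. The function names and sample values are assumptions; the operand assignment (first module: a = r0, b = r1, c = r2; second module: a = r3, b = r4, c = first result) follows the description above, and the plain expression `a * b + c` rounds twice, unlike a hardware FMA.

```python
from typing import Tuple


def fma(a: float, b: float, c: float) -> float:
    """Software stand-in for a*b + c."""
    return a * b + c


def fused_axbyc(r0: float, r1: float, r2: float,
                r3: float, r4: float) -> Tuple[float, float]:
    """Return (intermediate B*Y + C, target A*X + B*Y + C)."""
    inter = fma(r0, r1, r2)      # first unit: r0*r1 + r2
    target = fma(r3, r4, inter)  # second unit: r3*r4 + (r0*r1 + r2)
    return inter, target


inter, target = fused_axbyc(2.0, 3.0, 1.0, 4.0, 5.0)
print(inter, target)  # 7.0 27.0
```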

[0059] Figure 11 is a schematic diagram illustrating the execution of a target instruction based on a fusion computing unit, as provided in an embodiment of this disclosure. In an optional embodiment of this disclosure, as shown in Figure 11, the first computing unit may further be used to: control the first conversion bypass module to select a first input signal and a third input signal according to the control signal, and send the first input signal and the third input signal to the first fused multiply-accumulate module; control the first fused multiply-accumulate module to perform an addition on the first input signal and the third input signal according to the control signal to obtain the first fused multiply-accumulate calculation result; send the first fused multiply-accumulate calculation result to the second multiplexing module according to the control signal, and control the second multiplexing module to send the first fused multiply-accumulate calculation result to the second conversion bypass module and the third conversion bypass module, so that the second conversion bypass module outputs the intermediate instruction fusion calculation result based on the first fused multiply-accumulate calculation result. The second computing unit is further configured to: control the third conversion bypass module to select a fourth input signal and a sixth input signal according to the control signal, and send the fourth input signal, the sixth input signal and the first fused multiply-accumulate calculation result to the second fused multiply-accumulate module; calculate the second product value of the fourth input signal and the first fused multiply-accumulate calculation result through the second fused multiply-accumulate module, and calculate the sum of the second product value and the sixth input signal to obtain the second fused multiply-accumulate calculation result; and send the second fused multiply-accumulate calculation result to the fourth conversion bypass module according to the control signal, so that the fourth conversion bypass module can output the target instruction fusion calculation result based on the second fused multiply-accumulate calculation result.

[0060] As shown in Figure 11, assume the target instruction is (X+A)×Y+B. When the first and second computing units jointly process this type of target instruction, the first and third input signals can be two input signals received by the first conversion bypass module, such as input signals r0 and r2. Correspondingly, the first conversion bypass module uses the BYPASS function to send the first input signal r0 and the third input signal r2 to the first fused multiply-accumulate module. Further, the first computing unit can control the first fused multiply-accumulate module to perform an addition on the first input signal r0 and the third input signal r2 according to the control signal, obtaining the first fused multiply-accumulate calculation result. Since the calculation principle of the first fused multiply-accumulate module is a×b+c, the first computing unit can, according to the control signal, set the value of "a" to r0, the value of "b" to 1, and the value of "c" to r2. Therefore, the first fused multiply-accumulate calculation result can be "r0+r2". Further, the first computing unit sends the first fused multiply-accumulate calculation result "r0+r2" to the second multiplexing module according to the control signal, and controls the second multiplexing module to send "r0+r2" to the second conversion bypass module and the third conversion bypass module. The second conversion bypass module then outputs the intermediate instruction fusion calculation result based on the first fused multiply-accumulate calculation result. For example, the second conversion bypass module first performs data conversion (CVT) processing on "r0+r2" to achieve precision conversion, thereby obtaining the final intermediate instruction fusion calculation result "r0+r2". The data precision of the intermediate instruction fusion calculation result and that of the first fused multiply-accumulate calculation result can be different.

[0061] As shown in Figure 11, the second computing unit can control the third conversion bypass module to select the fourth input signal r3 and the sixth input signal r5 via BYPASS according to the control signal, and send the fourth input signal r3, the sixth input signal r5 and the first fused multiply-accumulate calculation result "r0+r2" to the second fused multiply-accumulate module. Correspondingly, the second computing unit can perform fused multiply-accumulate calculation on r3, r5 and "r0+r2" through the second fused multiply-accumulate module to obtain the second fused multiply-accumulate calculation result. Since the calculation principle of the second fused multiply-accumulate module is also a×b+c, the second computing unit can, according to the control signal, set the value of "a" to r3, the value of "b" to r0+r2, and the value of "c" to r5. The second fused multiply-accumulate calculation result can then be "r3×(r0+r2)+r5". The second computing unit can send the second fused multiply-accumulate calculation result "r3×(r0+r2)+r5" to the fourth conversion bypass module according to the control signal, so that the fourth conversion bypass module can output the target instruction fusion calculation result based on it. For example, the fourth conversion bypass module can use the BYPASS function to directly output the second fused multiply-accumulate calculation result "r3×(r0+r2)+r5" as the target instruction fusion calculation result.

[0062] As shown in Figure 11, in the first computing unit, the first conversion bypass module, the first fusion multiply-accumulate module, the second multiplexing module, and the second conversion bypass module are the target computing modules selected based on the control signals. In the second computing unit, the third conversion bypass module, the second fusion multiply-accumulate module, the third multiplexing module, the fourth multiplexing module, and the fourth conversion bypass module are the target computing modules selected based on the control signals.

[0063] In the above technical solution, when the processor processes and executes a target instruction of the form (X+A)*Y+B, the first and second calculation units can, according to the control signal of the (X+A)*Y+B target instruction, select the target computing modules from among the computing modules and determine the data flow direction of the vector data in the target computing modules according to the connection network. The target computing modules of each computing unit then execute the fusion instruction calculation operation of the target instruction in sequence, and the intermediate instruction fusion calculation result X+A and the target instruction fusion calculation result (X+A)*Y+B can be output simultaneously, which improves the reuse rate and computing performance of the computing unit.
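The Figure 11 data flow for a target instruction of the form (X+A)*Y+B can be illustrated with a minimal software sketch. It is not the hardware implementation: vectors are modeled as Python lists, the function names `fma` and `fused_add_mul_add` and the `cvt` hook are hypothetical, and feeding the unconverted first result to the second stage is an assumption read from the description of the second multiplexing module.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def fma(a: Vector, b: Vector, c: Vector) -> Vector:
    """Element-wise a*b+c, the calculation principle of each multiply-accumulate module."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def fused_add_mul_add(
    r0: Vector,
    r2: Vector,
    r3: Vector,
    r5: Vector,
    cvt: Callable[[float], float] = lambda v: v,  # hypothetical precision conversion
) -> Tuple[Vector, Vector]:
    """Return (intermediate, target) = (r0+r2, r3*(r0+r2)+r5)."""
    ones = [1.0] * len(r0)
    # First unit: a=r0, b=1, c=r2, so the multiply-accumulate acts as an addition.
    first = fma(r0, ones, r2)
    # Second conversion bypass module: optional CVT on the intermediate output.
    intermediate = [cvt(v) for v in first]
    # Second unit: a=r3, b=(r0+r2), c=r5 (assumed to use the unconverted result).
    target = fma(r3, first, r5)
    return intermediate, target


intermediate, target = fused_add_mul_add(
    r0=[1.0, 2.0], r2=[3.0, 4.0], r3=[2.0, 0.5], r5=[1.0, 1.0]
)
print(intermediate)  # [4.0, 6.0]
print(target)        # [9.0, 4.0]
```

Both results come from one pass over the inputs, which mirrors the point that the intermediate and target results are output together.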

[0064] In one example, Figure 12 is a schematic diagram of executing a target instruction based on a computing unit, provided by an embodiment of this disclosure. In an optional embodiment of this disclosure, as shown in Figure 12, the computing unit may consist of only one independent computing unit composed of multiple modules, which can independently complete the fusion processing of the target instruction.

[0065] Specifically, as shown in Figure 12, an independent computing unit may include a branch decision module (CMP), a multiplication calculation module (MUL), a logic operation module (LUT), and an addition calculation module (ADD), wherein: the input terminal of the branch decision module is used to receive a first input signal, a second input signal, and a third input signal, and the output terminal of the branch decision module is connected to the input terminal of the multiplication calculation module; the branch decision module is used to determine the input value of each input signal according to the control signal, select an intermediate signal according to the magnitude relationship between the first input signal and the second input signal, and send the intermediate signal and the third input signal to the multiplication calculation module; the output terminal of the multiplication calculation module is connected to the input terminal of the addition calculation module, and the multiplication calculation module is used to calculate the third product value of the intermediate signal and the third input signal and send the third product value to the addition calculation module; the input terminal of the logic operation module is used to receive the fourth input signal, and the output terminal of the logic operation module is connected to the input terminal of the addition calculation module; the logic operation module is used to perform a logical operation on the fourth input signal to obtain a logical operation result, send the logical operation result to the addition calculation module, and output the logical operation result as the intermediate instruction fusion calculation result; and the addition calculation module is used to add the third product value and the logical operation result to obtain the target instruction fusion calculation result.

[0066] As shown in Figure 12, assume the target instruction is (A>B?A:B)*Y+LUT(C), that is, a target instruction that compares first data A with second data B, selects one of them based on the comparison result, calculates the product of the selected data and third data Y, and then calculates the sum of that product and the logical operation result of fourth data C. When an independent computing unit processes this type of target instruction, the first, second, and third input signals can be the three input signals received by the branch decision module. For example, the first input signal can be r0, the second input signal can be r1, and the third input signal can be r2. After receiving r0, r1, and r2, the branch decision module determines the input value of each input signal according to the control signal, selects an intermediate signal based on the magnitude relationship between the first input signal r0 and the second input signal r1, and sends the intermediate signal and the third input signal r2 to the multiplication calculation module. Optionally, according to the control signal, the branch decision module can select either the smaller or the larger of r0 and r1 as the intermediate signal; that is, the intermediate signal can be either r0 or r1. The computing unit can then calculate the product of the intermediate signal and the third input signal r2 through the multiplication calculation module to obtain the third product value, and send the third product value to the addition calculation module.

[0067] As shown in Figure 12, the input terminal of the logic operation module can receive the fourth input signal r3. The logic operation module performs a logical operation on r3 to obtain the logical operation result, sends the logical operation result to the addition calculation module, and outputs it as the intermediate instruction fusion calculation result. For example, the logical operation result can be LUT(r3). Correspondingly, the addition calculation module can add the third product value and the logical operation result to obtain the target instruction fusion calculation result. That is, the target instruction fusion calculation result can be (r0>r1?r0:r1)*r2+LUT(r3).

[0068] It can be understood that the target computing modules in the above independent computing unit include the branch decision module, the multiplication calculation module, the logic operation module, and the addition calculation module.

[0069] In the above technical solution, when the processor processes and executes a target instruction of the form (A>B?A:B)*Y+LUT(C), a computing unit that includes a branch decision module, a multiplication calculation module, a logic operation module, and an addition calculation module determines the data flow direction of the vector data in each computing module within the computing unit based on the control signals of the target instruction and the connection network, so that the fusion instruction calculation operation of the target instruction is executed sequentially through each computing module within the computing unit. The intermediate instruction fusion calculation result LUT(C) and the target instruction fusion calculation result (A>B?A:B)*Y+LUT(C) can be output simultaneously, improving the reuse rate and computing performance of the computing unit.
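The Figure 12 pipeline (CMP, MUL, LUT, ADD) for a target instruction of the form (A>B?A:B)*Y+LUT(C) can be sketched in software as follows. This is only an illustration under stated assumptions: the table contents in `EXAMPLE_LUT`, the `pick_larger` flag standing in for the control signal, and all function names are hypothetical, since the disclosure does not specify the lookup table or the control encoding.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table; the source does not specify the LUT contents.
EXAMPLE_LUT: Dict[int, float] = {0: 0.0, 1: 0.5, 2: 1.0, 3: 1.5}


def cmp_select(a: float, b: float, pick_larger: bool = True) -> float:
    """Branch decision module (CMP): keep the larger or the smaller of a and b."""
    if pick_larger:
        return a if a > b else b
    return a if a < b else b


def fused_select_mul_lut_add(
    r0: List[float],
    r1: List[float],
    r2: List[float],
    r3: List[int],
    lut: Dict[int, float] = EXAMPLE_LUT,
    pick_larger: bool = True,
) -> Tuple[List[float], List[float]]:
    """Return (intermediate, target) = (LUT(r3), select(r0, r1)*r2 + LUT(r3))."""
    intermediate: List[float] = []
    target: List[float] = []
    for a, b, y, c in zip(r0, r1, r2, r3):
        selected = cmp_select(a, b, pick_larger)  # CMP
        product = selected * y                    # MUL: third product value
        looked_up = lut[c]                        # LUT: logical operation result
        intermediate.append(looked_up)
        target.append(product + looked_up)        # ADD
    return intermediate, target


intermediate, target = fused_select_mul_lut_add(
    r0=[3.0, 1.0], r1=[2.0, 5.0], r2=[2.0, 2.0], r3=[1, 2]
)
print(intermediate)  # [0.5, 1.0]
print(target)        # [6.5, 11.0]
```

Passing `pick_larger=False` models the optional case in which the control signal selects the smaller of the two compared inputs.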

[0070] In one example, Figure 13 is a flowchart of an instruction execution method provided in an embodiment of this disclosure. This embodiment is applicable to situations where the ALU within a processor selects computing modules on demand to execute the fusion instruction calculation operation of a target instruction. The method can be executed by a processor including an ALU, which is generally integrated into an electronic device. The electronic device can be a terminal device or a server device; this disclosure does not limit the specific device type. Correspondingly, as shown in Figure 13, the method includes the following operations: S1310. Receive the target instruction and decode it to generate control signals and vector data.

[0071] S1320. Select a target computing module from each of the computing modules according to the control signal, and determine the data flow direction of the vector data in the target computing module according to the connection network.

[0072] S1330. According to the control signal and the data flow direction of the vector data in the target computing module, sequentially execute the fusion instruction calculation operation of the target instruction through each of the target computing modules.
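Operations S1310 to S1330 can be modeled as a small dispatcher. The sketch below is an assumption-laden illustration rather than the disclosed decoder: the instruction record layout (`route`, `operands`), the module names in `MODULES`, and the temporary names `t0` and `t1` are all hypothetical. The ordered `route` list plays the role of the control signals, and the input and output names in each step play the role of the data flow direction through the connection network.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]

# Available computing modules, keyed by hypothetical names.
MODULES: Dict[str, Callable[..., Vector]] = {
    "fma": lambda a, b, c: [x * y + z for x, y, z in zip(a, b, c)],
    "mul": lambda a, b: [x * y for x, y in zip(a, b)],
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
}

# One control-signal entry: (module name, names of its inputs, name of its output).
Step = Tuple[str, Sequence[str], str]


def decode(instruction: dict) -> Tuple[List[Step], Dict[str, Vector]]:
    """S1310: split a hypothetical instruction record into control signals and vector data."""
    return list(instruction["route"]), dict(instruction["operands"])


def execute(instruction: dict) -> Dict[str, Vector]:
    control, values = decode(instruction)
    # S1320: select the target modules; each step's input/output names fix the data flow.
    targets = [(MODULES[name], inputs, output) for name, inputs, output in control]
    # S1330: run the target modules in sequence, each one reading earlier outputs.
    for module, inputs, output in targets:
        values[output] = module(*(values[i] for i in inputs))
    return values


# Example: t0 = r0*r1 + r2 (intermediate), t1 = r3*r4 + t0 (target).
result = execute({
    "route": [("fma", ("r0", "r1", "r2"), "t0"), ("fma", ("r3", "r4", "t0"), "t1")],
    "operands": {
        "r0": [1.0, 2.0], "r1": [3.0, 4.0], "r2": [5.0, 6.0],
        "r3": [2.0, 2.0], "r4": [1.0, 0.5],
    },
})
print(result["t0"])  # [8.0, 14.0]
print(result["t1"])  # [10.0, 15.0]
```

Modules that are not named in `route` (here `mul` and `add`) are simply never invoked, which corresponds to selecting only the target computing modules for a given instruction.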

[0073] In an optional embodiment of this disclosure, the computing unit may include a first computing unit and a second computing unit. The step of sequentially executing the fusion instruction calculation operation of the target instruction through each of the target computing modules according to the control signal and the data flow direction of the vector data in the target computing module may include: the first computing unit sequentially executing the fusion instruction calculation operation of the target instruction through the target computing module of the first computing unit, according to the data flow direction of the vector data in the target computing module of the first computing unit, to obtain an intermediate instruction fusion calculation result; and the second computing unit sequentially executing the fusion instruction calculation operation of the target instruction through the target computing module of the second computing unit, according to the data flow direction of the vector data in the target computing module of the second computing unit and the intermediate instruction fusion calculation result, to obtain a target instruction fusion calculation result.

[0074] Optionally, the first computing unit further includes a first conversion bypass module, a second conversion bypass module, a first multiplexing module, and a second multiplexing module. The computing modules of the first computing unit include a first fusion multiply-accumulate module and an exponentiation module. The step in which the first computing unit, based on the data flow direction of the vector data in the target computing module of the first computing unit, sequentially executes the fusion instruction calculation operation of the target instruction through the target computing module of the first computing unit to obtain the intermediate instruction fusion calculation result includes: receiving multiple input signals through the first conversion bypass module, selecting a first target input signal from the multiple input signals according to the control signal, and sending the first target input signal to the first fusion multiply-accumulate module; performing, through the first fusion multiply-accumulate module, fusion multiply-accumulate calculation on the first target input signal according to the control signal to obtain a first fusion multiply-accumulate calculation result, and sending the first fusion multiply-accumulate calculation result to the first multiplexing module or the second multiplexing module according to the control signal; sending, through the first multiplexing module, the output signal of the first conversion bypass module or the first fusion multiply-accumulate calculation result to the exponentiation module; performing, through the exponentiation module, exponentiation calculation on the received signal to obtain an exponentiation result, and sending the exponentiation result to the second multiplexing module; sending, through the second multiplexing module, the received input signal to the second conversion bypass module and to the third multiplexing module of the second computing unit; and generating, through the second conversion bypass module, the intermediate instruction fusion calculation result according to the input signal.

[0075] Optionally, the second calculation unit further includes a third conversion bypass module, a fourth conversion bypass module, a third multiplexing module, and a fourth multiplexing module. The calculation modules of the second calculation unit include a second fusion multiply-accumulate module. The step in which the second calculation unit, based on the data flow direction of the vector data in the target calculation module of the second calculation unit and the intermediate instruction fusion calculation result, sequentially executes the fusion instruction calculation operation of the target instruction through the target calculation module of the second calculation unit to obtain the target instruction fusion calculation result includes: receiving multiple input signals through the third conversion bypass module, selecting a second target input signal from the multiple input signals according to the control signal, and sending the second target input signal to the second fusion multiply-accumulate module; sending, through the third multiplexing module, the received signal to the second fusion multiply-accumulate module; performing, through the second fusion multiply-accumulate module, fusion multiply-accumulate calculation on the second target input signal according to the control signal to obtain a second fusion multiply-accumulate calculation result, and sending the output signal of the second fusion multiply-accumulate calculation result to the fourth multiplexing module; sending, through the fourth multiplexing module, the received signal to the fourth conversion bypass module; and generating, through the fourth conversion bypass module, the target instruction fusion calculation result based on the input signal.

[0076] Optionally, the first computing unit, based on the data flow direction of the vector data in the target computing module of the first computing unit, sequentially executes the fusion instruction calculation operation of the target instruction through the target computing module of the first computing unit to obtain the intermediate instruction fusion calculation result. This includes: controlling the first conversion bypass module to select a first input signal and a second input signal according to the control signal, and sending the first input signal and the second input signal to the first fusion multiply-accumulate module; controlling the first fusion multiply-accumulate module to perform subtraction calculation on the first input signal and the second input signal according to the control signal to obtain the first fusion multiply-accumulate calculation result; sending the first fusion multiply-accumulate calculation result to the first multiplexing module according to the control signal, and controlling the first multiplexing module to send the first fusion multiply-accumulate calculation result to the exponentiation module; performing exponentiation operation on the first fusion multiply-accumulate calculation result through the exponentiation module to obtain the exponentiation result; and sending the exponentiation result to the second conversion bypass module for output through the second multiplexing module to obtain the intermediate instruction fusion calculation result. The second calculation unit, based on the data flow of the vector data in the target calculation module of the second calculation unit and the intermediate instruction fusion calculation result, sequentially executes the fusion instruction calculation operation of the target instruction through the target calculation module of the second calculation unit to obtain the target instruction fusion calculation result. 
This includes: the second calculation unit controlling the third conversion bypass module to select a sixth input signal according to the control signal, and sending the sixth input signal and the exponentiation result to the second fusion multiply-accumulate module; the second fusion multiply-accumulate module of the second calculation unit calculating the sum of the sixth input signal and the exponentiation result to obtain the second fusion multiply-accumulate calculation result; and sending the second fusion multiply-accumulate calculation result to the fourth conversion bypass module according to the control signal, so that the fourth conversion bypass module outputs the target instruction fusion calculation result based on the second fusion multiply-accumulate calculation result.
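The subtract, exponentiate, then add path described in this embodiment can be sketched as below. Several points are assumptions made only for illustration: the exponentiation module is modeled as the natural exponential `math.exp` (the disclosure does not state the base), the function name `fused_sub_exp_add` is hypothetical, and the register names r0, r1, and r5 are taken to stand for the first, second, and sixth input signals.

```python
import math
from typing import List, Tuple


def fused_sub_exp_add(
    r0: List[float], r1: List[float], r5: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (intermediate, target) = (exp(r0 - r1), exp(r0 - r1) + r5)."""
    intermediate: List[float] = []
    target: List[float] = []
    for x, y, b in zip(r0, r1, r5):
        # First multiply-accumulate module configured as a subtraction.
        diff = x - y
        # First multiplexing module routes the difference to the exponentiation module.
        powered = math.exp(diff)  # assumed base e
        # Second multiplexing module: one copy is output as the intermediate result...
        intermediate.append(powered)
        # ...and one copy goes to the second unit, which adds the sixth input signal.
        target.append(b + powered)
    return intermediate, target


intermediate, target = fused_sub_exp_add(
    r0=[1.0, 2.0], r1=[1.0, 0.0], r5=[0.5, 0.0]
)
print(intermediate[0], target[0])  # 1.0 1.5
```

The first element is exact because exp(0) is 1; other elements depend on floating-point rounding and are best compared with a tolerance.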

[0077] Optionally, the first computing unit executes the fusion instruction calculation operation of the target instruction sequentially through the target computing module of the first computing unit according to the data flow direction of the vector data in the target computing module of the first computing unit to obtain the intermediate instruction fusion calculation result. This includes: controlling the first conversion bypass module to select a first input signal, a second input signal, and a third input signal according to the control signal, and sending the first input signal, the second input signal, and the third input signal to the first fusion multiply-accumulate module according to the control signal; controlling the first fusion multiply-accumulate module to perform multiplication calculation on the first input signal and the second input signal according to the control signal, and calculating the sum of the multiplication result and the third input signal to obtain the first fusion multiply-accumulate calculation result; sending the first fusion multiply-accumulate calculation result to the second multiplexing module according to the control signal, and controlling the second multiplexing module to send the first fusion multiply-accumulate calculation result to the second conversion bypass module and the third conversion bypass module, so that the second conversion bypass module outputs the intermediate instruction fusion calculation result according to the first fusion multiply-accumulate calculation result. The second calculation unit, based on the data flow of the vector data in the target calculation module of the second calculation unit and the intermediate instruction fusion calculation result, sequentially executes the fusion instruction calculation operation of the target instruction through the target calculation module of the second calculation unit to obtain the target instruction fusion calculation result. 
This includes: the second calculation unit controlling the third conversion bypass module to select a fourth input signal and a fifth input signal according to the control signal, and sending the fourth input signal, the fifth input signal, and the first fusion multiplication and addition calculation result to the second fusion multiplication and addition module; the second fusion multiplication and addition module of the second calculation unit calculating a first product value between the fourth input signal and the fifth input signal, and calculating the sum of the first product value and the first fusion multiplication and addition calculation result to obtain the second fusion multiplication and addition calculation result; and sending the second fusion multiplication and addition calculation result to the fourth conversion bypass module according to the control signal, so that the fourth conversion bypass module outputs the target instruction fusion calculation result based on the second fusion multiplication and addition calculation result.

[0078] Optionally, the first computing unit executes the fusion instruction calculation operation of the target instruction sequentially through the target computing module of the first computing unit according to the data flow direction of the vector data in the target computing module of the first computing unit to obtain the intermediate instruction fusion calculation result. This includes: controlling the first conversion bypass module to select a first input signal and a third input signal according to the control signal, and sending the first input signal and the third input signal to the first fusion multiply-accumulate module; controlling the first fusion multiply-accumulate module to perform addition calculation on the first input signal and the third input signal according to the control signal to obtain the first fusion multiply-accumulate calculation result; sending the first fusion multiply-accumulate calculation result to the second multiplexing module according to the control signal, and controlling the second multiplexing module to send the first fusion multiply-accumulate calculation result to the second conversion bypass module and the third conversion bypass module, so that the second conversion bypass module outputs the intermediate instruction fusion calculation result according to the first fusion multiply-accumulate calculation result. The second calculation unit, based on the data flow of the vector data in the target calculation module of the second calculation unit and the intermediate instruction fusion calculation result, sequentially executes the fusion instruction calculation operation of the target instruction through the target calculation module of the second calculation unit to obtain the target instruction fusion calculation result. 
This includes: the second calculation unit controlling the third conversion bypass module to select a fourth input signal and a sixth input signal according to the control signal, and sending the fourth input signal, the sixth input signal, and the first fusion multiplication and addition calculation result to the second fusion multiplication and addition module; the second fusion multiplication and addition module of the second calculation unit calculating a second product value between the fourth input signal and the first fusion multiplication and addition calculation result, and calculating the sum of the second product value and the sixth input signal to obtain the second fusion multiplication and addition calculation result; and sending the second fusion multiplication and addition calculation result to the fourth conversion bypass module according to the control signal, so that the fourth conversion bypass module outputs the target instruction fusion calculation result based on the second fusion multiplication and addition calculation result.

[0079] Optionally, the target calculation modules of the calculation unit include a branch decision module, a multiplication calculation module, a logic operation module, and an addition calculation module. The step of sequentially executing the fusion instruction calculation operation of the target instruction through each of the target calculation modules, according to the control signal and the data flow direction of the vector data in the target calculation module, includes: receiving a first input signal, a second input signal, and a third input signal through the input terminal of the branch decision module; determining the input value of each input signal according to the control signal; selecting an intermediate signal based on the magnitude relationship between the first and second input signals; sending the intermediate signal and the third input signal to the multiplication calculation module; calculating a third product value of the intermediate signal and the third input signal through the multiplication calculation module; sending the third product value to the addition calculation module; receiving a fourth input signal through the logic operation module; performing a logical operation on the fourth input signal to obtain a logical operation result; sending the logical operation result to the addition calculation module, and outputting the logical operation result as an intermediate instruction fusion calculation result; and adding the third product value and the logical operation result through the addition calculation module to obtain the target instruction fusion calculation result.

[0080] The embodiments of this disclosure construct a computing unit from multiple computing modules connected by a connection network. The computing unit receives control signals and vector data generated by decoding a target instruction. Based on the control signals, it selects target computing modules from among the computing modules and determines the data flow direction of the vector data within the target computing modules according to the connection network. The computing unit can then sequentially execute the fusion instruction calculation operation of the target instruction through each target computing module according to the control signals and the data flow direction of the vector data. The computing unit provided by the embodiments of this disclosure thus integrates multiple computing modules and consolidates hardware computing resources, which improves the performance of instruction scheduling within the computing unit and, in turn, its overall computing performance.

[0081] The aforementioned computing unit can execute the instruction execution method provided in any embodiment of this disclosure, and possesses the corresponding functional modules and beneficial effects of the execution method. For technical details not described in detail in this embodiment, please refer to the working principle and structure of the computing unit provided in any embodiment of this disclosure.

[0082] Since the computing unit described above is an apparatus capable of executing the instruction execution method in the embodiments of this disclosure, those skilled in the art can understand the specific implementation of the instruction execution method in this embodiment and its variations based on the working principle and structure of the computing unit described in the embodiments of this disclosure. Therefore, the specific implementation of the instruction execution method will not be described in detail here. Any instruction execution method that those skilled in the art implement with the computing unit of the embodiments of this disclosure falls within the scope of protection intended by this disclosure.

[0083] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information in this technical solution comply with relevant laws and regulations and do not violate public order and good morals.

[0084] It should be noted that any arrangement or combination of the technical features in the above embodiments also falls within the protection scope of this disclosure.

[0085] In one example, Figure 14 is a structural diagram of a processor provided in an embodiment of this disclosure. As shown in Figure 14, a processor may include at least one computing unit. It should be noted that Figure 14 is merely a schematic diagram of one implementation. The connections between the computing units can be configured as needed, or the computing units can work independently. This disclosure does not limit how the computing units within the processor are connected.

[0086] In one example, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0087] Figure 15 shows a schematic block diagram of an example electronic device 1500 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.

[0088] As shown in Figure 15, device 1500 includes a computing unit 1501, which can be integrated into a processor (not shown in Figure 15). There may be multiple computing units (Figure 15 shows only one). The computing unit 1501 can perform various appropriate actions and processes according to a computer program stored in read-only memory (ROM) 1502 or a computer program loaded from storage unit 1508 into random access memory (RAM) 1503. The RAM 1503 can also store various programs and data required for the operation of the device 1500. The computing unit 1501, ROM 1502, and RAM 1503 are interconnected via bus 1504. The input/output (I/O) interface 1505 is also connected to bus 1504.

[0089] Multiple components in device 1500 are connected to I / O interface 1505, including: input unit 1506, such as keyboard, mouse, etc.; output unit 1507, such as various types of monitors, speakers, etc.; storage unit 1508, such as disk, optical disk, etc.; and communication unit 1509, such as network card, modem, wireless transceiver, etc. Communication unit 1509 allows device 1500 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0090] Computing unit 1501 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of computing unit 1501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Computing unit 1501 performs the various methods and processes described above, such as instruction execution methods.

[0091] Optionally, the instruction execution method may include: receiving control signals and vector data generated by decoding a target instruction; selecting a target computing module from each of the computing modules according to the control signals, and determining the data flow direction of the vector data in the target computing module according to the connection network; and sequentially executing the fusion instruction calculation operation of the target instruction through each of the target computing modules according to the control signals and the data flow direction of the vector data in the target computing module.

[0092] For example, in some embodiments, the instruction execution method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 1508. In some embodiments, part or all of the computer program may be loaded and / or installed on device 1500 via ROM 1502 and / or communication unit 1509. When the computer program is loaded into RAM 1503 and executed by computing unit 1501, one or more steps of the instruction execution method described above may be performed. Alternatively, in other embodiments, computing unit 1501 may be configured to execute the instruction execution method by any other suitable means (e.g., by means of firmware).

[0093] Various implementations of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various implementations may include: implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0094] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely on a machine; partially on a machine; as a standalone software package, partially on a machine and partially on a remote machine; or entirely on a remote machine or server.

[0095] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0096] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0097] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or middleware components (e.g., application servers), or frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

[0098] Computer systems can include clients and servers. Clients and servers are generally geographically separated and typically interact via communication networks. The client-server relationship is established by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, also known as cloud computing servers or cloud hosts, which are hosting products within the cloud computing service ecosystem to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability. Servers can also be servers for distributed systems or servers integrated with blockchain technology.

[0099] The embodiments of this disclosure construct a computing unit from multiple computing modules connected by a connection network. The computing unit receives control signals and vector data generated by decoding a target instruction. Based on the control signals, it selects target computing modules from among the computing modules, and based on the connection network, it determines the data flow direction of the vector data within the target computing modules. The computing unit can then sequentially execute the fused instruction calculation operation of the target instruction through each target computing module, according to the control signals and the data flow direction of the vector data. The computing unit provided by the embodiments of this disclosure therefore integrates multiple computing modules and consolidates hardware computing resources, which improves the performance of instruction scheduling within the computing unit and thus the overall computing performance of the computing unit.

[0100] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of this disclosure can be achieved; no limitation is imposed herein.

[0101] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A computing unit, integrated in a processor and comprising multiple computing modules connected to one another via a connection network, wherein the computing unit is configured to:
receive control signals and vector data generated by decoding a target instruction;
select target computing modules from among the computing modules according to the control signals, and determine the data flow direction of the vector data in the target computing modules according to the connection network; and
sequentially execute the fused instruction calculation operation of the target instruction through each of the target computing modules, according to the control signals and the data flow direction of the vector data in the target computing modules.

2. The computing unit according to claim 1, wherein the computing unit includes a first computing unit and a second computing unit;
the first computing unit is configured to: sequentially execute the fused instruction calculation operation of the target instruction through the target computing module of the first computing unit, according to the data flow direction of the vector data in the target computing module of the first computing unit, to obtain an intermediate instruction fusion calculation result; and
the second computing unit is configured to: sequentially execute the fused instruction calculation operation of the target instruction through the target computing module of the second computing unit, according to the data flow direction of the vector data in the target computing module of the second computing unit and the intermediate instruction fusion calculation result, to obtain a target instruction fusion calculation result.

3. The computing unit according to claim 2, wherein the first computing unit further comprises a first conversion bypass module, a second conversion bypass module, a first multiplexing module, and a second multiplexing module, and the computing modules of the first computing unit comprise a first fusion multiply-accumulate module and an exponentiation operation module; wherein:
the input terminal of the first conversion bypass module is configured to receive multiple input signals, and the output terminal of the first conversion bypass module is connected to the input terminal of the first fusion multiply-accumulate module and to the input terminal of the first multiplexing module; the first conversion bypass module is configured to select a first target input signal from the multiple input signals according to the control signals and to send the first target input signal to the first fusion multiply-accumulate module;
the output terminal of the first fusion multiply-accumulate module is connected to the input terminal of the first multiplexing module and to the input terminal of the second multiplexing module; the first fusion multiply-accumulate module is configured to perform a fusion multiply-accumulate calculation on the first target input signal according to the control signals to obtain a first fusion multiply-accumulate calculation result, and to send the first fusion multiply-accumulate calculation result to the first multiplexing module or the second multiplexing module according to the control signals;
the output terminal of the first multiplexing module is connected to the input terminal of the exponentiation operation module; the first multiplexing module is configured to send the output signal of the first conversion bypass module or the first fusion multiply-accumulate calculation result to the exponentiation operation module;
the output terminal of the exponentiation operation module is connected to the input terminal of the second multiplexing module; the exponentiation operation module is configured to perform an exponentiation operation on the received signal to obtain an exponentiation result, and to send the exponentiation result to the second multiplexing module;
the input terminal of the second multiplexing module is also connected to the output terminal of the first conversion bypass module, and the output terminal of the second multiplexing module is connected to the input terminal of the second conversion bypass module and to the input terminal of a third multiplexing module of the second computing unit; the second multiplexing module is configured to send the received input signal to the second conversion bypass module and to the third multiplexing module of the second computing unit; and
the second conversion bypass module is configured to generate the intermediate instruction fusion calculation result based on its input signal.

4. The computing unit according to claim 3, wherein the second computing unit further comprises a third conversion bypass module, a fourth conversion bypass module, the third multiplexing module, and a fourth multiplexing module, and the computing module of the second computing unit comprises a second fusion multiply-accumulate module; wherein:
the input terminal of the third conversion bypass module is configured to receive multiple input signals, and the output terminal of the third conversion bypass module is connected to the input terminal of the second fusion multiply-accumulate module and to the input terminal of the third multiplexing module; the third conversion bypass module is configured to select a second target input signal from the multiple input signals according to the control signals and to send the second target input signal to the second fusion multiply-accumulate module;
the input terminal of the third multiplexing module is connected to the output terminal of the third conversion bypass module and to the output terminal of the second multiplexing module of the first computing unit, and the output terminal of the third multiplexing module is connected to the input terminal of the second fusion multiply-accumulate module; the third multiplexing module is configured to send the received signal to the second fusion multiply-accumulate module;
the output terminal of the second fusion multiply-accumulate module is connected to the input terminal of the fourth multiplexing module; the second fusion multiply-accumulate module is configured to perform a fusion multiply-accumulate calculation on the second target input signal according to the control signals to obtain a second fusion multiply-accumulate calculation result, and to send the second fusion multiply-accumulate calculation result to the fourth multiplexing module;
the input terminal of the fourth multiplexing module is also connected to the output terminal of the third conversion bypass module, and the output terminal of the fourth multiplexing module is connected to the input terminal of the fourth conversion bypass module; the fourth multiplexing module is configured to send the received signal to the fourth conversion bypass module; and
the fourth conversion bypass module is configured to generate the target instruction fusion calculation result based on its input signal.

5. The computing unit according to claim 4, wherein the first computing unit is further configured to:
control, according to the control signals, the first conversion bypass module to select a first input signal and a second input signal and to send the first input signal and the second input signal to the first fusion multiply-accumulate module;
control, according to the control signals, the first fusion multiply-accumulate module to perform a subtraction on the first input signal and the second input signal to obtain the first fusion multiply-accumulate calculation result;
send, according to the control signals, the first fusion multiply-accumulate calculation result to the first multiplexing module, and control the first multiplexing module to send the first fusion multiply-accumulate calculation result to the exponentiation operation module;
perform, by the exponentiation operation module, an exponentiation operation on the first fusion multiply-accumulate calculation result to obtain the exponentiation result; and
send the exponentiation result through the second multiplexing module to the second conversion bypass module for output, to obtain the intermediate instruction fusion calculation result;
and the second computing unit is further configured to:
control, according to the control signals, the third conversion bypass module to select a sixth input signal, and send the sixth input signal and the exponentiation result to the second fusion multiply-accumulate module;
calculate, by the second fusion multiply-accumulate module, the sum of the sixth input signal and the exponentiation result to obtain the second fusion multiply-accumulate calculation result; and
send, according to the control signals, the second fusion multiply-accumulate calculation result to the fourth conversion bypass module, so that the fourth conversion bypass module outputs the target instruction fusion calculation result according to the second fusion multiply-accumulate calculation result.
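As a non-limiting illustration of the data path recited in claim 5, the fused operation amounts to computing exp(in1 - in2) in the first computing unit and adding in6 in the second. The sketch below models this element-wise on vectors. The function name `fused_sub_exp_add` and the parameter names are hypothetical, and the use of the natural exponential is an assumption, since the claim does not fix the base.

```python
import math
from typing import List, Tuple

Vector = List[float]


def fused_sub_exp_add(in1: Vector, in2: Vector, in6: Vector) -> Tuple[Vector, Vector]:
    """Return (intermediate result, target result) for the claim 5 data path."""
    # First computing unit: the first fusion multiply-accumulate module subtracts.
    diff = [a - b for a, b in zip(in1, in2)]
    # The first multiplexing module forwards the difference to the
    # exponentiation operation module (assumed here to compute e**x).
    exp_result = [math.exp(d) for d in diff]
    # The second multiplexing module forwards the exponentiation result both to
    # the second conversion bypass module (intermediate output) and onward.
    intermediate = exp_result
    # Second computing unit: the second fusion multiply-accumulate module adds in6.
    target = [f + e for f, e in zip(in6, exp_result)]
    return intermediate, target


intermediate, target = fused_sub_exp_add([2.0, 3.0], [2.0, 3.0], [1.0, -1.0])
```

A subtract, exponentiate, accumulate chain of this shape appears, for example, in numerically stable softmax computations, which is one plausible use of such a fused instruction.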

6. The computing unit according to claim 4, wherein the first computing unit is further configured to:
control, according to the control signals, the first conversion bypass module to select a first input signal, a second input signal, and a third input signal and to send the first input signal, the second input signal, and the third input signal to the first fusion multiply-accumulate module;
control, according to the control signals, the first fusion multiply-accumulate module to multiply the first input signal by the second input signal and to calculate the sum of the multiplication result and the third input signal, to obtain the first fusion multiply-accumulate calculation result; and
send, according to the control signals, the first fusion multiply-accumulate calculation result to the second multiplexing module, and control the second multiplexing module to send the first fusion multiply-accumulate calculation result to the second conversion bypass module and the third conversion bypass module, so that the second conversion bypass module outputs the intermediate instruction fusion calculation result according to the first fusion multiply-accumulate calculation result;
and the second computing unit is further configured to:
control, according to the control signals, the third conversion bypass module to select a fourth input signal and a fifth input signal, and send the fourth input signal, the fifth input signal, and the first fusion multiply-accumulate calculation result to the second fusion multiply-accumulate module;
calculate, by the second fusion multiply-accumulate module, a first product value of the fourth input signal and the fifth input signal, and calculate the sum of the first product value and the first fusion multiply-accumulate calculation result, to obtain the second fusion multiply-accumulate calculation result; and
send, according to the control signals, the second fusion multiply-accumulate calculation result to the fourth conversion bypass module, so that the fourth conversion bypass module outputs the target instruction fusion calculation result according to the second fusion multiply-accumulate calculation result.
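As a non-limiting illustration of the data path recited in claim 6, the two computing units chain two multiply-accumulate steps: the first computes in1 * in2 + in3, and the second computes in4 * in5 plus that result. The sketch below models this element-wise. The function and parameter names are hypothetical; note also that the plain Python expression `a * b + c` rounds twice, whereas a hardware fused multiply-add typically rounds once, so this models the data flow rather than bit-exact results.

```python
from typing import List, Tuple

Vector = List[float]


def fused_mac_chain(
    in1: Vector, in2: Vector, in3: Vector, in4: Vector, in5: Vector
) -> Tuple[Vector, Vector]:
    """Return (intermediate result, target result) for the claim 6 data path."""
    # First computing unit: in1 * in2 + in3.
    first = [a * b + c for a, b, c in zip(in1, in2, in3)]
    # The second multiplexing module forwards the first result to the second
    # computing unit without a register write-back.
    # Second computing unit: in4 * in5 + first result.
    second = [d * e + r for d, e, r in zip(in4, in5, first)]
    return first, second


intermediate, target = fused_mac_chain(
    [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [2.0, 2.0], [1.0, 0.5]
)
```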

7. The computing unit according to claim 4, wherein the first computing unit is further configured to:
control, according to the control signals, the first conversion bypass module to select a first input signal and a third input signal and to send the first input signal and the third input signal to the first fusion multiply-accumulate module;
control, according to the control signals, the first fusion multiply-accumulate module to perform an addition on the first input signal and the third input signal to obtain the first fusion multiply-accumulate calculation result; and
send, according to the control signals, the first fusion multiply-accumulate calculation result to the second multiplexing module, and control the second multiplexing module to send the first fusion multiply-accumulate calculation result to the second conversion bypass module and the third conversion bypass module, so that the second conversion bypass module outputs the intermediate instruction fusion calculation result according to the first fusion multiply-accumulate calculation result;
and the second computing unit is further configured to:
control, according to the control signals, the third conversion bypass module to select a fourth input signal and a sixth input signal, and send the fourth input signal, the sixth input signal, and the first fusion multiply-accumulate calculation result to the second fusion multiply-accumulate module;
calculate, by the second fusion multiply-accumulate module, a second product value of the fourth input signal and the first fusion multiply-accumulate calculation result, and calculate the sum of the second product value and the sixth input signal, to obtain the second fusion multiply-accumulate calculation result; and
send, according to the control signals, the second fusion multiply-accumulate calculation result to the fourth conversion bypass module, so that the fourth conversion bypass module outputs the target instruction fusion calculation result according to the second fusion multiply-accumulate calculation result.
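As a non-limiting illustration of the data path recited in claim 7, the first computing unit computes in1 + in3, and the second computing unit multiplies that sum by in4 and adds in6. The sketch below models this element-wise; the function and parameter names are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

Vector = List[float]


def fused_add_then_mac(
    in1: Vector, in3: Vector, in4: Vector, in6: Vector
) -> Tuple[Vector, Vector]:
    """Return (intermediate result, target result) for the claim 7 data path."""
    # First computing unit: the first fusion multiply-accumulate module adds.
    first = [a + c for a, c in zip(in1, in3)]
    # Second computing unit: in4 * (first result) + in6.
    second = [d * r + f for d, r, f in zip(in4, first, in6)]
    return first, second


intermediate, target = fused_add_then_mac(
    [1.0, 2.0], [3.0, 4.0], [0.5, 2.0], [1.0, 1.0]
)
```

Compared with the claim 6 sketch, the forwarded result here enters the second module as a multiplicand rather than as the addend, which shows how the same two modules serve different fused instructions depending on the control signals.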

8. The computing unit according to claim 1, wherein the target computing modules of the computing unit include a branch judgment module, a multiplication calculation module, a logical operation module, and an addition calculation module; wherein:
the input terminal of the branch judgment module is configured to receive a first input signal, a second input signal, and a third input signal, and the output terminal of the branch judgment module is connected to the input terminal of the multiplication calculation module; the branch judgment module is configured to determine the input value of each input signal according to the control signals, to select an intermediate signal according to the magnitude relationship between the first input signal and the second input signal, and to send the intermediate signal and the third input signal to the multiplication calculation module;
the output terminal of the multiplication calculation module is connected to the input terminal of the addition calculation module; the multiplication calculation module is configured to calculate a third product value of the intermediate signal and the third input signal, and to send the third product value to the addition calculation module;
the input terminal of the logical operation module is configured to receive a fourth input signal, and the output terminal of the logical operation module is connected to the input terminal of the addition calculation module; the logical operation module is configured to perform a logical operation on the fourth input signal to obtain a logical operation result, to send the logical operation result to the addition calculation module, and to output the logical operation result as the intermediate instruction fusion calculation result; and
the addition calculation module is configured to add the third product value and the logical operation result to obtain the target instruction fusion calculation result.
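As a non-limiting illustration of the data path recited in claim 8, the sketch below models the four modules on integer vectors. Two points are assumptions made only for this example, because the claim leaves them open: the branch judgment module is taken to select the larger of the first and second input signals, and the logical operation is taken to be a bitwise AND with the mask 0xFF. All function and parameter names are hypothetical.

```python
from typing import Callable, List, Tuple

IntVector = List[int]


def fused_select_mul_add(
    in1: IntVector,
    in2: IntVector,
    in3: IntVector,
    in4: IntVector,
    logic_op: Callable[[int], int] = lambda x: x & 0xFF,  # assumed logical operation
) -> Tuple[IntVector, IntVector]:
    """Return (intermediate result, target result) for the claim 8 data path."""
    # Branch judgment module: pick the intermediate signal by comparing in1 and in2
    # (assumption: the larger value is selected).
    selected = [a if a >= b else b for a, b in zip(in1, in2)]
    # Multiplication calculation module: third product value.
    product = [s * c for s, c in zip(selected, in3)]
    # Logical operation module: its result is also the intermediate output.
    logic_result = [logic_op(d) for d in in4]
    # Addition calculation module: target result.
    target = [p + l for p, l in zip(product, logic_result)]
    return logic_result, target


intermediate, target = fused_select_mul_add(
    [3, 1], [2, 5], [10, 10], [0x1FF, 4]
)
```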

9. An instruction execution method applied to a computing unit within a processor, comprising:
receiving control signals and vector data generated by decoding a target instruction;
selecting target computing modules from among the computing modules according to the control signals, and determining the data flow direction of the vector data in the target computing modules according to the connection network; and
sequentially executing the fused instruction calculation operation of the target instruction through each of the target computing modules, according to the control signals and the data flow direction of the vector data in the target computing modules.

10. The method according to claim 9, wherein the computing unit includes a first computing unit and a second computing unit, and the step of sequentially executing the fused instruction calculation operation of the target instruction through each of the target computing modules, according to the control signals and the data flow direction of the vector data in the target computing modules, includes:
sequentially executing, by the first computing unit, the fused instruction calculation operation of the target instruction through the target computing module of the first computing unit, according to the data flow direction of the vector data in the target computing module of the first computing unit, to obtain an intermediate instruction fusion calculation result; and
sequentially executing, by the second computing unit, the fused instruction calculation operation of the target instruction through the target computing module of the second computing unit, according to the data flow direction of the vector data in the target computing module of the second computing unit and the intermediate instruction fusion calculation result, to obtain a target instruction fusion calculation result.

11. A processor comprising at least one computing unit as described in any one of claims 1-8.

12. An electronic device comprising at least one processor as claimed in claim 11, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the instruction execution method according to any one of claims 9-10.

13. A non-transitory computer-readable storage medium storing computer instructions, said computer instructions being used to cause a computer to perform the instruction execution method according to any one of claims 9-10.

14. A computer program product comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the instruction execution method according to any one of claims 9-10.