Computing devices, integrated circuit chips, board cards, electronic devices, and computing methods
By means of a processing circuit array hardware architecture, computing chips achieve efficient multi-threaded and multi-stage pipelined operation, which addresses the insufficient flexibility and efficiency of existing technologies, improves computing performance, and reduces power consumption.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: SHANGHAI CAMBRICON INFORMATION TECH CO LTD
- Filing Date: 2020-06-30
- Publication Date: 2026-06-26
AI Technical Summary
Existing computing chip instruction sets are inadequate in terms of flexibility, execution speed, execution efficiency, and power consumption, leading to increased on-chip I/O data throughput.
The disclosure adopts a processing circuit array hardware architecture: multiple processing circuits form a one-dimensional or multi-dimensional array that can be configured into multiple processing circuit subarrays, supporting multi-threaded operation, multi-stage pipelined operation, and flexible execution of computation instructions.
This improves computing performance, reduces computational overhead and I/O data throughput, and enhances the adaptability and execution efficiency of the hardware architecture.
Description
Technical Field
[0001] This disclosure generally relates to the field of computing. More specifically, this disclosure relates to a computing device, an integrated circuit chip, a circuit board, an electronic device, and a computing method.
Background Technology
[0002] In computing systems, an instruction set is a collection of instructions used to perform calculations and control the computing system, and it plays a crucial role in improving the performance of computing chips (such as processors). Current computing chips (especially those in the field of artificial intelligence) use associated instruction sets to perform various general or specific control and data processing operations. However, current instruction sets still have several shortcomings. For example, existing instruction sets are limited by the hardware architecture, resulting in poor flexibility. Furthermore, many instructions can only perform a single operation, so executing multiple operations typically requires multiple instructions, potentially increasing on-chip I/O throughput. Additionally, current instructions still leave room for improvement in execution speed, efficiency, and power consumption.
Summary of the Invention
[0003] To at least address the problems existing in the prior art, this disclosure provides a hardware architecture with a processing circuit array. By utilizing this hardware architecture to execute computational instructions, the solution disclosed herein can achieve technical advantages in multiple aspects, including enhancing hardware processing performance, reducing power consumption, improving the execution efficiency of computational operations, and avoiding computational overhead.
[0004] In a first aspect, this disclosure provides a computing device, comprising: a processing circuit array consisting of a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, wherein the processing circuit array is configured as a plurality of processing circuit subarrays and performs multi-threaded operations in response to receiving a plurality of arithmetic instructions, and each processing circuit subarray is configured to execute at least one of the plurality of arithmetic instructions, wherein the plurality of arithmetic instructions are obtained by parsing the computation instructions received by the computing device.
[0005] In a second aspect, this disclosure provides an integrated circuit chip that includes a computing device as described above and which will be described in several embodiments below.
[0006] In a third aspect, this disclosure provides a board that includes an integrated circuit chip as described above and which will be described in several embodiments below.
[0007] In a fourth aspect, this disclosure provides an electronic device that includes an integrated circuit chip as described above and which will be described in several embodiments below.
[0008] In a fifth aspect, this disclosure provides a method for performing computation using the aforementioned computing device, wherein the computing device includes a processing circuit array consisting of a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, and the processing circuit array is configured as a plurality of processing circuit subarrays. The method includes: receiving computation instructions at the computing device and parsing them to obtain a plurality of arithmetic instructions; and, in response to receiving the plurality of arithmetic instructions, performing multi-stage pipelined computation using the plurality of processing circuit subarrays, wherein each of the plurality of processing circuit subarrays is configured to execute at least one of the plurality of arithmetic instructions.
[0009] By using the computing device, integrated circuit chip, board, electronic device, and method disclosed above, an appropriate processing circuit array can be constructed according to computing requirements, thereby efficiently executing computation instructions, reducing computational overhead, and reducing I/O data throughput. Furthermore, since the disclosed processing circuits can be configured to support the corresponding operations according to computational requirements, the number of operands of the disclosed computation instructions can be increased or decreased according to computational requirements, and the opcode types can be arbitrarily selected and combined from the operation types supported by the processing circuit array, thereby expanding the application scenarios and adaptability of the hardware architecture.
Attached Figure Description
[0010] The above and other objects, features, and advantages of this disclosure will become readily apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings. In the drawings, several embodiments of this disclosure are illustrated by way of example and not limitation, and like or corresponding reference numerals denote like or corresponding parts, wherein:
[0011] Figure 1 is a block diagram illustrating a computing device according to one embodiment of the present disclosure;
[0012] Figure 2a is a block diagram illustrating a computing device according to another embodiment of the present disclosure;
[0013] Figure 2b is a block diagram illustrating a computing device according to yet another embodiment of the present disclosure;
[0014] Figure 3 is a block diagram illustrating a computing device according to yet another embodiment of the present disclosure;
[0015] Figure 4 is an example structural diagram illustrating various types of processing circuit arrays of a computing device according to embodiments of this disclosure;
[0016] Figures 5a, 5b, 5c, and 5d are schematic diagrams illustrating various connection relationships of multiple processing circuits according to embodiments of the present disclosure;
[0017] Figures 6a, 6b, 6c, and 6d are schematic diagrams illustrating various additional connection relationships of multiple processing circuits according to embodiments of the present disclosure;
[0018] Figures 7a, 7b, 7c, and 7d are schematic diagrams illustrating various ring structures of the processing circuit according to embodiments of the present disclosure;
[0019] Figures 8a, 8b, and 8c are schematic diagrams illustrating various other ring structures of the processing circuit according to embodiments of the present disclosure;
[0020] Figures 9a, 9b, 9c, and 9d are schematic diagrams illustrating data splicing operations performed by the pre-operation circuit according to embodiments of the present disclosure;
[0021] Figures 10a, 10b, and 10c are schematic diagrams illustrating data compression operations performed by the post-operation circuit according to embodiments of the present disclosure;
[0022] Figure 11 is a simplified flowchart illustrating a method for performing computational operations using a computing device according to an embodiment of this disclosure;
[0023] Figure 12 is a structural diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure; and
[0024] Figure 13 is a schematic diagram illustrating the structure of a circuit board according to an embodiment of the present disclosure.
Detailed Implementation
[0025] This disclosure provides a hardware architecture supporting multi-threaded computation. When this hardware architecture is implemented in a computing device, the computing device includes at least multiple processing circuits, which are connected according to different configurations to form a one-dimensional or multi-dimensional array structure. Depending on the implementation, the processing circuit array can be configured into multiple processing circuit subarrays, and each processing circuit subarray can be configured to execute at least one of multiple computation instructions. With the help of the hardware architecture and computation instructions disclosed herein, computational operations can be performed efficiently, expanding the application scenarios of computing and reducing computational overhead.
[0026] The technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this disclosure, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without creative effort are within the scope of protection of this disclosure.
[0027] Figure 1 is a block diagram illustrating a computing device 80 according to one embodiment of the present disclosure. As shown in Figure 1, the computing device 80 includes a processing circuit array formed by a plurality of processing circuits 104. Specifically, the plurality of processing circuits are connected in a two-dimensional array structure to form the processing circuit array, which includes a plurality of processing circuit subarrays, such as the one-dimensional processing circuit subarrays M1, M2, ..., Mn shown in the figure. It should be understood that the two-dimensional processing circuit array and its multiple one-dimensional processing circuit subarrays are merely exemplary and not restrictive. The processing circuit array disclosed herein can be configured into array structures with different dimensions depending on the computing scenario, and one or more closed loops can be formed within or between multiple processing circuit subarrays, as in the exemplary connections illustrated in Figures 5 to 8, which will be described later.
[0028] In one embodiment, in response to receiving multiple arithmetic instructions, the processing circuit array disclosed herein can be configured to perform multi-threaded operations, such as executing Single Instruction, Multiple Threads (“SIMT”) instructions. Further, each processing circuit subarray can be configured to execute at least one of the aforementioned multiple arithmetic instructions. In the context of this disclosure, the aforementioned multiple arithmetic instructions can be microinstructions or control signals running within a computing device (or processing circuit, processor), which may include (or indicate) one or more arithmetic operations to be performed by the computing device. Depending on the scenario, the arithmetic operations may include, but are not limited to, addition operations, multiplication operations, convolution operations, pooling operations, and various other operations.
[0029] In one embodiment, the aforementioned multiple arithmetic instructions may include at least one multi-stage pipelined operation. In one scenario, the aforementioned multi-stage pipelined operation may include at least two arithmetic instructions. Depending on different execution requirements, the arithmetic instructions disclosed herein may include predicates, and each processing circuit determines whether to execute the associated arithmetic instruction based on the predicate. The processing circuits disclosed herein can flexibly perform various arithmetic operations according to their configuration, including but not limited to arithmetic operations, logical operations, comparison operations, and table lookup operations.
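The predicate mechanism described above can be pictured with a small software model. The Python sketch below is only an illustration under assumptions: the function name run_predicated is hypothetical, each list element stands in for one processing circuit, and a circuit whose predicate is false is assumed to pass its value through unchanged (the disclosure does not specify this behavior).

```python
def run_predicated(op, lanes, predicates):
    """Apply op on each lane whose predicate is true.

    Assumption for illustration: a lane with a false predicate keeps
    its current value instead of executing the instruction.
    """
    if len(lanes) != len(predicates):
        raise ValueError("one predicate is required per lane")
    return [op(value) if pred else value
            for value, pred in zip(lanes, predicates)]


# Only lanes 0 and 2 execute the doubling operation.
doubled = run_predicated(lambda x: x * 2, [1, 2, 3, 4],
                         [True, False, True, False])
```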
[0030] Take as an example the case where the processing circuit array shown in Figure 1 includes processing circuit subarrays M1 to Mn and performs an n-stage pipelined operation. Processing circuit subarray M1 can act as the first-stage pipeline unit, processing circuit subarray M2 can act as the second-stage pipeline unit, and so on, so that processing circuit subarray Mn serves as the nth-stage pipeline unit of the pipelined operation. During execution of the n-stage pipelined operation, the stages can be executed from top to bottom, starting from the first-stage pipeline unit, until the nth-stage operation is completed.
[0031] Based on the exemplary description of the processing circuit subarray above, it can be understood that the processing circuit array disclosed herein may be a one-dimensional array in some scenarios, and one or more processing circuits in the processing circuit array are configured as a processing circuit subarray. In other scenarios, the processing circuit array disclosed herein is a two-dimensional array, and one or more rows of processing circuits in the processing circuit array are configured as a processing circuit subarray; or one or more columns of processing circuits in the processing circuit array are configured as a processing circuit subarray; or one or more rows of processing circuits along the diagonal direction in the processing circuit array are configured as a processing circuit subarray.
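As a rough illustration of the row, column, and diagonal groupings described above, the following Python sketch partitions the coordinates of a two-dimensional array into subarrays. The function name partition and the mode strings are hypothetical, and the wrapped-diagonal grouping used for "diag" is an assumption chosen for simplicity, not a layout taken from the disclosure.

```python
from collections import defaultdict


def partition(rows, cols, mode):
    """Group (row, col) coordinates of a rows x cols array into subarrays.

    mode "row":  one subarray per row
    mode "col":  one subarray per column
    mode "diag": one subarray per wrapped diagonal (assumed layout)
    """
    groups = defaultdict(list)
    for r in range(rows):
        for c in range(cols):
            if mode == "row":
                key = r
            elif mode == "col":
                key = c
            elif mode == "diag":
                key = (c - r) % cols
            else:
                raise ValueError(f"unknown mode: {mode}")
            groups[key].append((r, c))
    return [groups[k] for k in sorted(groups)]


row_subarrays = partition(2, 3, "row")
diag_subarrays = partition(3, 3, "diag")
```

Each returned list could then be treated as one pipeline stage or one thread group, depending on how the array is configured.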
[0032] To achieve multi-stage pipelined computation, this disclosure also provides corresponding computation instructions, and the processing circuit array is configured and constructed based on these instructions. Depending on the computation scenario, the computation instructions disclosed herein may include multiple opcodes, which can represent multiple operations performed by the processing circuit array. For example, when n = 4 in Figure 1 (i.e., when a 4-stage pipelined operation is performed), the computation instruction of this disclosure can be expressed as equation (1):
[0033] Result = convert((((src0 op0 src1) op1 src2) op2 src3) op3 src4)    (1)
[0034] Here src0 to src4 are source operands, op0 to op3 are opcodes, and convert indicates a data conversion operation performed on the data obtained after executing opcode op3. Depending on the implementation, this data conversion operation can be performed by a processing circuit in the processing circuit array, or by another operation circuit, such as the post-operation circuit described in detail later with reference to Figure 3. According to the scheme disclosed herein, since the processing circuits can be configured to support the corresponding operations according to the operational requirements, the number of operands of the computation instructions disclosed herein can be increased or decreased according to the operational requirements, and the opcode types can be arbitrarily selected and combined from the operation types supported by the processing circuit array.
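The nested form of equation (1) amounts to a left-to-right fold over the operands, with one opcode applied per pipeline stage and the conversion applied last. The Python sketch below models that reading; the function name execute and the use of Python callables as opcodes are assumptions made for illustration only.

```python
import operator


def execute(srcs, ops, convert):
    """Evaluate convert((((src0 op0 src1) op1 src2) ...) op_last src_last)."""
    if len(srcs) != len(ops) + 1:
        raise ValueError("need exactly one more operand than opcodes")
    acc = srcs[0]
    for op, src in zip(ops, srcs[1:]):
        acc = op(acc, src)  # one pipeline stage per opcode
    return convert(acc)     # final data conversion step


# Four stages: ((1 + 2) * 3 - 4) + 5, then converted to float.
result = execute(
    [1, 2, 3, 4, 5],
    [operator.add, operator.mul, operator.sub, operator.add],
    float,
)
```

Adding or removing an operand and opcode pair changes the pipeline depth without changing the evaluation scheme, which mirrors the flexibility in operand count described above.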
[0035] Depending on the application scenario, the connection between the multiple processing circuits disclosed herein can be either a hardware-based configuration connection (or "hard connection") or a logical configuration connection (or "soft connection") that is based on specific hardware connections and configured by software (e.g., through configuration instructions). In one embodiment, the processing circuit array can form a closed loop in at least one of its one or more dimensions, i.e., a "loop structure" in the context of this disclosure.
[0036] Figure 2a is a block diagram illustrating a computing device 100 according to another embodiment of this disclosure. As can be seen from the figure, in addition to having the same processing circuitry 104 as computing device 80, computing device 100 also includes a control circuitry 102. In one embodiment, the control circuitry 102 may be configured to acquire and parse the computation instructions described above to obtain the plurality of arithmetic instructions corresponding to the plurality of operations represented by the opcodes, such as those represented by equation (1). In another embodiment, the control circuitry configures the processing circuit array according to the plurality of arithmetic instructions to obtain the plurality of processing circuit subarrays, for example the processing circuit subarrays M1, M2, ..., Mn shown in Figure 1.
[0037] In one application scenario, the control circuit may include a register for storing configuration information, and the control circuit may extract the corresponding configuration information according to the plurality of operation instructions, and configure the processing circuit array according to the configuration information to obtain the plurality of processing circuit sub-arrays.
[0038] In one embodiment, the control circuit may include one or more registers storing configuration information about the processing circuit array. The control circuit is configured to read the configuration information from the registers according to the configuration instructions and send it to the processing circuits so that the processing circuits are connected using the configuration information. In one application scenario, the configuration information may include preset location information of the processing circuits comprising the one or more processing circuit arrays. This location information may, for example, include coordinate information or label information of the processing circuits.
[0039] When the processing circuit array is configured to form a closed loop, the configuration information may further include loop-forming configuration information regarding the processing circuit array forming a closed loop. Alternatively, in one embodiment, the above-mentioned configuration information may be carried directly by configuration instructions rather than read from the register. In this case, the processing circuits can be directly configured according to the position information in the received configuration instructions to form an array without closed loops with other processing circuits or to further construct an array with closed loops.
[0040] When a connection is configured to form a two-dimensional array via configuration instructions or configuration information obtained from a register, the processing circuit located in the two-dimensional array is configured to be connected to one or more other processing circuits in the same row, column, or diagonal in at least one of its row, column, or diagonal directions in a predetermined two-dimensional spacing pattern to form one or more closed loops. Here, the aforementioned predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
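To make the relationship between the spacing pattern and the resulting closed loops concrete, the Python sketch below computes the loops formed within a single row. It rests on assumptions not stated in the disclosure: the function name row_loops is hypothetical, an interval number of k is taken to mean that each circuit connects to the circuit k + 1 positions further along the row, and the connection wraps around at the end of the row.

```python
def row_loops(n, interval):
    """Return the closed loops formed in a row of n processing circuits.

    Assumption: with `interval` circuits skipped per hop, position i
    connects to position (i + interval + 1) % n.
    """
    if n <= 0 or interval < 0:
        raise ValueError("n must be positive and interval non-negative")
    stride = interval + 1
    seen = set()
    loops = []
    for start in range(n):
        if start in seen:
            continue
        loop = []
        pos = start
        while pos not in seen:
            seen.add(pos)
            loop.append(pos)
            pos = (pos + stride) % n
        loops.append(loop)
    return loops


single_loop = row_loops(4, 0)  # every circuit linked to its neighbour
two_loops = row_loops(4, 1)    # one circuit skipped per hop
```

Under this model the number of loops in a row equals the greatest common divisor of n and the stride, so the same physical row can be configured as one long ring or several shorter ones.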
[0041] Furthermore, when configuring the connection to form a three-dimensional array according to the aforementioned configuration instructions or configuration information, the processing circuit array is connected in a loop manner as a three-dimensional array consisting of multiple layers. Each layer includes a two-dimensional array of multiple processing circuits arranged along the row, column, and diagonal directions. The processing circuits located in the three-dimensional array are configured to be connected to one or more other processing circuits in the same row, column, diagonal, or different layers in at least one of the row, column, diagonal, and layer directions with a predetermined three-dimensional spacing pattern to form one or more closed loops. Here, the predetermined three-dimensional spacing pattern is associated with the number of spacings and the number of spacing layers between the processing circuits to be connected.
[0042] Figure 2b is a block diagram illustrating a computing device 200 according to another embodiment of the present disclosure. As can be seen from the figure, in addition to including the same control circuitry 102 and multiple processing circuits 104 as computing device 100, the computing device 200 of Figure 2b also includes a storage circuit 106.
[0043] In one application scenario, the aforementioned storage circuit can be configured with interfaces for data transmission in multiple directions to connect with the multiple processing circuits 104, so that it can store the data to be processed by the processing circuits, the intermediate results obtained during processing, and the final results obtained after processing. Accordingly, in one application scenario, the storage circuit disclosed herein may include a main storage module and/or a main cache module, wherein the main storage module is configured to store data for processing in the processing circuit array and the results of the processing, and the main cache module is configured to cache intermediate results of the processing in the processing circuit array. Furthermore, the storage circuit may also have an interface for data transmission with off-chip storage media, thereby enabling data transfer between on-chip and off-chip systems.
[0044] Figure 3 is a block diagram illustrating a computing device 300 according to yet another embodiment of the present disclosure. As can be seen from the figure, in addition to including the same control circuitry 102, multiple processing circuits 104, and storage circuitry 106 as computing device 200, the computing device 300 of Figure 3 further includes a data manipulation circuit 108, which comprises a pre-operation circuit 110 and a post-operation circuit 112. Based on this hardware architecture, the pre-operation circuit 110 is configured to preprocess the input data of at least one of the arithmetic instructions, while the post-operation circuit 112 is configured to post-process the output data of at least one of the arithmetic instructions. In one embodiment, the preprocessing performed by the pre-operation circuit may include data placement and/or table lookup operations, while the post-processing performed by the post-operation circuit may include data type conversion and/or compression operations.
[0045] In one application scenario, during a table lookup operation, the pre-operation circuitry is configured to look up one or more tables using an index value to obtain one or more constant entries associated with the operand from the one or more tables. Alternatively or additionally, the pre-operation circuitry is configured to determine the associated index value using the operand, and to look up one or more tables using the index value to obtain one or more constant entries associated with the operand from the one or more tables.
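The two lookup variants described above (a directly supplied index value, or an index value derived from the operand) can be sketched as follows. This is a minimal Python illustration; the function names lookup and lookup_by_operand, the list-based tables, and the index_of callable are all hypothetical and not taken from the disclosure.

```python
def lookup(tables, index):
    """Fetch one constant entry from each table using a given index value."""
    return [table[index] for table in tables]


def lookup_by_operand(tables, operand, index_of):
    """Derive the index value from the operand, then look up each table."""
    return lookup(tables, index_of(operand))


# Two small constant tables (illustrative values only).
tables = [[10, 20, 30], [1, 2, 3]]
direct = lookup(tables, 1)
derived = lookup_by_operand(tables, 2.7, int)  # int(2.7) == 2
```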
[0046] In one application scenario, the pre-operation circuit can split the computational data according to the type of the computational data and the logical addresses of each processing circuit, and then transmit the resulting multiple sub-data to the corresponding processing circuits in the array for computation. In another application scenario, the pre-operation circuit can select a data concatenation mode from multiple data concatenation modes according to the parsed instructions to perform a concatenation operation on the two input data. In one application scenario, the post-operation circuit can be configured to perform a data compression operation, which includes filtering the data using a mask or by comparing a given threshold with the data size, thereby achieving data compression.
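The two filtering approaches to data compression mentioned above can be sketched in Python as follows. The function names are hypothetical, and the threshold rule (keep an element when its absolute value is at least the threshold) is an assumed interpretation of comparing a threshold with the data size.

```python
def compress_by_mask(data, mask):
    """Keep only the elements whose mask bit is set."""
    if len(data) != len(mask):
        raise ValueError("mask must have one bit per data element")
    return [d for d, m in zip(data, mask) if m]


def compress_by_threshold(data, threshold):
    """Keep elements whose magnitude reaches the threshold (assumed rule)."""
    return [d for d in data if abs(d) >= threshold]


masked = compress_by_mask([5, 0, 7, 0], [1, 0, 1, 0])
thresholded = compress_by_threshold([0.1, -2.0, 0.5, 3.0], 0.5)
```

In both cases the output is shorter than the input, which is what reduces the amount of data that has to be written back or transferred.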
[0047] Based on the hardware architecture of Figure 3 above, the computing device disclosed herein can execute computation instructions that include the aforementioned preprocessing and post-processing. Therefore, the data conversion operation of the computation instruction expressed in equation (1) above can be executed by the aforementioned post-operation circuit. Two examples of computation instructions according to the scheme disclosed herein are given below:
[0048] Example 1: TMUADCO = MULT + ADD + RELU(N) + CONVERTFP2FIX    (2)
[0049] The instruction expressed in equation (2) above is a computation instruction that takes three operands as input and outputs a single operand, and it can be completed by a processing circuit array according to this disclosure that includes a three-stage pipeline (i.e., multiplication + addition + activation). Specifically, the ternary operation is A*B+C. The microinstruction MULT performs the multiplication of operands A and B to obtain the product value; this is the first pipeline stage. Next, the microinstruction ADD adds C to the product value to obtain the summation result "N"; this is the second pipeline stage. Then the activation operation RELU is performed on that result; this is the third pipeline stage. After this three-stage pipeline, the microinstruction CONVERTFP2FIX can be executed by the post-operation circuit described above to convert the result of the activation operation from a floating-point number to a fixed-point number, so that it can be output as the final result or fed as an intermediate result to a fixed-point arithmetic unit for further computation.
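The data flow of Example 1 can be traced with a short Python sketch. The function name tmuadco mirrors the instruction name for readability only. The fixed-point format is an assumption: the disclosure does not state one, so the sketch scales by 2**frac_bits and rounds with Python's built-in round, with frac_bits being a hypothetical parameter.

```python
def tmuadco(a, b, c, frac_bits=8):
    """Model of MULT + ADD + RELU + CONVERTFP2FIX on scalar inputs."""
    product = a * b            # stage 1: MULT
    n = product + c            # stage 2: ADD, summation result "N"
    activated = max(n, 0.0)    # stage 3: RELU
    # Post-operation: floating point to fixed point.
    # Assumed format: integer with frac_bits fractional bits.
    return round(activated * (1 << frac_bits))


positive_case = tmuadco(1.5, 2.0, 0.25)   # 3.25 scaled by 256
negative_case = tmuadco(-1.0, 2.0, 0.5)   # clipped to zero by RELU
```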
[0050] Example 2: TSEADMUAD=SEARCHADD+MULT+ADD (3)
[0051] The instruction expressed in equation (3) above is a computation instruction that takes three operands as input and outputs a single operand, and it includes microinstructions that can be completed by a processing circuit array according to this disclosure that includes a two-stage pipeline (i.e., multiplication + addition). Specifically, the ternary operation is A*B+C. The SEARCHADD microinstruction can be performed by the pre-operation circuit to obtain the lookup result A. The first pipeline stage then multiplies operands A and B to obtain the product value. Next, the ADD microinstruction adds C to the product value to obtain the summation result "N"; this is the second pipeline stage.
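Example 2 differs from Example 1 in that operand A comes from a lookup performed before the pipeline starts. The Python sketch below models this under assumptions: the function name tseadmuad mirrors the instruction name, and SEARCHADD is modeled as a simple indexed read from a list, which is an illustrative guess rather than the behavior defined by the disclosure.

```python
def tseadmuad(table, index, b, c):
    """Model of SEARCHADD + MULT + ADD on scalar inputs."""
    a = table[index]     # pre-operation: SEARCHADD yields lookup result A
    product = a * b      # stage 1: MULT
    return product + c   # stage 2: ADD, summation result "N"


n_value = tseadmuad([2.0, 4.0], 1, 3.0, 1.0)  # 4.0 * 3.0 + 1.0
```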
[0052] As mentioned above, the computation instructions disclosed herein can be flexibly designed and determined according to the computation requirements. Thus, the hardware architecture disclosed herein, which includes multiple processing circuit subarrays, can be designed and connected according to the computation instructions and the specific operations they perform, thereby improving the execution efficiency of the instructions and reducing computational overhead.
[0053] Figure 4 is an example structural diagram illustrating various types of processing circuit arrays in a computing device 400 according to an embodiment of this disclosure. As can be seen from the figure, the computing device 400 shown in Figure 4 has an architecture similar to that of the computing device 300 shown in Figure 3. The description of the computing device 300 of Figure 3 therefore also applies to the same details shown in Figure 4, and those details are not repeated below.
[0054] As can be seen from Figure 4, the multiple processing circuits may include, for example, multiple first-type processing circuits 104-1 and multiple second-type processing circuits 104-2 (distinguished by different background colors in the figure). These processing circuits can be physically connected and arranged to form a two-dimensional array. For example, as shown in the figure, the two-dimensional array has M rows and N columns (denoted M*N) of first-type processing circuits, where M and N are positive integers. The first-type processing circuits can be used to perform arithmetic and logical operations, such as linear operations like addition, subtraction, and multiplication, nonlinear operations like comparison and AND, OR, and NOT, or any combination of these operations. Furthermore, on each of the left and right sides of the outer perimeter of the M*N first-type processing circuit array there are two columns of second-type processing circuits, totaling (M*2+M*2), and on the lower outer perimeter there are two rows of second-type processing circuits, totaling (N*2+8), so that the processing circuit array has a total of (M*2+M*2+N*2+8) second-type processing circuits. In one embodiment, the second-type processing circuits can be used to perform nonlinear operations on the received data, such as comparison operations, table lookup operations, or shift operations. In one or more embodiments, the first-type processing circuits can form the first processing circuit subarray disclosed herein, and the second-type processing circuits can form the second processing circuit subarray disclosed herein, in order to perform multi-threaded operations.
In one scenario, when the multi-threaded operation involves multiple operation instructions that together constitute one multi-stage pipelined operation, the first processing circuit subarray can execute several stages of that pipelined operation, while the second processing circuit subarray executes the remaining stages. In another scenario, when the multi-threaded operation involves multiple operation instructions that constitute two multi-stage pipelined operations, the first processing circuit subarray can execute the first multi-stage pipelined operation, while the second processing circuit subarray executes the second.
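The count of second-type processing circuits given above, (M*2+M*2+N*2+8), which simplifies to 4M + 2N + 8, can be checked by building the layout explicitly. The Python sketch below assumes a grid of M + 2 rows by N + 4 columns in which the first-type circuits occupy the top M rows of the middle N columns; this placement and the names build_layout and count_type are illustrative assumptions consistent with the stated totals.

```python
def build_layout(m, n):
    """Return a grid marking first-type (1) and second-type (2) circuits.

    Assumed layout: two second-type columns on each side and two
    second-type rows below the m x n block of first-type circuits.
    """
    layout = []
    for r in range(m + 2):
        row = []
        for c in range(n + 4):
            inside = r < m and 2 <= c < n + 2
            row.append(1 if inside else 2)
        layout.append(row)
    return layout


def count_type(layout, kind):
    """Count the circuits of the given type in the layout."""
    return sum(row.count(kind) for row in layout)


layout = build_layout(3, 5)
first_count = count_type(layout, 1)    # M * N
second_count = count_type(layout, 2)   # 4M + 2N + 8
```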
[0055] In some application scenarios, the storage circuits used in the first type of processing circuit and the second type of processing circuit can have different storage scales and storage methods. For example, the predicate storage circuit in the first type of processing circuit can use multiple numbered registers to store predicate information. Furthermore, the first type of processing circuit can access the predicate information in the register corresponding to the number specified in the received parsed instruction. As another example, the second type of processing circuit can use static random access memory (“SRAM”) to store the predicate information. Specifically, the second type of processing circuit can determine the storage address of the predicate information in the SRAM based on the offset of the location of the predicate information specified in the received parsed instruction, and can perform predetermined read or write operations on the predicate information at that storage address.
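The two predicate storage schemes can be contrasted with a small Python model. The class names, the zero-initialized storage, and in particular the address rule (a base address plus the offset carried by the instruction) are assumptions for illustration; the disclosure only states that the SRAM address is determined from the offset.

```python
class RegisterPredicateStore:
    """First-type style: predicates held in numbered registers."""

    def __init__(self, count):
        self.regs = [0] * count

    def read(self, number):
        return self.regs[number]

    def write(self, number, value):
        self.regs[number] = value


class SramPredicateStore:
    """Second-type style: predicates held in SRAM, addressed by offset."""

    def __init__(self, size, base=0):
        self.mem = [0] * size
        self.base = base  # hypothetical base address of the predicate area

    def address(self, offset):
        # Assumed rule: storage address = base + offset from the instruction.
        return self.base + offset

    def read(self, offset):
        return self.mem[self.address(offset)]

    def write(self, offset, value):
        self.mem[self.address(offset)] = value


regs = RegisterPredicateStore(4)
regs.write(2, 1)
sram = SramPredicateStore(size=16, base=8)
sram.write(3, 1)
```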
[0056] Figures 5a, 5b, 5c, and 5d are schematic diagrams illustrating various connection relationships of multiple processing circuits according to embodiments of the present disclosure. As previously described, the multiple processing circuits of the present disclosure can be connected by hard-wired connections or by logical connections according to configuration instructions, thereby forming a connected one-dimensional or multi-dimensional array topology. When multiple processing circuits are connected in a multi-dimensional array, the multi-dimensional array can be a two-dimensional array, and a processing circuit located in the two-dimensional array can be connected, in at least one of its row, column, or diagonal directions, to one or more other processing circuits in the same row, column, or diagonal according to a predetermined two-dimensional spacing pattern. The predetermined two-dimensional spacing pattern can be associated with the number of processing circuits spaced apart in the connection. Figures 5a to 5c show exemplary topologies of various forms of two-dimensional arrays formed by multiple processing circuits.
[0057] As shown in Figure 5a, five processing circuits (each represented by a box) are connected to form a simple two-dimensional array. Specifically, with one processing circuit as the center of the two-dimensional array, one processing circuit is connected in each of the four horizontal and vertical directions relative to that processing circuit, thus forming a two-dimensional array with three rows and three columns. Furthermore, since the processing circuit at the center is directly connected to the adjacent processing circuits in the preceding and following columns of the same row, and to the adjacent processing circuits in the preceding and following rows of the same column, the number of intervening processing circuits (referred to as the "interval number") is 0.
[0058] As shown in Figure 5b, four rows and four columns of processing circuits can be connected to form a two-dimensional Torus array. Each processing circuit is connected to the adjacent processing circuits in its preceding and following rows and columns, so the interval number between adjacent processing circuits is 0. Furthermore, the first processing circuit in each row or column of this two-dimensional Torus array is also connected to the last processing circuit in that row or column, and the interval number between these first and last processing circuits is 2.
[0059] As shown in Figure 5c, four rows and four columns of processing circuits can also be connected to form a two-dimensional array in which the interval number between adjacent processing circuits is 0 and the interval number between non-adjacent processing circuits is 1. Specifically, in this two-dimensional array, adjacent processing circuits in the same row or column are directly connected (interval number 0), while non-adjacent processing circuits in the same row or column are connected to processing circuits at an interval number of 1. It can be seen that when multiple processing circuits are connected to form a two-dimensional array, processing circuits in the same row or column can be connected with different interval numbers, as shown in Figures 5b and 5c. Similarly, in some scenarios, processing circuits can be connected with different interval numbers in the diagonal direction.
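The interval-number convention used for Figures 5b and 5c can be illustrated with a short sketch. This is not taken from the disclosure: the function name `links_with_intervals` is hypothetical, and the code assumes that every pair of circuits in a row whose interval number is in the allowed set is connected.

```python
from itertools import combinations

def links_with_intervals(n, intervals):
    """Return index pairs (i, j), i < j, in a line of n circuits whose
    interval number (circuits skipped between them, j - i - 1) is allowed."""
    return [(i, j) for i, j in combinations(range(n), 2)
            if j - i - 1 in intervals]

# One row of a 4x4 Torus array: adjacent links plus the first-to-last link.
torus_row = links_with_intervals(4, {0, 2})
# One row with interval numbers 0 and 1: adjacent links plus one-circuit skips.
skip_row = links_with_intervals(4, {0, 1})

print(torus_row)  # [(0, 1), (0, 3), (1, 2), (2, 3)]
print(skip_row)   # [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
```

The same function applies to a column, since rows and columns follow the same spacing pattern in these examples.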
[0060] As shown in Figure 5d, four two-dimensional Torus arrays of the kind shown in Figure 5b can be arranged as four layers at predetermined intervals and connected to form a three-dimensional Torus array. This three-dimensional Torus array builds on the two-dimensional Torus array and connects layers using an interval pattern similar to that used between rows and columns. For example, processing circuits in adjacent layers that share the same row and column are directly connected, i.e., the interval number is 0. In addition, processing circuits in the first and last layers that share the same row and column are connected, i.e., the interval number is 2. The result is a three-dimensional Torus array with four layers, four rows, and four columns.
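As a minimal sketch of the three-dimensional Torus connectivity described above, the neighbours of a circuit can be computed with modular arithmetic. The names `torus_neighbors` and `interval_number` and the zero-based (layer, row, column) coordinates are assumptions made for illustration only.

```python
def torus_neighbors(pos, shape=(4, 4, 4)):
    """Neighbours of pos = (layer, row, col) in a 3D Torus of the given shape."""
    result = []
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(pos)
            # The modulo wraps the first index onto the last and vice versa.
            nxt[axis] = (pos[axis] + step) % shape[axis]
            result.append(tuple(nxt))
    return result

def interval_number(a, b):
    """Circuits skipped between two positions that differ along one axis."""
    return sum(abs(x - y) for x, y in zip(a, b)) - 1

corner = (0, 0, 0)
print(torus_neighbors(corner))
print(interval_number(corner, (1, 0, 0)))  # adjacent layer: 0
print(interval_number(corner, (3, 0, 0)))  # first-to-last layer: 2
```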
[0061] Through the examples above, those skilled in the art will understand that the connection relationships of other multidimensional arrays of processing circuits can be formed on the basis of a two-dimensional array by adding new dimensions and increasing the number of processing circuits. In some application scenarios, the solutions disclosed herein can also configure logical connections of processing circuits using configuration instructions. In other words, although there may be hardwired connections between processing circuits, the solutions disclosed herein can also selectively connect some processing circuits or selectively bypass some processing circuits through configuration instructions to form one or more logical connections. In some embodiments, the aforementioned logical connections can also be adjusted according to the actual computational needs (e.g., data type conversion). Furthermore, for different computing scenarios, the solutions disclosed herein can configure the connections of processing circuits, including, for example, configuring them as matrices or as one or more closed computational loops.
[0062] Figures 6a, 6b, 6c, and 6d are schematic diagrams illustrating various additional connection relationships of multiple processing circuits according to embodiments of the present disclosure. As can be seen from the figures, Figures 6a to 6d show further exemplary connection relationships of the multi-dimensional arrays formed by multiple processing circuits as shown in Figures 5a to 5d. In view of this, the technical details described in conjunction with Figures 5a to 5d also apply to the content shown in Figures 6a to 6d.
[0063] As shown in Figure 6a, the processing circuits of the two-dimensional array include a central processing circuit located at the center of the array and three processing circuits connected to the central processing circuit in each of the four directions along its row and column. Therefore, the interval numbers between the central processing circuit and the remaining processing circuits are 0, 1, and 2, respectively. As shown in Figure 6b, the processing circuits of the two-dimensional array include a central processing circuit located at the center of the array, three processing circuits in each of the two opposite directions along the row of the central processing circuit, and one processing circuit in each of the two opposite directions along the column of the central processing circuit. Therefore, the interval numbers between the central processing circuit and the processing circuits in the same row are 0 and 2, respectively, and the interval numbers between the central processing circuit and the processing circuits in the same column are both 0.
[0064] As shown in Figure 5d above, the multi-dimensional array formed by multiple processing circuits can be a three-dimensional array composed of multiple layers. Each layer of the three-dimensional array can include a two-dimensional array of multiple processing circuits arranged along its row and column directions. Further, a processing circuit in the three-dimensional array can be connected, in at least one of the row, column, diagonal, and layer directions, to one or more other processing circuits in the same row, the same column, the same diagonal, or a different layer according to a predetermined three-dimensional spacing pattern. Further, the predetermined three-dimensional spacing pattern and the number of processing circuits spaced apart in the connection can be related to the number of layers. The connection of the three-dimensional array is further described below in conjunction with Figures 6c and 6d.
[0065] Figure 6c shows a multi-layer, multi-row, multi-column three-dimensional array formed by connecting multiple processing circuits. Take the processing circuit located in the l-th layer, r-th row, and c-th column (denoted (l, r, c)) as an example. It is positioned at the center of the array and is connected to the processing circuits in the preceding column (l, r, c-1) and following column (l, r, c+1) of the same layer and row, to the processing circuits in the preceding row (l, r-1, c) and following row (l, r+1, c) of the same layer and column, and to the processing circuits in the preceding layer (l-1, r, c) and following layer (l+1, r, c) of the same row and column. The interval number between the processing circuit at (l, r, c) and these processing circuits in the row, column, and layer directions is 0.
[0066] Figure 6d shows a three-dimensional array in which the interval number between connected processing circuits is 1 in each of the row, column, and layer directions. Taking the processing circuit located at the center (l, r, c) of the array as an example, it is connected to the processing circuits at (l, r, c-2) and (l, r, c+2), which lie in the same layer and row and are separated from it by one column, and to the processing circuits at (l, r-2, c) and (l, r+2, c), which lie in the same layer and column and are separated from it by one row. It is also connected to the processing circuits at (l-2, r, c) and (l+2, r, c), which lie in the same row and column and are separated from it by one layer. Similarly, among the remaining processing circuits, those at (l, r, c-3) and (l, r, c-1), in the same layer and row and separated by one column, are connected to each other, as are those at (l, r, c+1) and (l, r, c+3). The processing circuits at (l, r-3, c) and (l, r-1, c), in the same layer and column and separated by one row, are connected to each other, as are those at (l, r+1, c) and (l, r+3, c). Finally, the processing circuits at (l-3, r, c) and (l-1, r, c), in the same row and column and separated by one layer, are connected to each other, as are those at (l+1, r, c) and (l+3, r, c).
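The coordinate pattern in the two three-dimensional examples follows one rule: an interval number of k corresponds to an offset of k + 1 along the chosen direction. A minimal sketch, with the hypothetical helper `strided_neighbors` and no bounds checking against the array size:

```python
def strided_neighbors(pos, interval):
    """Positions linked to pos = (l, r, c) when `interval` circuits are
    skipped along each of the layer, row, and column directions."""
    step = interval + 1  # interval 0 -> offset 1, interval 1 -> offset 2
    out = []
    for axis in range(3):
        for sign in (-1, 1):
            nxt = list(pos)
            nxt[axis] += sign * step
            out.append(tuple(nxt))
    return out

center = (5, 5, 5)  # an arbitrary (l, r, c) chosen for the example
print(strided_neighbors(center, 0))  # offsets of 1, as for Figure 6c
print(strided_neighbors(center, 1))  # offsets of 2, as for Figure 6d
```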
[0067] The above text provides an exemplary description of the connection relationship of a multidimensional array formed by multiple processing circuits. The following text will further illustrate the different loop structures formed by multiple processing circuits in conjunction with Figures 7 and 8.
[0068] Figures 7a, 7b, 7c, and 7d are schematic diagrams illustrating various loop structures of the processing circuits according to embodiments of this disclosure. Depending on the application scenario, the multiple processing circuits can be connected not only physically but also logically, based on received parsed instructions. The multiple processing circuits can be configured to form a closed loop using these logical connections.
[0069] As shown in Figure 7a, four adjacent processing circuits are numbered sequentially "0, 1, 2, and 3". Starting with processing circuit 0, the four processing circuits are connected sequentially in a clockwise direction, and processing circuit 3 is connected back to processing circuit 0, so that the four processing circuits are connected in series to form a closed loop (referred to as a "loop"). In this loop, the interval number between processing circuits is 0 or 2; for example, the interval number between processing circuits 0 and 1 is 0, while the interval number between processing circuits 3 and 0 is 2. Furthermore, the physical addresses (also referred to as physical coordinates in the context of this disclosure) of the four processing circuits in the loop can be represented as 0-1-2-3, and their logical addresses (also referred to as logical coordinates in the context of this disclosure) can likewise be represented as 0-1-2-3. It should be noted that the connection order shown in Figure 7a is merely exemplary and not restrictive. Depending on actual computational needs, those skilled in the art may also connect the four processing circuits in series in a counterclockwise direction to form a closed loop.
[0070] In some practical scenarios, when the data bit width supported by a single processing circuit cannot meet the bit width requirements of the data to be processed, multiple processing circuits can be combined into a processing circuit group to represent a single data. For example, suppose a processing circuit can process 8-bit data. When 32-bit data needs to be processed, four processing circuits can be combined into a processing circuit group that joins four 8-bit data into one 32-bit data. Furthermore, such a processing circuit group formed by four 8-bit processing circuits can serve as one processing circuit 104 shown in Figure 7b, thereby supporting operations of a higher bit width.
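The bit-width combination can be sketched as splitting a wide value into narrow lanes and joining them again. The function names and the least-significant-lane-first ordering are assumptions for illustration; the disclosure does not specify how lanes are ordered within a group.

```python
def split_to_lanes(value, lanes=4, lane_bits=8):
    """Split an unsigned value into `lanes` chunks, least significant first."""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask for i in range(lanes)]

def join_lanes(chunks, lane_bits=8):
    """Reassemble chunks (least significant first) into one value."""
    value = 0
    for i, chunk in enumerate(chunks):
        value |= chunk << (i * lane_bits)
    return value

lanes = split_to_lanes(0x12345678)   # one 8-bit chunk per processing circuit
print([hex(x) for x in lanes])       # ['0x78', '0x56', '0x34', '0x12']
print(hex(join_lanes(lanes)))        # 0x12345678
```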
[0071] As can be seen from Figure 7b, the layout of the processing circuits shown is similar to that shown in Figure 7a, but the interval numbers between the processing circuits in Figure 7b differ from those in Figure 7a. Figure 7b shows four processing circuits numbered 0, 1, 2, and 3 connected sequentially in a clockwise direction, starting with processing circuit 0 and followed by processing circuit 1, processing circuit 3, and processing circuit 2. Processing circuit 2 is also connected to processing circuit 0, so that the circuits are connected in series to form a closed loop. As this loop shows, the interval number between the processing circuits in Figure 7b is either 0 or 1; for example, the interval number between processing circuits 0 and 1 is 0, while the interval number between processing circuits 1 and 3 is 1. Furthermore, the physical addresses of the four processing circuits in the closed loop shown can be 0-1-2-3, while the logical addresses, following the loop order shown, can be represented as 0-1-3-2. Therefore, when high-bit-width data needs to be split and allocated to different processing circuits, the data can be reordered and allocated according to the logical addresses of the processing circuits.
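The interval numbers quoted for the two four-circuit loops can be checked with a short sketch. The helper `loop_intervals` is hypothetical; it takes the physical addresses in the order the loop visits them and assumes the circuits sit in a line in physical-address order.

```python
def loop_intervals(order):
    """Interval number for each hop of a closed loop, given the physical
    addresses of the circuits in the order the loop visits them."""
    hops = zip(order, order[1:] + order[:1])  # the last circuit links back to the first
    return [abs(a - b) - 1 for a, b in hops]

loop_7a = loop_intervals([0, 1, 2, 3])
loop_7b = loop_intervals([0, 1, 3, 2])
print(loop_7a)  # [0, 0, 0, 2]
print(loop_7b)  # [0, 1, 0, 1]
```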
[0072] The above-mentioned splitting and rearranging operations can be performed by the pre-operation circuit described in conjunction with Figure 3. Specifically, this pre-operation circuit can rearrange the input data according to the physical and logical addresses of the multiple processing circuits to meet the requirements of the data operation. Assume that four sequentially arranged processing circuits 0 to 3 are connected as shown in Figure 7a. Since both their physical and logical addresses are 0-1-2-3, the pre-operation circuit can transmit the input data (e.g., pixel data) aa0, aa1, aa2, and aa3 in sequence to the corresponding processing circuits. However, when the four processing circuits are connected as shown in Figure 7b, their physical addresses remain 0-1-2-3 while their logical addresses become 0-1-3-2. In this case, the pre-operation circuit needs to rearrange the input data aa0, aa1, aa2, and aa3 into aa0-aa1-aa3-aa2 before transmitting it to the corresponding processing circuits. Through this rearrangement of the input data, the scheme of the present disclosure can guarantee the correctness of the data operation order. Similarly, if the order of the four operation output results (e.g., pixel data) obtained above is bb0-bb1-bb3-bb2, the post-processing circuit described in conjunction with Figure 2 can restore the order of the results to bb0-bb1-bb2-bb3, ensuring that the input data and the output result data are arranged consistently.
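The rearrangement before the operation and the restoration after it amount to applying a permutation and its inverse. A minimal sketch under that reading, where `rearrange` and `restore` are hypothetical names and the logical order 0-1-3-2 is the one given for Figure 7b:

```python
def rearrange(data, logical_order):
    """Reorder data so that position i carries data[logical_order[i]]."""
    return [data[k] for k in logical_order]

def restore(results, logical_order):
    """Undo rearrange(): put each result back at its original index."""
    out = [None] * len(results)
    for position, k in enumerate(logical_order):
        out[k] = results[position]
    return out

order_7b = [0, 1, 3, 2]
sent = rearrange(["aa0", "aa1", "aa2", "aa3"], order_7b)
back = restore(["bb0", "bb1", "bb3", "bb2"], order_7b)
print(sent)  # ['aa0', 'aa1', 'aa3', 'aa2']
print(back)  # ['bb0', 'bb1', 'bb2', 'bb3']
```

With the logical order 0-1-2-3 of Figure 7a, both functions return their input unchanged.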
[0073] Figures 7c and 7d show more processing circuits arranged and connected in different ways to form closed loops. As shown in Figure 7c, 16 processing circuits 104, numbered sequentially from 0 to 15, are connected and combined in pairs, starting with processing circuit 0, to form processing circuit groups (i.e., the processing circuit subarrays disclosed herein). For example, as shown, processing circuit 0 is connected to processing circuit 1 to form one processing circuit group, and so on, until processing circuit 14 is connected to processing circuit 15 to form a processing circuit group, ultimately forming eight processing circuit groups. Furthermore, these eight processing circuit groups can also be connected in a manner similar to the processing circuit connections described above, including being connected according to, for example, predetermined logical addresses to form a closed loop of processing circuit groups.
[0074] As shown in Figure 7d, multiple processing circuits 104 are connected in an irregular or non-uniform manner to form a processing circuit matrix with closed loops. Specifically, Figure 7d shows that the processing circuits can form closed loops with interval numbers of 0 or 3. For example, processing circuit 0 can be connected to processing circuit 1 (interval number 0) and to processing circuit 4 (interval number 3).
[0075] From the above description in conjunction with Figures 7a, 7b, 7c, and 7d, it can be seen that the processing circuits disclosed herein can be connected with different interval numbers to form closed loops. When the total number of processing circuits changes, any interval number can be selected and dynamically configured to form a closed loop. Multiple processing circuits can also be combined into a processing circuit group and connected to form a closed loop of processing circuit groups. Furthermore, the connection of multiple processing circuits can be a hard connection in hardware or a soft connection configured in software.
[0076] Figures 8a, 8b, and 8c are schematic diagrams illustrating various other loop structures of the processing circuits according to embodiments of this disclosure. As shown in conjunction with Figure 6, multiple processing circuits can form a closed loop, and each processing circuit in the closed loop can be configured with its own logical address. Further, the pre-operation circuit described in conjunction with Figure 2 can be configured to split the operation data according to the type of the operation data (e.g., 32-bit data, 16-bit data, or 8-bit data) and the logical addresses, and to pass the multiple sub-data obtained by splitting to the corresponding processing circuits in the loop for subsequent operation.
[0077] The upper diagram of Figure 8a shows four processing circuits connected to form a closed loop, and the physical addresses of these four processing circuits, from right to left, can be represented as 0-1-2-3. The lower diagram of Figure 8a shows the logical addresses of the four processing circuits in the loop, from right to left, represented as 0-3-1-2. For example, the processing circuit with logical address "3" in the lower diagram of Figure 8a has physical address "1" in the upper diagram of Figure 8a.
[0078] In some application scenarios, assume that the granularity of the operated data is the lower 128 bits of the input data, such as the original sequence "15, 14, ..., 2, 1, 0" in the figure (each number corresponds to 8 bits of data), and that the logical addresses of these 16 8-bit data are numbered from 0 to 15 in order from low to high. Further, according to the logical addresses shown in the lower diagram of Figure 8a, the pre-operation circuit can encode or arrange the data using different logical addresses for different data types.
[0079] When the data width operated by the processing circuit is 32 bits, the four numbers (3,2,1,0), (7,6,5,4), (11,10,9,8), and (15,14,13,12) can represent the 0th to 3rd 32-bit data respectively. The pre-operation circuit can transfer the 0th 32-bit data to the processing circuit with logical address "0" (corresponding to physical address "0"), the 1st 32-bit data to the processing circuit with logical address "1" (corresponding to physical address "2"), the 2nd 32-bit data to the processing circuit with logical address "2" (corresponding to physical address "3"), and the 3rd 32-bit data to the processing circuit with logical address "3" (corresponding to physical address "1"). This data rearrangement is used to meet the subsequent computational needs of the processing circuit. Therefore, the final mapping relationship between the logical address and the physical address of the data is (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0)->(11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0).
[0080] When the data width operated by the processing circuit is 16 bits, the eight numbers (1,0), (3,2), (5,4), (7,6), (9,8), (11,10), (13,12), and (15,14) can represent the 0th to 7th 16-bit data respectively. The pre-operation circuit can transfer the 0th and 4th 16-bit data to the processing circuit with logical address "0" (corresponding to physical address "0"), the 1st and 5th 16-bit data to the processing circuit with logical address "1" (corresponding to physical address "2"), the 2nd and 6th 16-bit data to the processing circuit with logical address "2" (corresponding to physical address "3"), and the 3rd and 7th 16-bit data to the processing circuit with logical address "3" (corresponding to physical address "1"). Therefore, the final mapping relationship between the logical address and the physical address of the data is: (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0)->(13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0).
[0081] When the data width operated on by the processing circuit is 8 bits, the 16 numbers with logical addresses 0 to 15 can represent the 0th to 15th 8-bit data respectively. According to the connection shown in Figure 8a, the pre-operation circuit can transmit the 0th, 4th, 8th, and 12th 8-bit data to the processing circuit with logical address "0" (corresponding to physical address "0"); the 1st, 5th, 9th, and 13th 8-bit data to the processing circuit with logical address "1" (corresponding to physical address "2"); the 2nd, 6th, 10th, and 14th 8-bit data to the processing circuit with logical address "2" (corresponding to physical address "3"); and the 3rd, 7th, 11th, and 15th 8-bit data to the processing circuit with logical address "3" (corresponding to physical address "1"). Therefore, the final mapping relationship between the logical address and the physical address of the data is: (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0)->(14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0).
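The arrangement rule described for Figure 8a at 32-bit, 16-bit, and 8-bit widths can be sketched as one function: data element e goes to the circuit whose logical address is e modulo the number of circuits. The names `arrange`, `flatten_high_to_low`, and `LOGICAL_8A` are hypothetical, and the code models only the byte numbering, not the hardware.

```python
LOGICAL_8A = [0, 3, 1, 2]   # logical address found at physical addresses 0..3

def arrange(num_bytes, logical_at_physical, width_bytes):
    """Return, per physical address, the byte numbers held by that circuit,
    written high to low as in the text."""
    n = len(logical_at_physical)
    elements = num_bytes // width_bytes   # number of data elements
    per_circuit = []
    for logical in logical_at_physical:
        held = []
        # Element e is sent to the circuit with logical address e mod n.
        for e in range(logical, elements, n):
            chunk = list(range((e + 1) * width_bytes - 1, e * width_bytes - 1, -1))
            held = chunk + held           # later elements sit in higher positions
        per_circuit.append(held)
    return per_circuit

def flatten_high_to_low(per_circuit):
    """List bytes from the highest physical address down to address 0."""
    return [b for held in reversed(per_circuit) for b in held]

for width in (4, 2, 1):                   # 32-bit, 16-bit, 8-bit data
    print(width * 8, flatten_high_to_low(arrange(16, LOGICAL_8A, width)))
```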
[0082] The upper diagram of Figure 8b shows eight processing circuits numbered sequentially from 0 to 7 connected to form a closed loop, and the physical addresses of these eight processing circuits are 0-1-2-3-4-5-6-7. The lower diagram of Figure 8b shows the logical addresses of these eight processing circuits as 0-7-1-6-2-5-3-4. For example, the processing circuit with physical address "6" in the upper diagram of Figure 8b has logical address "3" in the lower diagram of Figure 8b.
[0083] The operations with which the pre-operation circuit rearranges data of different data types before transmitting it to the corresponding processing circuits are similar in Figure 8b to those in Figure 8a. The technical solution described in conjunction with Figure 8a therefore also applies to Figure 8b, and the data rearrangement process is not repeated here. Furthermore, the connection relationship of the processing circuits shown in Figure 8b is similar to that shown in Figure 8a, but Figure 8b shows eight processing circuits, twice the number shown in Figure 8a. Therefore, in application scenarios that operate on different data types, the granularity of the operated data described in conjunction with Figure 8b can be twice that described in conjunction with Figure 8a. Instead of the lower 128 bits of the input data in the previous example, the granularity of the operated data in this example can be the lower 256 bits of the input data, such as the original data sequence "31, 30, ..., 2, 1, 0" shown in the figure, where each number corresponds to 8 bits of data.
[0084] For the original data sequence described above, the diagram also shows the data arrangement in the ring-shaped processing circuits when the data bit widths operated by the processing circuits are 32 bits, 16 bits, and 8 bits, respectively. For example, when the data bit width is 32 bits, the 32-bit data in the processing circuit with logical address "1" is (7, 6, 5, 4), and the physical address of this processing circuit is "2". When the data bit width is 16 bits, the two 16-bit data in the processing circuit with logical address "3" are (23, 22, 7, 6), and the physical address of this processing circuit is "6". When the data bit width is 8 bits, the four 8-bit data in the processing circuit with logical address "6" are (30, 22, 14, 6), and the physical address of this processing circuit is "3".
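The three examples quoted for the eight-circuit loop can be reproduced per circuit. This sketch assumes the same distribution rule as in the four-circuit case (data element e goes to logical address e modulo the number of circuits); `physical_of` and `bytes_at` are hypothetical helper names.

```python
LOGICAL_8B = [0, 7, 1, 6, 2, 5, 3, 4]   # logical address at physical addresses 0..7

def physical_of(logical, logical_at_physical):
    """Physical address of the circuit that carries a given logical address."""
    return logical_at_physical.index(logical)

def bytes_at(logical, n_circuits, num_bytes, width_bytes):
    """Byte numbers (high to low) delivered to one logical address."""
    elements = num_bytes // width_bytes
    held = []
    for e in range(logical, elements, n_circuits):
        chunk = list(range((e + 1) * width_bytes - 1, e * width_bytes - 1, -1))
        held = chunk + held
    return held

# 256 bits of input data are 32 bytes, numbered 0..31.
print(bytes_at(1, 8, 32, 4), physical_of(1, LOGICAL_8B))  # [7, 6, 5, 4] 2
print(bytes_at(3, 8, 32, 2), physical_of(3, LOGICAL_8B))  # [23, 22, 7, 6] 6
print(bytes_at(6, 8, 32, 1), physical_of(6, LOGICAL_8B))  # [30, 22, 14, 6] 3
```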
[0085] The above describes, in conjunction with Figures 8a and 8b, data operations for different data types when multiple processing circuits of a single type (such as the first type of processing circuit shown in Figure 3) are connected to form a closed loop. The following further describes, in conjunction with Figure 8c, data operations for different data types when multiple processing circuits of different types (such as the first type and second type of processing circuits shown in Figure 4) are connected to form a closed loop.
[0086] The upper diagram of Figure 8c shows twenty processing circuits of multiple types, numbered sequentially 0, 1, ..., 19, connected to form a closed loop (the numbers shown in the diagram represent the physical addresses of the processing circuits). The sixteen processing circuits numbered 0 to 15 are of the first type (i.e., they form a processing circuit subarray disclosed herein), and the four processing circuits numbered 16 to 19 are of the second type (i.e., they also form a processing circuit subarray disclosed herein). Similarly, the physical address of each of these twenty processing circuits has a mapping relationship with the logical address of the corresponding processing circuit shown in the lower diagram of Figure 8c.
[0087] Furthermore, for operations on different data types, taking the original sequence of 80 8-bit data shown in the figure as an example, Figure 8c also shows the results of operating on this original data for the different data types supported by the processing circuits. For example, when the data width is 32 bits, the one 32-bit data in the processing circuit with logical address "1" is (7,6,5,4), and the corresponding physical address of this processing circuit is "2". When the data width is 16 bits, the two 16-bit data in the processing circuit with logical address "11" are (63,62,23,22), and the corresponding physical address of this processing circuit is "9". When the data width is 8 bits, the four 8-bit data in the processing circuit with logical address "17" are (77,57,37,17), and the corresponding physical address of this processing circuit is "18".
[0088] Figures 9a, 9b, 9c, and 9d are schematic diagrams illustrating data concatenation operations performed by the pre-operation circuit according to embodiments of this disclosure. As previously mentioned, the pre-operation circuit described in conjunction with Figure 2 can also be configured to select one data concatenation pattern from multiple data concatenation patterns according to the parsed instruction, and to perform a concatenation operation on two input data. Regarding the multiple data concatenation patterns, in one embodiment, the scheme of this disclosure divides and numbers the two data to be concatenated according to the smallest data unit, and then extracts different smallest data units of the data according to specified rules to form different data concatenation patterns. For example, units can be extracted and placed alternately based on the parity of their numbers, or based on whether their numbers are integer multiples of a specified number. Depending on the computing scenario (e.g., the data bit width), the smallest data unit here can simply be 1 bit of data, or can be 2, 4, 8, 16, or 32 bits in length. Furthermore, when extracting the differently numbered portions of the two data, the scheme of this disclosure can alternate either one smallest data unit at a time or several smallest data units at a time. For example, it can alternately extract two or three smallest data units at a time from the two data as a group, and then concatenate the data group by group.
[0089] Based on the above description of the data concatenation patterns, the data concatenation patterns disclosed herein are illustrated below with specific examples in conjunction with Figures 9a to 9c. In the figures, the input data are In1 and In2. With each square in the figures representing one smallest data unit, both input data have a width of 8 smallest data units. As mentioned earlier, for data with different bit widths, the smallest data unit can represent different numbers of bits. For example, for 8-bit data the smallest data unit represents 1 bit, for 16-bit data it represents 2 bits, and for 32-bit data it represents 4 bits.
[0090] As shown in Figure 9a, the two input data In1 and In2 to be concatenated each consist of eight smallest data units numbered 1, 2, ..., 8 from right to left. Concatenation follows an interleaving principle: units are taken in ascending order of their numbers, In1 before In2, and odd numbers before even numbers. Specifically, when the data width is 8 bits, In1 and In2 each represent one 8-bit data, and each smallest data unit represents 1 bit of data (i.e., one square represents 1 bit). According to the data width and the above principle, the smallest data units numbered 1, 3, 5, and 7 of In1 are first extracted and arranged sequentially in the lower bits. Next, the four odd-numbered smallest data units of In2 are arranged sequentially. Similarly, the smallest data units numbered 2, 4, 6, and 8 of In1 and the four even-numbered smallest data units of In2 are then arranged sequentially. Finally, the 16 smallest data units are concatenated to form one 16-bit or two 8-bit new data, as shown in the second row of squares in Figure 9a.
[0091] As shown in Figure 9b, when the data width is 16 bits, In1 and In2 each represent one 16-bit data. In this case, each smallest data unit represents 2 bits of data (i.e., one square represents 2 bits). According to the data width and the above interleaving principle, the smallest data units numbered 1, 2, 5, and 6 of In1 are first extracted and arranged sequentially in the lower bits. Then the smallest data units numbered 1, 2, 5, and 6 of In2 are arranged sequentially. Similarly, the smallest data units numbered 3, 4, 7, and 8 of In1 and the identically numbered units of In2 are then arranged sequentially. The resulting 16 smallest data units are concatenated to form one 32-bit or two 16-bit new data, as shown in the second row of squares in Figure 9b.
[0092] As shown in Figure 9c, when the data width is 32 bits, In1 and In2 each represent one 32-bit data, and each smallest data unit represents 4 bits of data (i.e., one square represents 4 bits). According to the data width and the above interleaving principle, the smallest data units numbered 1, 2, 3, and 4 of In1 and the identically numbered units of In2 are first extracted and arranged sequentially in the lower bits. Then the smallest data units numbered 5, 6, 7, and 8 of In1 and the identically numbered units of In2 are extracted and arranged sequentially. The resulting 16 smallest data units are concatenated to form one 64-bit or two 32-bit new data.
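The three interleaving examples differ only in how many smallest data units are taken together: one, two, or four. A minimal sketch of that rule, where the function name `splice` and the unit labels `a1..a8` and `b1..b8` are hypothetical and lists are ordered from the lowest unit upward:

```python
def splice(in1, in2, group):
    """Interleave two equal-length unit lists, lowest units first.

    Units are taken in groups of `group`. The 1st, 3rd, ... groups of In1 and
    then of In2 are emitted first, followed by the 2nd, 4th, ... groups."""
    def groups(units, start):
        out = []
        for g in range(start, len(units) // group, 2):
            out.extend(units[g * group:(g + 1) * group])
        return out
    return groups(in1, 0) + groups(in2, 0) + groups(in1, 1) + groups(in2, 1)

in1 = [f"a{i}" for i in range(1, 9)]   # units 1..8 of In1
in2 = [f"b{i}" for i in range(1, 9)]   # units 1..8 of In2

print(splice(in1, in2, 1))  # a1 a3 a5 a7, b1 b3 b5 b7, a2 a4 a6 a8, b2 b4 b6 b8
print(splice(in1, in2, 2))  # a1 a2 a5 a6, b1 b2 b5 b6, a3 a4 a7 a8, b3 b4 b7 b8
print(splice(in1, in2, 4))  # a1..a4, b1..b4, a5..a8, b5..b8
```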
[0093] Exemplary data concatenation methods of this disclosure have been described above in conjunction with Figures 9a to 9c. However, it is understood that in some computing scenarios, data concatenation does not involve the interleaving described above, but simply arranges the two data one after the other while keeping their original unit order, as shown in Figure 9d. As can be seen from Figure 9d, the two data In1 and In2 are not interleaved as shown in Figures 9a to 9c. Instead, the last smallest data unit of In1 is simply joined to the first smallest data unit of In2, thereby obtaining new data with an increased (e.g., doubled) bit width. In some scenarios, the scheme disclosed herein can also perform grouped concatenation based on data attributes. For example, neuron data or weight data belonging to the same feature map can be grouped together and then arranged to form a continuous part of the concatenated data.
[0094] Figures 10a, 10b, and 10c are schematic diagrams illustrating data compression operations performed by a post-processing circuit according to embodiments of this disclosure. The compression operation may include filtering data using a mask or compressing data by comparing the data values against a given threshold. For the data compression operation, the data may be divided and numbered by smallest data unit as described above. As in Figures 9a-9d, the smallest data unit can be, for example, 1 bit or 1 byte of data, or data with a length of 2 bits, 4 bits, 8 bits, 16 bits, or 32 bits. Different data compression modes are described below by way of example in conjunction with Figures 10a to 10c.
[0095] As shown in Figure 10a, the original data consists of eight squares (i.e., eight smallest data units) numbered sequentially from right to left as 1, 2, ..., 8, assuming each smallest data unit can represent 1 bit of data. When performing data compression based on a mask, the post-processing circuit can use the mask to filter the original data. In one embodiment, the bit width of the mask corresponds to the number of smallest data units in the original data. For example, if the aforementioned original data has 8 smallest data units, then the mask bit width is 8 bits, the smallest data unit numbered 1 corresponds to the least significant bit of the mask, the smallest data unit numbered 2 corresponds to the second least significant bit, and so on, with the smallest data unit numbered 8 corresponding to the most significant bit. In one application scenario, when the 8-bit mask is "10010011", the compression principle can be set to extract the smallest data units of the original data that correspond to mask bits with the value "1". Here, the smallest data units corresponding to a mask value of "1" are numbered 1, 2, 5, and 8. Therefore, the smallest data units numbered 1, 2, 5, and 8 can be extracted and arranged in ascending order of their numbers to form the compressed new data, as shown in the second row of Figure 10a.
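The mask-based filtering above can be illustrated with a short Python sketch. It is a software model under stated assumptions, not the post-processing circuit itself: the function name `compress_by_mask` and the list representation (index 0 holding the smallest data unit numbered 1) are chosen for the example only.

```python
def compress_by_mask(units, mask):
    """Keep the units whose mask bit is 1, in ascending unit number.

    units[0] is the smallest data unit numbered 1, and bit 0 (the least
    significant bit) of `mask` corresponds to it.
    """
    if mask >> len(units):
        raise ValueError("mask is wider than the number of units")
    return [u for i, u in enumerate(units) if (mask >> i) & 1]


# Use the unit numbers themselves as payload so the selection is visible.
unit_numbers = [1, 2, 3, 4, 5, 6, 7, 8]
kept = compress_by_mask(unit_numbers, 0b10010011)
```

For the mask "10010011" of the example, `kept` holds the units numbered 1, 2, 5, and 8.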
[0096] Figure 10b shows original data similar to that of Figure 10a. As can be seen from the second row of Figure 10b, the data sequence after passing through the post-processing circuit maintains its original arrangement order and content. Therefore, it can be understood that the data compression disclosed herein may also include a disabled mode or an uncompressed mode, so that no compression operation is performed when the data passes through the post-processing circuit.
[0097] As shown in Figure 10c, the original data consists of eight squares arranged sequentially. The number above each square represents its index, numbered 1, 2, ..., 8 from right to left, and it is assumed that each smallest data unit is 8 bits. Furthermore, the number in each square represents the decimal value of that smallest data unit. Taking the smallest data unit numbered 1 as an example, its decimal value is "8", and the corresponding 8-bit data is "00001000". When performing data compression based on a threshold, assuming the threshold is the decimal number "8", the compression principle can be set to extract all smallest data units in the original data that are greater than or equal to the threshold "8". Thus, the smallest data units numbered 1, 4, 7, and 8 can be extracted. Then, all the extracted smallest data units are arranged in ascending order of their numbers to obtain the final data result, as shown in the second row of Figure 10c.
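A minimal Python sketch of the threshold-based mode follows. The function name and list layout are assumptions for illustration. The sample values are also hypothetical, since the text states only that unit 1 holds the value 8 and that units 1, 4, 7, and 8 pass the threshold; the remaining values were chosen to reproduce that outcome.

```python
def compress_by_threshold(units, threshold):
    """Return (unit number, value) pairs for units with value >= threshold.

    units[0] is the smallest data unit numbered 1. Results are kept in
    ascending order of unit number.
    """
    return [(i + 1, v) for i, v in enumerate(units) if v >= threshold]


# Hypothetical 8-bit values for units 1..8 (only unit 1 = 8 comes from the text).
values = [8, 3, 5, 9, 0, 7, 12, 10]
selected = compress_by_threshold(values, threshold=8)
selected_numbers = [number for number, _ in selected]
```

With these sample values, `selected_numbers` lists the units numbered 1, 4, 7, and 8.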
[0098] Figure 11 is a simplified flowchart illustrating a method 1100 for performing computational operations using a computing device according to an embodiment of this disclosure. Based on the foregoing description, it can be understood that the computing device here may be the computing device described in conjunction with Figures 1-4, which has the processing circuit connections shown in Figures 5-10 and supports various additional operations.
[0099] As shown in Figure 11, in step 1110, method 1100 receives a computation instruction at the computing device and parses it to obtain multiple arithmetic instructions. Then, in step 1120, in response to receiving the multiple arithmetic instructions, method 1100 performs multi-threaded computation using the multiple processing circuit subarrays, wherein each of the multiple processing circuit subarrays is configured to execute at least one of the multiple arithmetic instructions.
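The two steps of method 1100 can be modeled loosely in software. The sketch below is an analogy only: the opcode name `ADD_MUL_MAX`, the opcode table, the tuple layout of the instruction, and the use of Python threads to stand in for processing circuit subarrays are all assumptions introduced for illustration and do not come from the disclosure.

```python
import operator
from concurrent.futures import ThreadPoolExecutor

# Hypothetical opcode table: one computation instruction expands into
# several arithmetic instructions.
OPCODE_TABLE = {"ADD_MUL_MAX": ("add", "mul", "max")}
OPERATIONS = {"add": operator.add, "mul": operator.mul, "max": max}


def parse(computation_instruction):
    """Step 1110: parse one computation instruction into arithmetic instructions."""
    opcode, operand_pairs = computation_instruction
    names = OPCODE_TABLE[opcode]
    if len(names) != len(operand_pairs):
        raise ValueError("one operand pair is required per arithmetic instruction")
    return [(name, a, b) for name, (a, b) in zip(names, operand_pairs)]


def run_subarray(arithmetic_instruction):
    """Model of one subarray executing one element-wise arithmetic instruction."""
    name, a, b = arithmetic_instruction
    op = OPERATIONS[name]
    return [op(x, y) for x, y in zip(a, b)]


def execute(computation_instruction):
    """Step 1120: run each arithmetic instruction on its own 'subarray' thread."""
    instructions = parse(computation_instruction)
    with ThreadPoolExecutor(max_workers=len(instructions)) as pool:
        # pool.map returns results in the order of the submitted instructions.
        return list(pool.map(run_subarray, instructions))


results = execute(("ADD_MUL_MAX", [([1, 2], [3, 4]),
                                   ([1, 2], [3, 4]),
                                   ([1, 5], [3, 4])]))
```

Each entry of `results` corresponds to one arithmetic instruction, in the order in which the instructions were obtained from parsing.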
[0100] For the sake of brevity, the calculation method disclosed herein has been described above only in conjunction with Figure 11. Those skilled in the art, based on the content of this disclosure, will also realize that this method may include further steps, and that the execution of these steps can realize the various operations of this disclosure described above in conjunction with the preceding figures, which are not repeated here.
[0101] Figure 12 is a structural diagram illustrating a combined processing apparatus 1200 according to an embodiment of this disclosure. As shown in Figure 12, the combined processing apparatus 1200 includes a computing processing device 1202, an interface device 1204, other processing devices 1206, and a storage device 1208. Depending on the application scenario, the computing processing device may include one or more computing devices 1210, which can be configured to perform the operations described herein in conjunction with Figures 1-11.
[0102] In different embodiments, the computing processing apparatus disclosed herein can be configured to perform user-specified operations. In exemplary applications, the computing processing apparatus can be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within the computing processing apparatus can be implemented as an artificial intelligence processor core or a portion of the hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or portions of the hardware structure of artificial intelligence processor cores, the computing processing apparatus disclosed herein can be considered to have a single-core structure or a homogeneous multi-core structure.
[0103] In exemplary operation, the computing processing device disclosed herein can interact with other processing devices through an interface device to jointly complete user-specified operations. Depending on the implementation, the other processing devices disclosed herein may include one or more types of general-purpose and / or special-purpose processors, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and an artificial intelligence processor. These processors may include, but are not limited to, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs. As mentioned above, the computing processing device disclosed herein, considered on its own, can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing device and other processing devices are considered together, they can be regarded as forming a heterogeneous multi-core structure.
[0104] In one or more embodiments, the other processing device may serve as an interface between the computing processing device disclosed herein (which may be embodied as a computing device dedicated to artificial intelligence operations, such as neural network operations) and external data and control, performing basic controls including but not limited to data transfer and starting and / or stopping the computing device. In another embodiment, the other processing device may also cooperate with the computing processing device to jointly complete computational tasks.
[0105] In one or more embodiments, the interface device can be used to transfer data and control commands between a computing processing device and other processing devices. For example, the computing processing device can obtain input data from other processing devices via the interface device and write it to on-chip storage (or memory) of the computing processing device. Further, the computing processing device can obtain control commands from other processing devices via the interface device and write them to on-chip control cache of the computing processing device. Alternatively or optionally, the interface device can also read data from the storage device of the computing processing device and transmit it to other processing devices.
[0106] Additionally or optionally, the combined processing apparatus disclosed herein may further include a storage device. As shown in the figures, the storage device is connected to both the computing processing device and the other processing device. In one or more embodiments, the storage device may be used to store data from the computing processing device and / or the other processing device. For example, the data may be data that cannot be fully stored in the internal or on-chip storage of the computing processing device or other processing device.
[0107] In some embodiments, this disclosure also discloses a chip (e.g., the chip 1302 shown in Figure 13). In one implementation, the chip is a system-on-chip (SoC) that integrates one or more combined processing apparatuses as shown in Figure 12. The chip can be connected to other related components through an external interface device (e.g., the external interface device 1306 shown in Figure 13). These related components may be, for example, a camera, monitor, mouse, keyboard, network card, or Wi-Fi interface. In some applications, the chip may integrate other processing units (e.g., video codecs) and / or interface modules (e.g., DRAM interfaces). In some embodiments, this disclosure also discloses a chip package structure that includes the aforementioned chip. In some embodiments, this disclosure also discloses a board that includes the aforementioned chip package structure. The board is described in detail below in conjunction with Figure 13.
[0108] Figure 13 is a schematic diagram illustrating the structure of a board 1300 according to an embodiment of this disclosure. As shown in Figure 13, the board includes a storage device 1304 for storing data, which includes one or more storage units 1310. The storage device can be connected to, and exchange data with, the controller 1308 and the aforementioned chip 1302 via, for example, a bus. Furthermore, the board also includes an external interface device 1306, configured for data relay or switching between the chip (or a chip in a chip package) and an external device 1312 (e.g., a server or computer). For example, data to be processed can be transferred from the external device to the chip via the external interface device. As another example, the calculation results of the chip can be transmitted back to the external device via the external interface device. Depending on the application scenario, the external interface device can have different interface forms, such as a standard PCIe interface.
[0109] In one or more embodiments, the controller in the disclosed board can be configured to regulate the state of the chip. Therefore, in one application scenario, the controller may include a microcontroller (MCU) for regulating the operating state of the chip.
[0110] Based on the above description in conjunction with Figure 12 and Figure 13, those skilled in the art will understand that this disclosure also discloses an electronic device or apparatus that may include one or more of the aforementioned boards, one or more of the aforementioned chips, and / or one or more of the aforementioned combined processing apparatuses.
[0111] Depending on the application scenario, the electronic devices or apparatuses disclosed herein may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, dashcams, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, means of transport, home appliances, and / or medical devices. The means of transport include airplanes, ships, and / or vehicles; the home appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, lights, gas stoves, and range hoods; the medical devices include MRI scanners, ultrasound machines, and / or electrocardiographs. The electronic devices or apparatuses disclosed herein can also be applied in fields such as the Internet, IoT, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare. Furthermore, the electronic devices or apparatuses disclosed herein can also be used in application scenarios related to artificial intelligence, big data, and / or cloud computing, such as cloud, edge, and terminal scenarios. In one or more embodiments, the high-computing-power electronic devices or apparatuses according to the present disclosure can be applied to cloud devices (e.g., cloud servers), while the low-power electronic devices or apparatuses can be applied to terminal devices and / or edge devices (e.g., smartphones or cameras). In one or more embodiments, the hardware information of the cloud devices and the hardware information of the terminal devices and / or edge devices are compatible with each other. Based on the hardware information of the terminal devices and / or edge devices, suitable hardware resources can therefore be matched from the hardware resources of the cloud devices to simulate the hardware resources of the terminal devices and / or edge devices, so as to achieve unified management, scheduling, and collaborative work across terminal and cloud, or across cloud, edge, and terminal.
[0112] It should be noted that, for the sake of brevity, this disclosure describes some methods and their embodiments as a series of actions and combinations thereof. However, those skilled in the art will understand that the solutions disclosed herein are not limited by the order of the described actions. Therefore, based on the disclosure or teachings of this document, those skilled in the art will understand that some steps can be performed in a different order or simultaneously. Furthermore, those skilled in the art will understand that the embodiments described in this disclosure can be considered optional embodiments, that is, the actions or modules involved are not necessarily essential for the implementation of one or more solutions disclosed herein. In addition, depending on the solution, the description of some embodiments in this disclosure may have different emphases. In view of this, those skilled in the art will understand that parts not described in detail in a certain embodiment of this disclosure can also be referred to the relevant descriptions of other embodiments.
[0113] In terms of specific implementation, based on the disclosure and teachings of this document, those skilled in the art will understand that several embodiments disclosed herein can also be implemented in other ways not disclosed herein. For example, regarding the various units in the electronic device or apparatus embodiments described above, this document divides them based on logical functions, but in actual implementation, there may be other division methods. As another example, multiple units or components can be combined or integrated into another system, or some features or functions in a unit or component can be selectively disabled. Regarding the connection relationships between different units or components, the connections discussed above in conjunction with the accompanying drawings can be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections utilizing interfaces, where the communication interface can support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
[0114] In this disclosure, the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units. The aforementioned components or units may be located in the same location or distributed across multiple network units. Furthermore, depending on actual needs, some or all of the units can be selected to achieve the purpose of the solution described in the embodiments of this disclosure. Additionally, in some scenarios, multiple units in the embodiments of this disclosure may be integrated into one unit or each unit may exist physically independently.
[0115] In some implementation scenarios, the integrated unit described above can be implemented as a software program module. If implemented as a software program module and sold or used as an independent product, the integrated unit can be stored in a computer-readable memory. Therefore, when the disclosed solution is embodied in a software product (e.g., a computer-readable storage medium), the software product can be stored in a memory and may include several instructions to cause a computer device (e.g., a personal computer, server, or network device) to execute some or all of the steps of the method described in the embodiments of this disclosure. The aforementioned memory may include, but is not limited to, various media capable of storing program code, such as USB flash drives, flash disks, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.
[0116] In other implementation scenarios, the integrated units described above can also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and / or analog circuits. The physical implementation of the circuit's hardware structure may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors. Therefore, the various devices described herein (e.g., computing devices or other processing devices) can be implemented using appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Furthermore, the aforementioned storage units or storage devices can be any suitable storage medium (including magnetic storage media or magneto-optical storage media), such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), ROM, and RAM.
[0117] The foregoing can be better understood in light of the following clauses:
[0118] Clause 1. A computing device, comprising:
[0119] A processing circuit array, comprising multiple processing circuits connected in a one-dimensional or multi-dimensional array structure, wherein the processing circuit array is configured as multiple processing circuit sub-arrays and performs multi-threaded operations in response to receiving multiple arithmetic instructions, and each processing circuit sub-array is configured to execute at least one of the multiple arithmetic instructions,
[0120] wherein the multiple arithmetic instructions are obtained by parsing a computation instruction received by the computing device.
[0121] Clause 2. The computing device according to Clause 1, wherein the opcode of the computation instruction represents a plurality of operations performed by the processing circuit array, the computing device further comprising a control circuit configured to acquire the computation instruction and parse the computation instruction to obtain the plurality of arithmetic instructions corresponding to the plurality of operations represented by the opcode.
[0122] Clause 3. The computing device according to Clause 2, wherein the control circuit configures the processing circuit array according to the plurality of arithmetic instructions to obtain the plurality of processing circuit subarrays.
[0123] Clause 4. The computing device according to Clause 3, wherein the control circuitry includes a register for storing configuration information, and the control circuitry extracts the corresponding configuration information according to the plurality of arithmetic instructions, and configures the processing circuitry array according to the configuration information to obtain the plurality of processing circuitry subarrays.
[0124] Clause 5. The computing device according to Clause 1, wherein the plurality of arithmetic instructions includes at least one multi-stage pipelined operation, wherein the multi-stage pipelined operation includes at least two arithmetic instructions.
[0125] Clause 6. The computing device according to Clause 1, wherein the arithmetic instruction includes a predicate, and each of the processing circuits determines whether to execute the arithmetic instruction associated therewith based on the predicate.
[0126] Clause 7. The computing device according to Clause 1, wherein the processing circuit array is a one-dimensional array, and one or more processing circuits in the processing circuit array are configured as a subarray of the processing circuits.
[0127] Clause 8. The computing device according to Clause 1, wherein the processing circuit array is a two-dimensional array, and wherein:
[0128] One or more rows of processing circuits in the processing circuit array are configured as a sub-array of the processing circuits; or
[0129] One or more columns of processing circuits in the processing circuit array are configured as a sub-array of the processing circuits; or
[0130] One or more rows of processing circuits along the diagonal direction in the processing circuit array are configured as a processing circuit subarray.
[0131] Clause 9. The computing device according to Clause 8, wherein the plurality of processing circuits located in the two-dimensional array are configured to be connected to one or more other processing circuits in the same row, column or diagonal in at least one of their row, column or diagonal directions in a predetermined two-dimensional spacing pattern.
[0132] Clause 10. The computing device according to Clause 9, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced in the connection.
[0133] Clause 11. The computing device according to Clause 1, wherein the processing circuit array is a three-dimensional array, and one or more three-dimensional subarrays in the processing circuit array are configured as one processing circuit subarray.
[0134] Clause 12. The computing device according to Clause 11, wherein the three-dimensional array is a three-dimensional array consisting of multiple layers, wherein each layer comprises a two-dimensional array of multiple processing circuits arranged along the row direction, column direction, and diagonal direction, wherein:
[0135] The processing circuit located in the three-dimensional array is configured to be connected to one or more other processing circuits in the same row, column, diagonal, or different layers in at least one of its row, column, diagonal, and layer directions in a predetermined three-dimensional spacing pattern.
[0136] Clause 13. The computing device according to Clause 12, wherein the predetermined three-dimensional spacing pattern is associated with the number of spacings and the number of spacing layers between the processing circuits to be connected.
[0137] Clause 14. A computing device according to any one of Clauses 7-13, wherein a plurality of processing circuits in the processing circuit subarray form one or more closed loops.
[0138] Clause 15. The computing device according to Clause 1, wherein each of the said processing circuit subarrays is adapted to perform at least one of the following operations: arithmetic operations, logical operations, comparison operations, and table lookup operations.
[0139] Clause 16. The computing device according to Clause 1 further includes a data operation circuit, the data operation circuit including a pre-operation circuit and / or a post-operation circuit, wherein the pre-operation circuit is configured to perform preprocessing of input data for at least one of the arithmetic instructions, and the post-operation circuit is configured to perform post-processing of output data for at least one arithmetic instruction.
[0140] Clause 17. The computing device according to Clause 16, wherein the preprocessing includes data placement and / or table lookup operations, and the postprocessing includes data type conversion and / or compression operations.
[0141] Clause 18. The computing device according to Clause 17, wherein the data placement includes, according to the data type of the input data and / or output data of the arithmetic instruction, splitting or merging the input data and / or output data accordingly, and then transmitting them to the corresponding processing circuit for operation.
[0142] Clause 19. An integrated circuit chip comprising a computing device according to any one of Clauses 1-18.
[0143] Clause 20. A board including an integrated circuit chip as described in Clause 19.
[0144] Clause 21. An electronic device comprising an integrated circuit chip as described in Clause 19.
[0145] Clause 22. A method of performing computation using a computing device, wherein the computing device includes a processing circuit array, the processing circuit array being composed of a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, and the processing circuit array being configured as a plurality of processing circuit subarrays, the method comprising:
[0146] Receiving a computation instruction at the computing device and parsing it to obtain a plurality of arithmetic instructions; and
[0147] In response to receiving the plurality of arithmetic instructions, performing multi-threaded operations using the plurality of processing circuit subarrays, wherein each of the plurality of processing circuit subarrays is configured to execute at least one of the plurality of arithmetic instructions.
[0148] Clause 23. The method according to Clause 22, wherein the opcode of the computation instruction represents a plurality of operations performed by the processing circuit array, the computing device further comprising a control circuit, the method comprising using the control circuit to acquire the computation instruction and parse the computation instruction to obtain the plurality of arithmetic instructions corresponding to the plurality of operations represented by the opcode.
[0149] Clause 24. The method according to Clause 23, wherein the control circuit is used to configure the processing circuit array according to the plurality of arithmetic instructions to obtain the plurality of processing circuit subarrays.
[0150] Clause 25. The method according to Clause 24, wherein the control circuit includes a register for storing configuration information, and the method includes using the control circuit to extract corresponding configuration information according to the plurality of arithmetic instructions, and configuring the processing circuit array according to the configuration information to obtain the plurality of processing circuit subarrays.
[0151] Clause 26. The method according to Clause 22, wherein the plurality of arithmetic instructions includes at least one multi-stage pipelined operation, and the multi-stage pipelined operation includes at least two arithmetic instructions.
[0152] Clause 27. The method according to Clause 22, wherein the arithmetic instruction includes a predicate, and the method further includes using each of the processing circuits to determine, based on the predicate, whether to execute the arithmetic instruction associated therewith.
[0153] Clause 28. The method according to Clause 22, wherein the processing circuit array is a one-dimensional array, and the method includes configuring one or more processing circuits in the processing circuit array as a subarray of the processing circuits.
[0154] Clause 29. The method according to Clause 22, wherein the processing circuit array is a two-dimensional array, and the method further comprises:
[0155] Configuring one or more rows of processing circuits in the processing circuit array as a sub-array of the processing circuits; or
[0156] Configuring one or more columns of processing circuits in the processing circuit array as a sub-array of the processing circuits; or
[0157] Configuring one or more rows of processing circuits along the diagonal direction in the processing circuit array as a sub-array of the processing circuits.
[0158] Clause 30. The method according to Clause 29, wherein the plurality of processing circuits located in the two-dimensional array are configured to be connected to one or more other processing circuits in the same row, column, or diagonal in at least one of their row, column, or diagonal directions in a predetermined two-dimensional spacing pattern.
[0159] Clause 31. The method according to Clause 30, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced in the connection.
[0160] Clause 32. The method according to Clause 22, wherein the processing circuit array is a three-dimensional array, and the method includes configuring one or more three-dimensional subarrays in the processing circuit array as a single processing circuit subarray.
[0161] Clause 33. The method according to Clause 32, wherein the three-dimensional array is a three-dimensional array consisting of multiple layers, wherein each layer comprises a two-dimensional array of multiple processing circuits arranged along row, column, and diagonal directions, the method comprising:
[0162] The processing circuit located in the three-dimensional array is configured to be connected to one or more other processing circuits in the same row, column, diagonal, or different layers in at least one of its row, column, diagonal, and layer directions in a predetermined three-dimensional spacing pattern.
[0163] Clause 34. The method according to Clause 33, wherein the predetermined three-dimensional spacing pattern is associated with the number of spacings and the number of spacing layers between the processing circuits to be connected.
[0164] Clause 35. The method according to any one of Clauses 28-34, wherein the plurality of processing circuits in the processing circuit subarray form one or more closed loops.
[0165] Clause 36. The method according to Clause 22, wherein each of the said processing circuit subarrays is adapted to perform at least one of the following operations: arithmetic operation, logical operation, comparison operation and table lookup operation.
[0166] Clause 37. The method according to Clause 22, wherein the computing device further includes a data operation circuit, the data operation circuit including a pre-operation circuit and / or a post-operation circuit, the method including using the pre-operation circuit to perform preprocessing of input data for at least one of the arithmetic instructions and / or using the post-operation circuit to perform post-processing of output data for at least one arithmetic instruction.
[0167] Clause 38. The method according to Clause 37, wherein the preprocessing includes data placement and / or table lookup operations, and the postprocessing includes data type conversion and / or compression operations.
[0168] Clause 39. The method according to Clause 38, wherein the data placement includes splitting or merging the input data and / or output data according to the data type of the input data and / or output data of the operation instruction, and then transmitting them to the corresponding processing circuit for operation.
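The splitting and merging of Clause 39 may, purely as an illustrative assumption, be pictured as repacking data between bit widths. In the Python sketch below, the function names `split_data` and `merge_data`, the use of plain integers as data words, and the low-piece-first ordering are hypothetical choices; the disclosure does not specify a concrete layout.

```python
from typing import List, Sequence


def split_data(words: Sequence[int], word_bits: int, part_bits: int) -> List[int]:
    """Split each word_bits-wide word into part_bits-wide pieces, low piece first."""
    if part_bits <= 0 or word_bits % part_bits != 0:
        raise ValueError("word_bits must be a positive multiple of part_bits")
    mask = (1 << part_bits) - 1
    pieces_per_word = word_bits // part_bits
    return [(word >> (i * part_bits)) & mask
            for word in words
            for i in range(pieces_per_word)]


def merge_data(pieces: Sequence[int], part_bits: int, word_bits: int) -> List[int]:
    """Inverse of split_data: pack consecutive pieces back into wider words."""
    if part_bits <= 0 or word_bits % part_bits != 0:
        raise ValueError("word_bits must be a positive multiple of part_bits")
    pieces_per_word = word_bits // part_bits
    if len(pieces) % pieces_per_word != 0:
        raise ValueError("number of pieces does not fill whole words")
    mask = (1 << part_bits) - 1
    words = []
    for base in range(0, len(pieces), pieces_per_word):
        word = 0
        for i in range(pieces_per_word):
            word |= (pieces[base + i] & mask) << (i * part_bits)
        words.append(word)
    return words
```

Under these assumptions, a 32-bit input can be split into two 16-bit operands before being sent to processing circuits that operate on 16-bit data, and two 16-bit results can be merged back into one 32-bit output.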
[0169] While numerous embodiments of this disclosure have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Many modifications, alterations, and alternatives will occur to those skilled in the art without departing from the spirit and intent of this disclosure. It should be understood that various alternatives to the embodiments of this disclosure described herein may be employed in the practice of this disclosure. The appended claims are intended to define the scope of this disclosure and therefore cover equivalents or alternatives within the scope of these claims.
Claims
1. A computing device, comprising: a processing circuit array comprising multiple processing circuits connected in a one-dimensional or multi-dimensional array structure, wherein the processing circuit array is configured into multiple processing circuit subarrays in response to receiving a plurality of arithmetic instructions, wherein the plurality of arithmetic instructions include a multi-stage pipelined operation, and each processing circuit subarray is configured to act as a different stage of the pipelined operation, and wherein the plurality of arithmetic instructions are obtained by parsing a computing instruction received by the computing device; and a control circuit, which configures the processing circuit array according to corresponding configuration information extracted from the plurality of arithmetic instructions or configuration instructions to obtain the plurality of processing circuit subarrays.
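For illustration only, the arrangement recited in claim 1 can be modelled in software as follows. The Python sketch partitions a one-dimensional array into subarrays and steps a pipeline cycle by cycle, with each stage standing in for one subarray. The names `configure_subarrays` and `run_pipeline`, the stage sizes, and the latch-based timing model are assumptions of this sketch, not features taken from the disclosure; real hardware would execute the stages concurrently.

```python
from typing import Callable, List, Optional, Sequence


def configure_subarrays(num_circuits: int, stage_sizes: Sequence[int]) -> List[List[int]]:
    """Partition circuit indices 0..num_circuits-1 into consecutive subarrays."""
    if sum(stage_sizes) > num_circuits:
        raise ValueError("configuration needs more circuits than the array has")
    subarrays, start = [], 0
    for size in stage_sizes:
        subarrays.append(list(range(start, start + size)))
        start += size
    return subarrays


def run_pipeline(stages: Sequence[Callable[[float], float]],
                 inputs: Sequence[float]) -> List[float]:
    """Cycle-by-cycle model: each stage (one subarray) handles a different item per cycle."""
    n = len(stages)
    regs: List[Optional[float]] = [None] * n  # regs[k]: operand waiting at stage k
    feed = iter(inputs)
    outputs = []
    for _ in range(len(inputs) + n):
        if regs[-1] is not None:              # last stage retires a result
            outputs.append(stages[-1](regs[-1]))
        for k in range(n - 1, 0, -1):         # earlier stages hand results forward
            regs[k] = None if regs[k - 1] is None else stages[k - 1](regs[k - 1])
        regs[0] = next(feed, None)            # a new operand enters the first stage
    return outputs
```

With three stages that double, increment, and negate their operand, `run_pipeline([lambda x: x * 2, lambda x: x + 1, lambda x: -x], [1, 2, 3])` yields `[-3, -5, -7]`, and `configure_subarrays(8, [2, 3, 3])` assigns circuits 0-1, 2-4, and 5-7 to the three stages.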
2. The computing device according to claim 1, wherein the opcode of the computing instruction represents a plurality of operations executed by the processing circuit array, and the control circuit is further configured to acquire the computing instruction and parse the computing instruction to obtain the plurality of arithmetic instructions corresponding to the plurality of operations represented by the opcode.
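The parsing step of claim 2 may be pictured, under stated assumptions, as a table lookup from one opcode to several arithmetic instructions. In the Python sketch below, the opcode names, the table `OPCODE_TABLE`, and the `ArithmeticInstruction` record are hypothetical; the disclosure does not specify an instruction encoding.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ArithmeticInstruction:
    operation: str  # operation to perform, e.g. "mul"
    stage: int      # pipeline stage (subarray) assumed to execute it


# Hypothetical opcodes, invented for this sketch only.
OPCODE_TABLE: Dict[str, Tuple[str, ...]] = {
    "MULADD": ("mul", "add"),
    "MULADDCMP": ("mul", "add", "cmp"),
}


def parse_computing_instruction(opcode: str) -> List[ArithmeticInstruction]:
    """Expand one computing instruction into its arithmetic instructions."""
    try:
        operations = OPCODE_TABLE[opcode]
    except KeyError:
        raise ValueError(f"unknown opcode: {opcode}") from None
    return [ArithmeticInstruction(op, stage) for stage, op in enumerate(operations)]
```

In this model a single `MULADD` computing instruction expands into a multiply assigned to stage 0 and an add assigned to stage 1.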
3. The computing device according to claim 1, wherein the control circuit includes a register for storing configuration information, and the control circuit extracts corresponding configuration information from the register according to the plurality of arithmetic instructions, and configures the processing circuit array according to the configuration information to obtain the plurality of processing circuit subarrays.
4. The computing device according to claim 1, wherein the plurality of arithmetic instructions includes at least one multi-stage pipelined operation, and the multi-stage pipelined operation includes at least two arithmetic instructions.
5. The computing device of claim 1, wherein the arithmetic instruction includes a predicate, and each of the processing circuits determines whether to execute the arithmetic instruction associated therewith based on the predicate.
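A minimal sketch of the predicate of claim 5 follows, assuming that each processing circuit holds one operand and that a circuit whose predicate is false passes its operand through unchanged. Both assumptions, and the function name `execute_with_predicate`, are introduced here for illustration and do not come from the disclosure.

```python
from typing import Callable, List, Sequence


def execute_with_predicate(values: Sequence[float],
                           predicate: Callable[[int, float], bool],
                           operation: Callable[[float], float]) -> List[float]:
    """Apply `operation` only on circuits whose predicate evaluates to True.

    Circuit i holds values[i]; predicate(i, value) decides whether circuit i
    executes the instruction. Skipping circuits keep their operand (assumption).
    """
    return [operation(v) if predicate(i, v) else v for i, v in enumerate(values)]
```

For instance, `execute_with_predicate([1, 2, 3, 4], lambda i, v: i % 2 == 0, lambda v: v * 10)` lets only the even-indexed circuits execute and returns `[10, 2, 30, 4]`.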
6. The computing device of claim 1, wherein the processing circuit array is a one-dimensional array, and one or more processing circuits in the processing circuit array are configured as one processing circuit subarray.
7. The computing device of claim 1, wherein the processing circuit array is a two-dimensional array, and wherein: one or more rows of processing circuits in the processing circuit array are configured as one processing circuit subarray; or one or more columns of processing circuits in the processing circuit array are configured as one processing circuit subarray; or one or more diagonal lines of processing circuits in the processing circuit array are configured as one processing circuit subarray.
8. The computing device of claim 7, wherein the plurality of processing circuits located in the two-dimensional array are each configured to be connected, in a predetermined two-dimensional spacing pattern and along at least one of its row, column, or diagonal directions, to one or more other processing circuits in the same row, column, or diagonal.
9. The computing device of claim 8, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits lying between the connected processing circuits.
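The two-dimensional spacing pattern of claims 8 and 9 may be sketched, by way of example only, as a rule that connects each processing circuit to the circuit a fixed number of positions away in a chosen direction. In the Python fragment below, the name `connect_2d`, the `(row, column)` coordinates, the reading of `spacing` as the number of circuits skipped, and the wrap-around at the array edges are assumptions of this sketch.

```python
from typing import Dict, Tuple

# (row step, column step) for each connection direction
STEP = {"row": (0, 1), "column": (1, 0), "diagonal": (1, 1)}


def connect_2d(rows: int, cols: int, direction: str,
               spacing: int) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Map every circuit (row, col) to the circuit it connects to.

    `spacing` is the number of circuits skipped between connected circuits
    (assumed meaning); indices wrap around at the edges (also an assumption).
    """
    d_row, d_col = STEP[direction]
    hop = spacing + 1  # skipping `spacing` circuits means moving spacing + 1 positions
    return {(r, c): ((r + d_row * hop) % rows, (c + d_col * hop) % cols)
            for r in range(rows) for c in range(cols)}
```

In a 4 x 4 array with the row direction and a spacing of one circuit, the circuit at row 0, column 0 is connected to the circuit at row 0, column 2, and the circuit at column 3 wraps around to column 1.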
10. The computing device of claim 1, wherein the processing circuit array is a three-dimensional array, and one or more three-dimensional subarrays in the processing circuit array are configured as one processing circuit subarray.
11. The computing device of claim 10, wherein the three-dimensional array consists of multiple layers, each layer comprising a two-dimensional array of multiple processing circuits arranged along the row direction, column direction, and diagonal direction, wherein: a processing circuit located in the three-dimensional array is configured to be connected, in a predetermined three-dimensional spacing pattern and along at least one of its row, column, diagonal, and layer directions, to one or more other processing circuits in the same row, the same column, the same diagonal, or a different layer.
12. The computing device of claim 11, wherein the predetermined three-dimensional spacing pattern is associated with the number of processing circuits and the number of layers lying between the processing circuits to be connected.
13. The computing device of claim 6, wherein the plurality of processing circuits in the processing circuit subarray form one or more closed loops.
14. The computing device of claim 1, wherein each of the processing circuit subarrays is adapted to perform at least one of the following operations: arithmetic operations, logical operations, comparison operations, and table lookup operations.
15. The computing device of claim 1, further comprising a data operation circuit including a pre-operation circuit and/or a post-operation circuit, wherein the pre-operation circuit is configured to preprocess input data of at least one of the arithmetic instructions, and the post-operation circuit is configured to post-process output data of at least one of the arithmetic instructions.
16. The computing device of claim 15, wherein the preprocessing includes data placement and/or table lookup operations, and the post-processing includes data type conversion and/or compression operations.
17. The computing device according to claim 16, wherein the data placement includes splitting or merging the input data and/or output data according to the data type of the input data and/or output data of the arithmetic instruction, and then transmitting the result to the corresponding processing circuit for operation.
18. An integrated circuit chip comprising a computing device according to any one of claims 1-17.
19. A board comprising the integrated circuit chip according to claim 18.
20. An electronic device comprising the integrated circuit chip according to claim 18.
21. A method of performing computation using a computing device, wherein the computing device includes a processing circuit array formed by connecting a plurality of processing circuits in a one-dimensional or multi-dimensional array structure, the method comprising: receiving, by the computing device, a computing instruction and parsing it to obtain a plurality of arithmetic instructions; and in response to receiving the plurality of arithmetic instructions, wherein the plurality of arithmetic instructions include a multi-stage pipelined operation, configuring the processing circuit array into a plurality of processing circuit subarrays, each processing circuit subarray acting as a different stage of the pipelined operation; wherein the computing device further includes a control circuit, and the processing circuit array is configured, using corresponding configuration information extracted by the control circuit according to the plurality of arithmetic instructions or configuration instructions, to obtain the plurality of processing circuit subarrays.
22. The method of claim 21, wherein the opcode of the computing instruction represents a plurality of operations performed by the processing circuit array, the method comprising using the control circuit to acquire the computing instruction and parse the computing instruction to obtain the plurality of arithmetic instructions corresponding to the plurality of operations represented by the opcode.
23. The method of claim 21, wherein the control circuit includes a register for storing configuration information, and the method includes using the control circuit to extract corresponding configuration information from the register according to the plurality of arithmetic instructions, and configuring the processing circuit array according to the configuration information to obtain the plurality of processing circuit subarrays.
24. The method according to claim 21, wherein the plurality of arithmetic instructions includes at least one multi-stage pipelined operation, and the multi-stage pipelined operation includes at least two arithmetic instructions.
25. The method of claim 21, wherein the arithmetic instruction includes a predicate, and the method further includes using each of the processing circuits to determine, based on the predicate, whether to execute the arithmetic instruction associated therewith.
26. The method of claim 21, wherein the processing circuit array is a one-dimensional array, and the method includes configuring one or more processing circuits in the processing circuit array as one processing circuit subarray.
27. The method of claim 21, wherein the processing circuit array is a two-dimensional array, and the method further comprises: configuring one or more rows of processing circuits in the processing circuit array as one processing circuit subarray; or configuring one or more columns of processing circuits in the processing circuit array as one processing circuit subarray; or configuring one or more diagonal lines of processing circuits in the processing circuit array as one processing circuit subarray.
28. The method of claim 27, wherein the plurality of processing circuits located in the two-dimensional array are each configured to be connected, in a predetermined two-dimensional spacing pattern and along at least one of its row, column, or diagonal directions, to one or more other processing circuits in the same row, column, or diagonal.
29. The method of claim 28, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits lying between the connected processing circuits.
30. The method of claim 21, wherein the processing circuit array is a three-dimensional array, and the method includes configuring one or more three-dimensional subarrays in the processing circuit array as a single processing circuit subarray.
31. The method of claim 30, wherein the three-dimensional array consists of multiple layers, each layer comprising a two-dimensional array of multiple processing circuits arranged along row, column, and diagonal directions, the method comprising: configuring a processing circuit located in the three-dimensional array to be connected, in a predetermined three-dimensional spacing pattern and along at least one of its row, column, diagonal, and layer directions, to one or more other processing circuits in the same row, the same column, the same diagonal, or a different layer.
32. The method of claim 31, wherein the predetermined three-dimensional spacing pattern is associated with the number of processing circuits and the number of layers lying between the processing circuits to be connected.
33. The method according to any one of claims 26-32, wherein the plurality of processing circuits in the processing circuit subarray form one or more closed loops.
34. The method of claim 21, wherein each of the processing circuit subarrays is adapted to perform at least one of the following operations: arithmetic operations, logical operations, comparison operations, and table lookup operations.
35. The method of claim 21, wherein the computing device further comprises a data manipulation circuit including a pre-operation circuit and/or a post-operation circuit, the method comprising using the pre-operation circuit to preprocess input data of at least one of the arithmetic instructions and/or using the post-operation circuit to post-process output data of at least one of the arithmetic instructions.
36. The method of claim 35, wherein the preprocessing includes data placement and/or table lookup operations, and the post-processing includes data type conversion and/or compression operations.
37. The method according to claim 36, wherein the data placement includes splitting or merging the input data and/or output data according to the data type of the input data and/or output data of the arithmetic instruction, and then transmitting the result to the corresponding processing circuit for operation.