Arithmetic processing unit and arithmetic processing method

The arithmetic processing device optimizes super-resolution processing by sharing weights and address generation across multiple arithmetic units, reducing computational resources and enhancing efficiency.

JP2026096756A · Pending · Publication Date: 2026-06-15 · AKUSERU KK

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
AKUSERU KK
Filing Date
2024-12-03
Publication Date
2026-06-15


Abstract

[Problem] To reduce the computing resources of a processing unit used for super-resolution processing. [Solution] The device comprises multiple computing cores (20, 30, 40, 50) that perform convolution processing using the same trained model; a first bus connecting a computing unit included in the preceding computing core to the foremost computing unit of the subsequent computing core; and a weight memory 21, provided only in the foremost computing core 20, that stores the weights to be transferred to each computing core. A computing unit in each computing core outputs the weights it receives to the foremost computing unit of the subsequent computing core, thereby transferring the weights between the computing cores.

Description

【Technical Field】 【0001】 The present invention relates to an arithmetic processing device and an arithmetic processing method. 【Background Art】 【0002】 Super-resolution technology for generating a high-resolution image from a low-resolution image is known. Such super-resolution processing is also utilized in the field of gaming machines. It has been considered to execute super-resolution processing that was previously performed by calculation algorithms such as bilinear and bicubic using inference processing with a learned model of machine learning including a network structure and weights. In the following description, super-resolution processing using inference processing is also referred to as super-resolution processing using AI or simply super-resolution processing. Convolution processing is an important process in super-resolution processing. A processor having a characteristic configuration for performing convolution processing is disclosed in Patent Document 1. The processor disclosed in Patent Document 1 includes a plurality of processing cores each having a plurality of processing elements arranged in a matrix called a systolic array. The processor of Patent Document 1 inputs information on low-resolution data to be subjected to super-resolution processing and weight data based on a learned model to each processing core and performs convolution processing. More specifically, input data is supplied in the row direction of the systolic array, weight data is supplied in the column direction of the systolic array, and the input data and the weight data are sequentially multiplied and integrated by the processing elements based on the network structure based on the learned model. 【Prior Art Documents】 【Patent Documents】 【0003】 【Patent Document 1】 Japanese Unexamined Patent Application Publication No. 
2020-77298 【Summary of the Invention】 【Problems to be Solved by the Invention】 【0004】 By linking together many processing cores with systolic arrays, as described in Patent Document 1, and having each processing core process a portion of the divided low-resolution data, super-resolution processing, including convolution, can be expected to run faster. On the other hand, providing many processing cores increases the computing resources required. One aspect of the present invention aims to reduce the computing resources of a processing unit that performs convolution processing, such as the one disclosed in Patent Document 1. 【Means for Solving the Problem】 【0005】 In one aspect, the arithmetic processing device of the present invention comprises a plurality of arithmetic unit arrays, each including a plurality of arithmetic units, that perform convolution processing using the same trained model. Each arithmetic unit in an arithmetic unit array outputs the weights input from a preceding arithmetic unit to a subsequent arithmetic unit and processes data using the input weights. The arithmetic processing device further comprises a first storage unit, provided only in the foremost arithmetic unit array, that stores the weights to be transferred to each of the arithmetic units, and a first bus that connects a preceding arithmetic unit array to the subsequent arithmetic unit array. One of the arithmetic units in the preceding arithmetic unit array outputs the input weights to the foremost arithmetic unit of the subsequent arithmetic unit array via the first bus, thereby transferring the weights between the arithmetic unit arrays. 【Effects of the Invention】 【0006】 According to one aspect of the present invention, the computing resources of a processing unit that performs convolution processing can be reduced. 
【Brief Description of the Drawings】 【0007】 [Figure 1] A diagram showing a conventional arithmetic processing unit that performs super-resolution processing. [Figure 2] A diagram illustrating the super-resolution processing performed by the arithmetic processing unit shown in Figure 1. [Figure 3] A diagram showing the arithmetic processing unit of Figure 1 in more detail. [Figure 4] A diagram illustrating the arithmetic processing performed by the arithmetic processing unit shown in Figure 3. [Figure 5] A diagram illustrating the configuration of the arithmetic processing unit of this embodiment. [Figure 6] A diagram showing the arithmetic processing unit of Figure 5 in more detail. [Figure 7] A diagram illustrating the arithmetic processing performed by the arithmetic processing unit shown in Figure 6. [Figure 8] A flowchart showing the processing performed by the arithmetic processing unit. [Figure 9] A diagram illustrating the general configuration of a gaming machine. [Figure 10] A diagram showing the configuration of the performance control system. 【Modes for Carrying Out the Invention】 【0008】 Embodiments of the present invention will be described in detail below with reference to the drawings. The present invention is not limited in any way to the following embodiments and can be implemented with appropriate modifications within the scope of the object of the present invention. The arithmetic processing unit of this embodiment is equipped with arithmetic cores, each having a systolic array composed of multiple arithmetic units. The systolic array is configured to perform convolution processing efficiently. The following description uses super-resolution processing, which includes convolution processing using a systolic array, as an example, but the arithmetic processing unit of this embodiment can be applied to convolution processing in general. 
Conventional super-resolution processing using systolic arrays is explained with reference to Figures 1 and 2. Figure 1 shows a conventional arithmetic processing unit that performs super-resolution processing. The arithmetic processing unit 200 shown in Figure 1 comprises a control circuit 70, an arithmetic core A, an arithmetic core B, an arithmetic core C, and an arithmetic core D. 【0009】 Figure 2 illustrates the super-resolution processing performed by the arithmetic processing unit shown in Figure 1. The arithmetic processing unit 200 generates the high-resolution image P0 shown in Figure 2(d) by performing inference processing on the low-resolution image p0 shown in Figure 2(a) as the input image. More specifically, the control circuit 70 divides the low-resolution image p0 input to the arithmetic processing unit 200 to generate the low-resolution divided images p1 to p4 shown in Figure 2(b). The control circuit 70 inputs the divided images p1 to p4 to the arithmetic cores A, B, C, and D, respectively. The arithmetic cores A, B, C, and D each perform super-resolution processing, including convolution, on the input divided images p1 to p4, thereby generating the high-resolution divided images P1 to P4 shown in Figure 2(c). 【0010】 The control circuit 70 then combines the divided images P1 to P4 to generate the high-resolution image P0 shown in Figure 2(d), which is a higher-resolution version of the image p0. The number of divisions of the low-resolution image can be matched to the number of arithmetic cores in the arithmetic processing unit 200. In other words, in super-resolution processing, multiple arithmetic cores are used in coordination according to the size of the low-resolution input image (input data). By increasing the resolution of each divided image with its corresponding arithmetic core, super-resolution processing can be performed efficiently. 
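The divide, process, and recombine flow of Figure 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the image is assumed to be split into four equal quadrants, and nearest-neighbour pixel replication stands in for the inference that each arithmetic core would actually run with the trained model.

```python
# Hypothetical sketch of Figure 2: divide p0 into p1..p4, upscale each tile
# independently (one tile per arithmetic core), then combine P1..P4 into P0.

def split_into_quadrants(image):
    """Divide a 2-D list into four equal tiles (assumes even height and width).
    Order: top-left, top-right, bottom-left, bottom-right."""
    half_h, half_w = len(image) // 2, len(image[0]) // 2
    return [
        [row[:half_w] for row in image[:half_h]],
        [row[half_w:] for row in image[:half_h]],
        [row[:half_w] for row in image[half_h:]],
        [row[half_w:] for row in image[half_h:]],
    ]

def upscale_tile(tile, factor=2):
    """Placeholder for one core's super-resolution inference.
    Here: nearest-neighbour replication, NOT the trained model."""
    out = []
    for row in tile:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def combine_quadrants(tiles):
    """Stitch four processed tiles back into a single image."""
    top_left, top_right, bottom_left, bottom_right = tiles
    top = [a + b for a, b in zip(top_left, top_right)]
    bottom = [a + b for a, b in zip(bottom_left, bottom_right)]
    return top + bottom

def super_resolve(image, factor=2):
    tiles = split_into_quadrants(image)                 # role of the control circuit
    processed = [upscale_tile(t, factor) for t in tiles]  # one tile per core
    return combine_quadrants(processed)                 # role of the control circuit

result = super_resolve([[1, 2], [3, 4]])
```

Because each tile is processed without reference to the others, the four calls to `upscale_tile` could run in parallel, which is the property the multi-core arrangement relies on.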
The more arithmetic units there are in the systolic array of an arithmetic core (described later), the larger the divided image that the arithmetic core can convolve. Therefore, the more arithmetic units an arithmetic core has, the fewer divisions the low-resolution image requires. 【0011】 Figure 3 is a diagram that shows the arithmetic processing unit of Figure 1 in more detail. Figure 4 is a diagram illustrating the arithmetic processing performed by the arithmetic processing unit shown in Figure 3. The arithmetic processing unit 200 includes a control circuit 70 and a plurality of arithmetic cores: arithmetic core A, arithmetic core B, arithmetic core C, and arithmetic core D. The arithmetic cores A, B, C, and D have the same configuration. Each arithmetic core includes a weight memory 81, a data memory 82, an address generator 83, a weight output unit 84, a data output unit 85, a plurality of arithmetic units 86 that construct a systolic array, an output memory 87, and an arithmetic result output unit 88. In Figure 3 and the subsequent drawings, the weight memory is denoted as WM, the data memory as DM, the address generator as AG, the weight output unit as WO, the output memory as OM, the arithmetic unit as PE, and the arithmetic result output unit as OO. 【0012】 The control circuit 70 stores the weights used in the convolution process in the weight memory 81 of each arithmetic core according to the learned model. When performing super-resolution processing, the control circuit 70 also stores, in the data memory 82 of each arithmetic core, the input data corresponding to the low-resolution divided images p1 to p4 shown in Figure 2(b), which are obtained by dividing the low-resolution input image p0 shown in Figure 2(a). 
Then, the control circuit 70 creates address generation information according to the learned model and outputs it to the address generator 83 of each arithmetic core. 【0013】 As a result, the following processing is performed in each arithmetic core. The address generator 83 generates various addresses from the input address generation information and inputs the generated addresses to the weight output unit 84, the data output unit 85, and the arithmetic result output unit 88, respectively. A weight address indicating the output order of the weights is input to the weight output unit 84. A data address indicating the output order of the input data is input to the data output unit 85. A write address indicating the data memory to which the arithmetic result is to be written is input to the arithmetic result output unit 88. An output address indicating the output order of the arithmetic results is input to the output memory 87 via the data output unit 85. The weight output unit 84 outputs the weights according to the weight address. The data output unit 85 outputs the input data according to the data address. 【0014】 As a result, the arithmetic units 86 perform a sum-of-products operation on the input data Dxx written to the data memory 82 and the weights Wxx written to the weight memory 81, as shown in Figure 4, and store the calculation results in the output memory 87. As an example, in the foremost arithmetic core A, the following results are written to the output memory 87:
D11*W11+D12*W12+D13*W13, D11*W21+D12*W22+D13*W23, D11*W31+D12*W32+D13*W33
D21*W11+D22*W12+D23*W13, D21*W21+D22*W22+D23*W23, D21*W31+D22*W32+D23*W33
D31*W11+D32*W12+D33*W13, D31*W21+D32*W22+D33*W23, D31*W31+D32*W32+D33*W33
In the subsequent arithmetic core B, the following results are written to the output memory 87:
D41*W11+D42*W12+D43*W13, D41*W21+D42*W22+D43*W23, D41*W31+D42*W32+D43*W33
D51*W11+D52*W12+D53*W13, D51*W21+D52*W22+D53*W23, D51*W31+D52*W32+D53*W33
D61*W11+D62*W12+D63*W13, D61*W21+D62*W22+D63*W23, D61*W31+D62*W32+D63*W33
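The pattern in the expressions above is that the result in row i, column j is the sum over k of Dik*Wjk. A minimal Python sketch of that arithmetic is given below; the function name and the numeric values of the weights and data are illustrative assumptions, not values from the specification.

```python
def sum_of_products(data_rows, weight_rows):
    """Return out[i][j] = sum over k of data_rows[i][k] * weight_rows[j][k],
    which matches the term pattern Di1*Wj1 + Di2*Wj2 + Di3*Wj3."""
    return [
        [sum(d * w for d, w in zip(d_row, w_row)) for w_row in weight_rows]
        for d_row in data_rows
    ]

# Illustrative values only (assumed for this sketch).
W = [[1, 0, -1],   # W11 W12 W13
     [2, 1, 0],    # W21 W22 W23
     [0, 1, 1]]    # W31 W32 W33

D_core_a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # D11..D33, input to core A
D_core_b = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # D41..D63, input to core B

out_a = sum_of_products(D_core_a, W)   # what core A writes to its output memory
out_b = sum_of_products(D_core_b, W)   # what core B writes to its output memory
```

Both cores apply the same `W` to different data rows, which is the observation the later embodiment builds on.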
The arithmetic unit 86 may, if necessary, also add a value called a bias term in addition to performing the sum-of-products operation. 【0015】 The weights input to each arithmetic core are the same. The input data to the arithmetic cores A through D correspond to the values of the divided images p1 through p4, respectively. Between the arithmetic unit 86 and the output memory 87, there is a processing block (not shown) that performs pooling and activation function processing. The control circuit 70 issues instructions to the processing block according to the learned model so that pooling and activation function processing are applied to the processing results of each arithmetic unit 86, and the results are stored as calculation results in the output memory 87. 【0016】 The output memory 87 outputs the calculation result to the calculation result output unit 88 according to the output address. When the calculation result is input from the output memory 87, the calculation result output unit 88 outputs the calculation result to the data memory 82 that is the write destination, according to the write address. Each arithmetic core performs the convolution process by repeating the above-described processing, generating the high-resolution divided images P1 to P4 shown in Figure 2(c). The control circuit 70 then combines the divided images P1 to P4 to generate the high-resolution image P0 shown in Figure 2(d), which is a higher-resolution version of the image p0 shown in Figure 2(a). 【0017】 The arithmetic processing unit 200 described with reference to Figures 2 to 4 combines arithmetic cores of a well-known configuration and applies them to super-resolution processing using inference processing. The arithmetic processing unit 100 described below has a configuration that is further optimized for super-resolution processing in order to reduce computing resources. 
Each arithmetic core in the arithmetic processing unit 200 uses a divided image obtained by dividing the input image as input data and performs inference processing using the same trained model. In the arithmetic processing unit 200, the same weight values are input to each arithmetic core individually, and each arithmetic core performs its calculations using the same weights and the same addresses, which are generated by its own address generator. If the addresses generated in, and the weights input to, the foremost arithmetic core A were instead shared with the subsequent arithmetic cores B, C, and D, the desired convolution process could be performed without individually inputting weights to the subsequent arithmetic cores or generating addresses in them. Focusing on this point, the arithmetic processing unit 100 of this embodiment reduces computing resources. 【0018】 Figure 5 is a diagram illustrating the configuration of the arithmetic processing unit of this embodiment. The configuration of the arithmetic processing unit 100 will be described with reference to Figure 5. For simplicity, the following explanation assumes that the arithmetic processing unit 100 contains four arithmetic cores. As detailed in Figure 6, the systolic array SA (arithmetic unit array) of each arithmetic core contains nine arithmetic units arranged in a 3x3 grid. However, the number of arithmetic cores in the arithmetic processing unit 100 and the number of arithmetic units in the systolic array SA may be set to any number as appropriate, depending on the type and size of the data being handled. The arithmetic processing unit 100 has a configuration optimized for the convolution processing used in super-resolution processing, thereby reducing computing resources (the number of address generators and the amount of memory used) compared to the arithmetic processing unit 200. 
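To make the claimed saving concrete, the following Python sketch counts logical blocks in the two designs. It is a rough illustration only: it counts block instances rather than silicon area or memory capacity, and the block names and the two function names are assumptions introduced for this sketch.

```python
# Blocks that the text says are needed only in the foremost core when shared.
SHAREABLE = ("weight_memory", "weight_output_unit", "address_generator")

# Blocks that every core keeps in both designs.
PER_CORE = ("data_memory", "data_output_unit", "systolic_array",
            "output_memory", "result_output_unit")

def block_counts(num_cores, shared):
    """Count block instances for a device with num_cores arithmetic cores.
    shared=False models unit 200; shared=True models unit 100."""
    counts = {name: num_cores for name in PER_CORE}
    for name in SHAREABLE:
        counts[name] = 1 if shared else num_cores
    return counts

def blocks_saved(num_cores):
    """Number of block instances removed by sharing."""
    before = block_counts(num_cores, shared=False)
    after = block_counts(num_cores, shared=True)
    return sum(before.values()) - sum(after.values())

conventional = block_counts(4, shared=False)
proposed = block_counts(4, shared=True)
```

Under this simple count, the saving is three blocks for every core after the first, so it grows linearly with the number of cores.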
The super-resolution processing itself is the same as that performed by the arithmetic processing unit 200 described with reference to Figures 1 to 4. That is, a high-resolution image is generated from a low-resolution image as described with reference to Figure 2. 【0019】 The explanation continues with reference to Figure 5. The arithmetic processing unit 100 comprises a control circuit 10 and arithmetic cores 20, 30, 40, and 50. In the following description, when the arithmetic cores 20, 30, 40, and 50 are not distinguished, they are simply referred to as arithmetic cores. In the arithmetic processing unit 100, only the foremost arithmetic core 20 is equipped with a weight memory 21, which will be described later, and the weights are shared among the arithmetic cores. Likewise, only the foremost arithmetic core 20 is equipped with an address generator 23, described later, and the data address, write address, and output address generated by this single address generator are shared among the arithmetic cores. In the arithmetic cores other than the foremost arithmetic core 20, no weights are input to a weight memory and no addresses are generated by an address generator. Therefore, the numbers of weight memories, weight output units, and address generators in the arithmetic processing unit 100 as a whole can be reduced. Even when the number of arithmetic cores in the arithmetic processing unit 100 is increased, the weight memory, weight output unit, and address generator are needed only in the foremost arithmetic core 20. Consequently, the reduction in computing resources grows with the number of arithmetic cores. 【0020】 The arithmetic processing unit 100 includes a transfer bus 60 that connects the arithmetic cores 20, 30, 40, and 50. The transfer bus 60 connects the last arithmetic unit included in the preceding arithmetic core to the first arithmetic unit included in each of the subsequent arithmetic cores 30, 40, and 50. 
The transfer bus 60 is used to transfer weights between the arithmetic cores. The arithmetic processing unit 100 shares the weights stored in the weight memory of the first arithmetic core 20 with the arithmetic cores 30, 40, and 50 via the transfer bus 60. Specifically, the arithmetic core 20 transfers weights to the arithmetic core 30 via the transfer bus 60, the arithmetic core 30 transfers weights to the arithmetic core 40 via the transfer bus 60, and the arithmetic core 40 transfers weights to the arithmetic core 50 via the transfer bus 60. Therefore, the arithmetic processing unit 100 can perform super-resolution processing without providing weight memory in the arithmetic cores 30, 40, and 50. 【0021】 The arithmetic processing unit 100 further includes a transfer bus 90. The transfer bus 90 also connects the arithmetic cores 20, 30, 40, and 50, respectively. The arithmetic processing unit 100 then shares the data address, output address, and write address generated by the address generator 23 of the foremost arithmetic core 20 with the arithmetic cores 30, 40, and 50 via the transfer bus 90. Specifically, the arithmetic core 20 transfers the above address to the arithmetic core 30 via the transfer bus 90, the arithmetic core 30 transfers the above address to the arithmetic core 40 via the transfer bus 90, and the arithmetic core 40 transfers the above address to the arithmetic core 50 via the transfer bus 90. Therefore, the arithmetic processing unit 100 can perform super-resolution processing without providing address generators in the arithmetic cores 30, 40, and 50. The data address indicates the order in which the input data stored in the data memory is transferred (output) to the arithmetic unit. The address generated by the address generator 23 of the arithmetic core 20 is transferred and shared among the data output units of each arithmetic core. 
The write address indicates the data memory to which the calculation result input from the output memory is to be written. The write address generated by the address generator 23 of the arithmetic core 20 is transferred between the data output units and shared among the calculation result output units of each arithmetic core. The output address indicates the output order of the calculation results stored in the output memory. The output address generated by the address generator 23 of the arithmetic core 20 is shared among the output memories of each arithmetic core. 【0022】 Figure 6 is a diagram that explains the arithmetic processing unit of Figure 5 in more detail. Figure 7 illustrates the arithmetic processing performed by the arithmetic processing unit shown in Figure 6. The following explanation refers to Figure 6. As explained with reference to Figure 5, the arithmetic processing unit 100 comprises a control circuit 10 and arithmetic cores 20, 30, 40, and 50. The arithmetic core 20 includes a weight memory 21, a data memory 22, an address generator 23, a weight output unit 24, a data output unit 25, a plurality of arithmetic units 26 that construct a systolic array SA, an output memory 27, and an arithmetic result output unit 28. The weight memory 21 is an example of the first storage unit. The address generator 23 is an example of the generation unit. The weight output unit 24 is an example of the first output unit. 【0023】 The arithmetic core 30 includes a data memory 32, a data output unit 35, a plurality of arithmetic units 36 that construct a systolic array SA, an output memory 37, and an arithmetic result output unit 38, but does not include an address generator, a weight memory, or a weight output unit. The arithmetic core 40 includes a data memory 42, a data output unit 45, a plurality of arithmetic units 46 that construct a systolic array SA, an output memory 47, and an arithmetic result output unit 48, but does not include an address generator, a weight memory, or a weight output unit. 
The arithmetic core 50 includes a data memory 52, a data output unit 55, a plurality of arithmetic units 56 that construct a systolic array SA, an output memory 57, and an arithmetic result output unit 58, but does not include an address generator, a weight memory, or a weight output unit. In each arithmetic core, the systolic array SA has, for example, arithmetic units arranged in a matrix. 【0024】 In the following explanation, when the data memories 22, 32, 42, and 52 are not distinguished, they are simply referred to as data memories. A data memory is an example of the second storage unit. When the data output units 25, 35, 45, and 55 are not distinguished, they are simply referred to as data output units. A data output unit is an example of the second output unit. When the arithmetic units 26, 36, 46, and 56 are not distinguished, they are simply referred to as arithmetic units. The arithmetic units 26, 36, 46, and 56 are examples of the arithmetic units of the present invention. When the output memories 27, 37, 47, and 57 are not distinguished, they are simply referred to as output memories. An output memory is an example of the third storage unit. When the calculation result output units 28, 38, 48, and 58 are not distinguished, they are simply referred to as calculation result output units. A calculation result output unit is an example of the third output unit. 【0025】 Each arithmetic core performs convolution using the same trained model. Each arithmetic unit outputs the weights input from the preceding arithmetic unit to the subsequent arithmetic unit and also processes data using the input weights. The last-stage arithmetic unit 56 stores the calculation results, which were obtained by processing the data along the path indicated by the vertical arrow in Figure 6, in the output memory 57. 【0026】 The control circuit 10 stores the weights to be used in the convolution process in the weight memory 21 of the foremost arithmetic core 20, according to the trained model. 
As shown in Figure 6, of the connected arithmetic cores of the arithmetic processing unit 100, only the foremost arithmetic core 20 is equipped with a weight memory (weight memory 21). The weights stored in the weight memory 21 are sequentially transferred to each subsequent arithmetic unit along the vertical arrow paths between the arithmetic units in the systolic array SA within the arithmetic core. The weights stored in the weight memory 21 are also transferred between the arithmetic cores via the transfer bus 60. Weights are transferred via the transfer bus 60 from the last arithmetic unit in the systolic array SA of the preceding arithmetic core to the first arithmetic unit in the systolic array SA of the next arithmetic core. For example, weights are transferred from the arithmetic unit 26 in the last stage 26A of the systolic array SA of the arithmetic core 20 to the arithmetic unit 36 in the first stage 36A of the systolic array SA of the subsequent arithmetic core 30 via a path indicated by a vertical arrow. 【0027】 The weights may instead be transferred via the transfer bus 60 from an arithmetic unit other than the last arithmetic unit of the preceding arithmetic core to the first arithmetic unit of the next arithmetic core. As an example of this case, weights can be transferred from the arithmetic unit 36 in the middle stage 36B of the systolic array SA of the arithmetic core 30 to the arithmetic unit 46 in the first stage 46A of the systolic array SA of the subsequent arithmetic core 40 via a path indicated by a vertical arrow. Alternatively, weights can be transferred from the arithmetic unit 46 in the first stage 46A of the systolic array SA of the arithmetic core 40 to the arithmetic unit 56 in the first stage 56A of the systolic array SA of the subsequent arithmetic core 50 via a path indicated by a vertical arrow. 
Since the same weights are used by the arithmetic units aligned in the vertical direction, the result is the same regardless of which arithmetic unit transfers the weights to the next arithmetic core. The arrangement shown in Figure 6 is one example. In all arithmetic cores except the last-stage core, the weights may be output from the last arithmetic unit to the subsequent arithmetic core via the transfer bus 60. As a variation, in all arithmetic cores except the last-stage core, the weights may be output from any arithmetic unit other than the last one. The stage of the arithmetic unit that outputs weights to the subsequent arithmetic core may differ for each arithmetic core. Even within a single arithmetic core, the stage of the arithmetic unit that outputs weights to the subsequent arithmetic core may differ. 【0028】 In Figure 6, the arithmetic core 20 has three weight memories 21. However, if the weight memory 21 is implemented as RAM, three regions within the RAM may each be used as a weight memory 21. When performing super-resolution processing, the control circuit 10 stores the input data corresponding to the low-resolution divided images p1 to p4 shown in Figure 2(b) in the data memories 22, 32, 42, and 52 of the arithmetic cores 20, 30, 40, and 50, respectively. In Figure 6, each arithmetic core has three data memories. However, if the data memory is implemented as RAM, three regions within the RAM may be used as separate data memories. 【0029】 The control circuit 10 then creates address generation information according to the learned model and outputs it to the address generator 23, which is provided only in the foremost arithmetic core 20. The address generator 23 generates various addresses from the input address generation information and inputs the generated addresses to the weight output unit 24 and the data output unit 25, respectively. 
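The stage-to-stage weight forwarding described above, including the hop between cores over the transfer bus 60, can be modelled as a simple shift register. The Python sketch below is a simplified single-column model under stated assumptions: one weight moves one stage per clock, the hand-off to the next core happens at the last stage of each core, and the function name and parameters are hypothetical.

```python
def simulate_weight_chain(weights, num_cores=4, stages_per_core=3):
    """Push a weight stream through num_cores * stages_per_core stages.

    Stage 0 is the foremost arithmetic unit of the foremost core and is the
    only stage fed from the weight memory. Every other stage receives
    whatever its predecessor held on the previous clock. The boundary between
    stage (stages_per_core - 1) of one core and stage 0 of the next core
    plays the role of the transfer bus. Returns, per stage, the list of
    weights it has seen in arrival order.
    """
    total_stages = num_cores * stages_per_core
    held = [None] * total_stages              # weight held by each stage
    seen = [[] for _ in range(total_stages)]

    # Pad the stream with idle clocks so the last weights reach the final stage.
    stream = list(weights) + [None] * total_stages
    for incoming in stream:
        held = [incoming] + held[:-1]         # one clock: shift by one stage
        for stage, weight in enumerate(held):
            if weight is not None:
                seen[stage].append(weight)
    return seen

seen = simulate_weight_chain([5, 6, 7])
```

In this model every stage, in every core, ends up seeing the same weight sequence, only delayed by its distance from the weight memory. That is why only the foremost core needs to store the weights.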
【0030】 The weight output unit 24 of the foremost arithmetic core 20 receives a weight address (first address) that indicates the order in which the weights stored in the weight memory 21 are to be output to the foremost arithmetic unit 26. The weight address indicates the order in which the weights are to be output for a single arithmetic core. As described above, the weights are shared among the arithmetic cores via the transfer bus 60, and the convolution process is performed using the same weights in each arithmetic core. 【0031】 The weight output unit 24 reads the weights from the weight memory 21 according to the input weight address and outputs them to the first-stage arithmetic unit 26 included in the arithmetic core 20. As a result, the weights output from the weight memory 21 are passed from one arithmetic unit to the next along the vertical arrow paths in Figure 6 and are sequentially transferred as far as the last-stage arithmetic units 56 of the arithmetic core 50. The transfer bus 60 is used for transferring the weights between arithmetic cores. 【0032】 The data output unit 25 of the foremost arithmetic core 20 receives the data address (second address) generated by the address generator 23. The data output unit 25 reads the input data from the connected data memory 22 according to the input data address and outputs the input data to the foremost arithmetic unit 26 in the lateral direction. As a result, the input data is transferred laterally within the arithmetic core 20 from the foremost arithmetic unit 26 to the last arithmetic unit 26. 【0033】 The data output unit 25 of the foremost arithmetic core 20 also receives the output address and the write address generated by the address generator 23. The data output unit 25 outputs the received output address and write address to the connected output memory 27. 
In other words, the output address (third address) indicating the output order of the calculation results is input to the output memory 27 included in the foremost arithmetic core 20 via the data output unit 25. The output memory 27 outputs the calculation results it stores to the calculation result output unit 28 according to the input output address. The output memory 27 also outputs the write address input from the data output unit 25 to the connected calculation result output unit 28. That is, the calculation result output unit 28 receives the write address (fourth address) indicating the write destination of the calculation result via the data output unit 25 and the output memory 27. The calculation result output unit 28 outputs the calculation result input from the output memory 27 to the specified data memory 22 according to the input write address. As a result, each calculation result written to the data memory 22 is read by the data output unit 25 and input to the arithmetic units 26 as new input data in the next calculation process. 【0034】 The arithmetic processing unit 100 includes a transfer bus 90 that connects, among the data output units 25, 35, 45, and 55, the data output unit included in each preceding arithmetic core to the data output unit included in the subsequent arithmetic core. The data address, output address, and write address input to the data output unit 25 are shared with the subsequent arithmetic cores via the transfer bus 90. The data output unit 25 of the arithmetic core 20, which has received the data address, output address, and write address as input, outputs (transfers) the data address, output address, and write address to the data output unit 35 of the subsequent arithmetic core 30 via the transfer bus 90. 
The data output unit 35 of the arithmetic core 30, which has received the data address, output address, and write address as input, outputs (transfers) the data address, output address, and write address to the data output unit 45 of the subsequent arithmetic core 40 via the transfer bus 90. The data output unit 45 of the arithmetic core 40, which has received the data address, output address, and write address as input, outputs (transfers) the data address, output address, and write address to the data output unit 55 of the last stage arithmetic core 50 via the transfer bus 90. 【0035】 As a result, data addresses are shared between the data output sections of the arithmetic cores 20, 30, 40, and 50 via the transfer bus 90, and the convolution process is performed using the same data addresses in each arithmetic core. Furthermore, the output memory of the arithmetic cores 20, 30, 40, and 50 receives the output address via the transfer bus 90 and the data output section of each arithmetic core. This allows the output address to be shared among the output memories, and output processing is performed using the same output address in each arithmetic core. Furthermore, the write address is input to the calculation result output units of the calculation cores 20, 30, 40, and 50 via the transfer bus 90 and the data output unit and output memory of each calculation core. As a result, the write address is shared among the calculation result output units, and the write process is executed using the same write address in each calculation core. 【0036】 As described above, the arithmetic unit that constructs the systolic array of each arithmetic core performs a sum-of-products operation on the input data and weights, as shown in Figure 7, and stores the calculation result in the output memory. 
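The core-to-core forwarding of addresses over the transfer bus 90 can be sketched as a daisy chain in Python. This is a behavioural illustration only: the `Core` class, its field names, and the address pattern produced by `address_generator` are assumptions for the sketch, and each core here simply reads its own data memory at the shared data address.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Core:
    """Behavioural stand-in for one arithmetic core's data output unit."""
    name: str
    data_memory: list
    next_core: Optional["Core"] = None                # downstream core on the bus
    received: list = field(default_factory=list)      # address triples seen
    outputs: list = field(default_factory=list)       # data read from own memory

    def receive(self, data_addr, output_addr, write_addr):
        self.received.append((data_addr, output_addr, write_addr))
        # Use the shared data address on this core's own data memory.
        self.outputs.append(self.data_memory[data_addr])
        # Forward the same three addresses to the next core, if any.
        if self.next_core is not None:
            self.next_core.receive(data_addr, output_addr, write_addr)

def build_chain(memories):
    cores = [Core(f"core{20 + 10 * i}", mem) for i, mem in enumerate(memories)]
    for upstream, downstream in zip(cores, cores[1:]):
        upstream.next_core = downstream
    return cores

def address_generator(length):
    """Hypothetical address pattern: walk the memory in reverse order.
    Exists only in the foremost core."""
    for addr in reversed(range(length)):
        yield addr, addr, addr   # (data, output, write) placeholders

cores = build_chain([[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]])
for addresses in address_generator(3):
    cores[0].receive(*addresses)   # only the foremost core is driven directly
```

Every core sees an identical address sequence while reading different data, so a single generator is sufficient for the whole chain.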
The arithmetic unit 26, which constructs the systolic array of the foremost arithmetic core 20, performs a sum-of-products operation on the weights stored in the weight memory 21 and the input data stored in the data memory 22, and stores the result of the sum-of-products operation in the output memory 27. The arithmetic unit 36 of the subsequent arithmetic core 30 performs a sum-of-products operation on the weights transferred from the arithmetic unit 26 and the input data stored in the data memory 32, and stores the result in the output memory 37. The arithmetic units that construct the systolic array of each arithmetic core may, if necessary, also add a value called a bias term in addition to performing the sum-of-products operations. At this time, the weights input to each arithmetic core are the same, and weight data is input to the arithmetic units every clock cycle. Furthermore, the input data input to each arithmetic core corresponds to the low-resolution segmented images p1 to p4 shown in Figure 2(b), and the input data is input to the arithmetic units every clock cycle. 【0037】 The results of the sum-of-products operations performed by the arithmetic cores 20 and 30 in Figure 7 are as follows. As an example, in the foremost arithmetic core 20, the results
D11*W11+D12*W12+D13*W13, D11*W21+D12*W22+D13*W23, D11*W31+D12*W32+D13*W33
D21*W11+D22*W12+D23*W13, D21*W21+D22*W22+D23*W23, D21*W31+D22*W32+D23*W33
D31*W11+D32*W12+D33*W13, D31*W21+D32*W22+D33*W23, D31*W31+D32*W32+D33*W33
are written to the output memory 27. In the subsequent arithmetic core 30, the results
D41*W11+D42*W12+D43*W13, D41*W21+D42*W22+D43*W23, D41*W31+D42*W32+D43*W33
D51*W11+D52*W12+D53*W13, D51*W21+D52*W22+D53*W23, D51*W31+D52*W32+D53*W33
D61*W11+D62*W12+D63*W13, D61*W21+D62*W22+D63*W23, D61*W31+D62*W32+D63*W33
are written to the output memory 37. It can be seen that the same results are obtained as in the case of the sum-of-products operation shown in Figure 4. 
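The results listed in paragraph 【0037】 can be reproduced with a small numerical sketch. It only illustrates the arithmetic (each output is the sum of products of one data row Di1..Di3 with one weight row Wj1..Wj3) and does not model the clocked data flow of the systolic array. The function name `core_sum_of_products` and the numeric values are assumptions chosen for illustration.

```python
def core_sum_of_products(data_rows, weight_rows):
    """Return result[i][j] = sum over k of D[i][k] * W[j][k] for one arithmetic core."""
    return [
        [sum(d * w for d, w in zip(d_row, w_row)) for w_row in weight_rows]
        for d_row in data_rows
    ]


# The same weights are used by every core (stored only once, in weight memory 21).
weights = [
    [1, 0, 0],  # W11, W12, W13
    [0, 1, 0],  # W21, W22, W23
    [1, 1, 1],  # W31, W32, W33
]

# Input data held in each core's own data memory (illustrative values).
data_core_20 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]               # D11..D33
data_core_30 = [[10, 11, 12], [13, 14, 15], [16, 17, 18]]      # D41..D63

output_memory_27 = core_sum_of_products(data_core_20, weights)
output_memory_37 = core_sum_of_products(data_core_30, weights)

print(output_memory_27)  # [[1, 2, 6], [4, 5, 15], [7, 8, 24]]
print(output_memory_37)  # [[10, 11, 33], [13, 14, 42], [16, 17, 51]]
```

Both cores are called with the same `weights` object, which reflects the point of the embodiment: only the input data differs between cores, so the weights need to be stored once.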
【0038】 Furthermore, between the arithmetic unit that constructs the systolic array of each arithmetic core and the output memory, there is a processing block (not shown) that performs pooling and activation function processing. The control circuit 10 then issues instructions to the processing block according to the learned model, performing pooling and activation function processing on the processing results of each arithmetic unit, and stores the results as arithmetic results in the output memory. Finally, the control circuit 10 reads the calculation results from each output memory according to the trained model and generates high-resolution segmented images P1 to P4 shown in Figure 2(c). Furthermore, the control circuit 10 combines the segmented images P1 to P4 to generate the image P0 shown in Figure 2(d), which is a high-resolution version of the image p0 shown in Figure 2(a). 【0039】 As described above, the arithmetic processing unit 100 of this embodiment focuses on the fact that each processing core performs super-resolution processing using the same trained model, and therefore shares the weights stored in the weight memory 21 of the frontmost processing core 20 with the subsequent processing cores 30, 40, and 50. Accordingly, in the arithmetic processing unit 100, only the frontmost processing core 20 is equipped with a weight memory and a weight output unit. Furthermore, the arithmetic processing unit 100 shares the weights input to the foremost arithmetic core 20 among the subsequent arithmetic cores, and in response, the various addresses generated by the address generator 23 of the foremost arithmetic core 20 are shared among the subsequent arithmetic cores 30, 40, and 50. Accordingly, in the arithmetic processing unit 100, only the foremost arithmetic core 20 is equipped with an address generator. 
In contrast, in an arithmetic processing unit 200 that simply applies super-resolution processing, all processing cores are equipped with an address generator 83, a weight memory 81, and a weight output unit 84. The arithmetic processing unit 100 can therefore reduce the computing resources required for super-resolution processing using inference processing. 【0040】 Figure 8 is a flowchart showing the processes performed by the arithmetic processing unit. The flowchart in Figure 8 illustrates the super-resolution processing performed by each unit of the arithmetic processing unit 100. In step S101, the address generator 23 of the arithmetic core 20 generates various addresses using the address generation information input by the control circuit 10. In step S102, the address generator 23 outputs a weight address to the weight output unit 24. In step S103, the address generator 23 outputs a data address to the data output unit 25. In step S104, the data output unit 25 outputs the data address to the data output unit 35 of the subsequent processing core 30. Furthermore, the data output unit 35 of the processing core 30 and the data output unit 45 of the subsequent processing core 40 each output the data address to the data output unit of the subsequent processing core. The data output unit 55 included in the last processing core 50 has no subsequent processing core and therefore does not output the data address any further. 【0041】 In step S105, the control circuit 10 stores the weights to be used for the convolution process in the weight memory 21 of the arithmetic core 20. In step S106, the control circuit 10 stores the input data used for the convolution process in the output memory 27. At this time, the control circuit 10 also stores the input data in the output memories of the other processing cores 30, 40, and 50. The input data corresponds to the low-resolution segmented images p1 to p4 shown in Figure 2(b). 
The processes S105 and S106 described above may be performed simultaneously, or they may be executed in an order determined by the user as appropriate. 【0042】 In step S107, the weight output unit 24 of the arithmetic core 20 reads the weights from the weight memory 21 according to the weight addresses and outputs the weights to the foremost arithmetic unit 26 in the vertical direction. These weights are then transferred from the arithmetic unit 26 to each of the last arithmetic units of the arithmetic core 50 via the vertical arrow paths between the subsequent arithmetic cores. In step S108, the data output unit 25 of the arithmetic core 20 reads input data from the connected data memory 22 according to the data address and outputs the input data to the foremost arithmetic unit in the lateral direction. During the convolution process, the input data is transferred laterally from the foremost arithmetic unit to each of the last arithmetic units. The same applies to the other arithmetic cores. The order in which processes S107 and S108 are executed may be determined by the user as appropriate. 【0043】 In step S109, the arithmetic unit 26 performs a sum-of-products operation using the input weights and input data, and stores the calculation result in the connected output memory 27. More specifically, each arithmetic unit performs a sum-of-products operation using the weights and input data that are transferred every clock cycle. The same applies to the arithmetic units that construct the systolic arrays of the subsequent arithmetic cores. The arithmetic unit 36 of the arithmetic core 30 performs a sum-of-products operation using the weights input from the arithmetic unit 26 and the input data stored in the data memory 32, and stores the calculation result in the connected output memory 37. 【0044】 In step S110, the control circuit 10 determines whether the convolution process has been completed in the arithmetic core 20. 
In this embodiment, completion of the convolution process means that, as shown in Figure 7, each arithmetic unit has stored the 3x3 calculation result in the connected output memory. More specifically, it means that the element-wise sum-of-products operation is completed for the kernel used in the convolution process and the numerical data of a partial image (a part of the image region included in the divided image) of the same size as the kernel. Note that the convolution process may be considered complete each time an arithmetic unit outputs one calculation result, or it may be considered complete when any other arbitrary process is completed. The control circuit 10 determines whether the convolution process has been completed in the other processing cores as well. If the control circuit 10 determines that the convolution process is not complete (No in step S110), it causes the processing in steps S107 to S109 to be executed again by the arithmetic core. 【0045】 Since each arithmetic core executes steps S104 to S107 at each clock timing, the arithmetic core executes steps S104 to S107 as appropriate at the timing of the next clock. When the control circuit 10 determines that the convolution process is complete (Yes in step S110), in step S111, the arithmetic core outputs and stores the output addresses in each output memory. In the case of the arithmetic core 20, when an output address is input to the output memory 27, it outputs the calculation result to the connected calculation result output unit 28 according to the output address. The output memory 37 of the processing core 30, the output memory 47 of the processing core 40, and the output memory 57 of the processing core 50 each perform the same processing as the output memory 27. The output address is input to output memory 27 via the data output unit 25. 
Additionally, the output addresses are input to output memory 37, output memory 47, and output memory 57 via the transfer bus 90 and their respective data output units. Furthermore, the processing in step S111 may be executed at any appropriate time, provided that it is completed before the processing in step S112. 【0046】 In step S112, the control circuit 10 outputs the calculation result to the data memory. To do this, the control circuit 10 outputs the write address to the calculation result output unit 28 of the calculation core 20 via the output memory 27. When a write address is input, the calculation result output unit 28 outputs the calculation result to the specified data memory 22 according to the write address. The calculation result output units 38, 48, and 58, to which the write address has been transferred, each perform the same processing as the calculation result output unit 28. As a result, each calculation result is read from the output memory in each calculation core and input to each calculation unit as new input data for the next calculation process. 【0047】 The control circuit 10 reads the processing results of steps S101 to S112 from the output memory of each processing core according to the trained model and generates high-resolution segmented images P1 to P4 shown in Figure 2(c). Furthermore, the control circuit 10 combines the segmented images P1 to P4 to generate the image P0 shown in Figure 2(d), which is a high-resolution version of the image p0 shown in Figure 2(a). The timing for reading the results of steps S101 to S112 from each output memory according to the trained model may be, for example, when the super-resolution processing of each segmented image is completed after repeatedly executing the above steps S101 to S112 while striding the kernel. 
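The repetition of the sum-of-products operation "while striding the kernel" can be sketched as an ordinary strided convolution. The sketch assumes a single-channel image and no padding, neither of which is stated in the description, and the function name `convolve_valid` and the numeric values are hypothetical. Each output value is the element-wise sum of products between the kernel and a partial image of the same size, as described for step S110.

```python
def convolve_valid(image, kernel, stride=1):
    """Slide the kernel over the image and return the sums of products (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    result = []
    for top in range(0, ih - kh + 1, stride):
        row = []
        for left in range(0, iw - kw + 1, stride):
            # Element-wise sum of products for one kernel position.
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[top + i][left + j] * kernel[i][j]
            row.append(acc)
        result.append(row)
    return result


# Illustrative 4x4 segmented image and a 3x3 kernel of ones.
segment = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

print(convolve_valid(segment, kernel))  # [[54, 63], [90, 99]]
```

In the arithmetic processing unit 100, each arithmetic core would run this kind of operation on its own segmented image (p1 to p4) with the same kernel weights.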
【0048】 The above explanation assumes a configuration in which the calculation result stored in output memory is written by the calculation result output unit according to the write address and stored in the designated data memory; however, the system is not limited to this configuration. By directly connecting the output of the arithmetic unit that constructs the systolic array in each arithmetic core to the data memory, the output of the arithmetic core can be directly input to the data memory without providing an output memory and an arithmetic result output unit. In this case, a block that performs pooling and activation function processing is actually inserted between the output of the arithmetic unit and the data memory, and the data after these processes is stored in the data memory. This embodiment reduces the number of address generators and the amount of weight memory used in the arithmetic processing unit by configuring the arithmetic processing unit to be specialized for super-resolution processing. 【0049】 The arithmetic processing unit 100 of this embodiment can be applied to various devices. For example, the arithmetic processing unit 100 can be applied to a gaming machine. Figure 9 shows a schematic diagram of the gaming machine's configuration. As shown in Figure 9, the gaming machine U comprises a main control board (main board) 1A, a performance control board (sub-board) 2A, an external storage device 3, and a display device 4. The main control board 1A is equipped with the main control unit (main CPU) 1. The performance control board 2A is equipped with the performance control device 2 (sub-CPU). Game machine U is, for example, a pachinko game machine that uses game balls as the game medium. 
The main control board 1A and the performance control board 2A, the performance control board 2A and the external storage device 3, and the performance control board 2A and the display device 4 are each connected so that they can communicate with each other. The performance control device 2 executes various performances of the gaming machine U based on commands input from the main control board 1A. 【0050】 The external storage device 3 is connected to a memory interface (not shown) provided in the performance control device 2, and is, for example, an SSD (Solid State Drive) or ROM. The external storage device 3 stores various performance data used for the performances of the gaming machine U, as well as the boot loader and basic software such as the OS (Operating System) for starting the performance control device 2. The performance control device 2 reads image data from the external storage device 3 in response to a command from the main control device 1 and draws the image. The performance control device 2 uses the drawn image to create a display image to be shown on the display device 4. The performance control device 2 outputs the display image to the display device 4 and displays the image on the display device 4. 【0051】 Figure 10 shows the configuration of the performance control device. In Figure 10, the performance control device 2 comprises at least an overall control CPU 110, a storage device 111, an image processing circuit 112, and an input/output interface 113. The components are connected by a bus 114. The external storage device 3 is connected to the performance control device 2 via the input/output interface 113. Additionally, the display device 4 is connected to the image processing circuit 112 via an interface (not shown). The overall control CPU 110 controls the entire performance control device 2. 
The overall control CPU 110 receives commands from the main control unit 1 and determines the content of the performance according to the commands. The overall control CPU 110 controls the image processing circuit 112 and controls the display of images to realize the determined performance content. The storage device 111 is RAM and is used as a work area for the image processing circuit 112 and the overall control CPU 110. 【0052】 The image processing circuit 112 reads image data for performances from the external storage device 3 in response to instructions from the overall control CPU 110, draws the image data, and displays it on the display device 4. The image processing circuit 112 incorporates the arithmetic processing unit 100 of this embodiment and can apply, for example, super-resolution processing when displaying image data on the display device 4. Furthermore, the performance control device 2, which has the configuration shown in Figure 10 and incorporates the arithmetic processing unit 100, can be used not only in gaming machines but also in other amusement machines, game machines, digital signage, and other general equipment that has a display device. It can also be incorporated into a monitor device that connects to an external device for display. In that case, the main control unit 1 is not included in the configuration shown in Figure 10, and image data input from an external device such as a PC connected via the input/output interface 113 can be displayed after, for example, super-resolution processing is applied to it. 【0053】 The present invention is not limited to the embodiment described above, and various configurations or embodiments can be adopted without departing from its spirit. 
[Explanation of Symbols] 【0054】 100 Arithmetic Processing Unit, 10 Control Circuits, 20 Arithmetic Cores, 22 Data Memory, 23 Address Generator, 24 Weight Output Unit, 25 Data Output Unit, 26 Arithmetic Unit, 27 Output Memory, 28 Arithmetic Result Output Unit, 30 Arithmetic Cores, 32 Data Memory, 35 Data Output Unit, 36 Arithmetic Unit, 37 Output Memory, 38 Arithmetic Result Output Unit, 40 Arithmetic Cores, 42 Data Memory, 45 Data Output Unit, 46 Arithmetic Unit, 47 Output Memory, 48 Arithmetic Result Output Unit, 50 Arithmetic Cores, 52 Data Memory, 55 Data Output Unit, 56 Arithmetic Unit, 57 Output Memory, 58 Arithmetic Result Output Unit, 60 Transfer Bus, 90 Transfer Bus, 200 Arithmetic Processing Unit, 70 Control Circuits, 81 Weight Memory, 82 Data Memory, 83 Address Generator, 84 Weight Output Unit, 85 Data Output Unit, 86 Arithmetic Unit, 87 Output Memory, 88 Calculation result output unit, 93 Address generator

Claims

[Claim 1] An arithmetic processing unit comprising multiple arithmetic cores, each including multiple arithmetic units arranged in a predetermined shape, that perform convolution processing using the same trained model, wherein the multiple arithmetic units included in each arithmetic core output weights input from the preceding arithmetic unit to the subsequent arithmetic unit and perform arithmetic processing using the input weights, the arithmetic processing unit comprising: a first storage unit that is provided only in the foremost arithmetic core and stores the weights to be transferred to each of the arithmetic units; and a first bus connecting a preceding arithmetic core and a subsequent arithmetic core, wherein an arithmetic unit included in the preceding arithmetic core outputs the input weights to the foremost arithmetic unit included in the subsequent arithmetic core via the first bus.
[Claim 2] The arithmetic processing unit according to claim 1, wherein the last arithmetic unit included in the preceding arithmetic core outputs the input weights via the first bus.
[Claim 3] The arithmetic processing unit according to claim 1, wherein the foremost arithmetic core comprises: a generation unit that generates a first address indicating the order in which the weights stored in the first storage unit are transferred to each of the arithmetic units; and a first output unit that reads the weights from the first storage unit according to the first address and outputs the weights to the foremost arithmetic unit included in the foremost arithmetic core.
[Claim 4] The arithmetic processing unit according to claim 3, wherein the first address indicates the order of the weights to be output to one of the arithmetic cores.
[Claim 5] The arithmetic processing unit according to claim 1, wherein each arithmetic core further comprises a second storage unit that stores input data to be transferred between the arithmetic units in a second direction different from a first direction, the first direction being the transfer direction of the weights in the predetermined shape; the foremost arithmetic core comprises a generation unit that generates a second address indicating the order in which the input data stored in the second storage unit is transferred to each of the arithmetic units; each arithmetic core further comprises a second output unit that reads the input data from the second storage unit according to the second address and outputs the input data to the foremost arithmetic unit in the second direction included in that arithmetic core; the arithmetic processing unit further comprises a second bus connecting the second output unit included in the preceding arithmetic core and the second output unit included in the subsequent arithmetic core; and the second output unit included in the preceding arithmetic core outputs the input second address to the second output unit included in the subsequent arithmetic core via the second bus.
[Claim 6] The arithmetic processing unit according to claim 1, wherein each arithmetic core further comprises a third storage unit that stores the calculation results of each of the arithmetic units and outputs the calculation results according to a third address indicating the order in which the calculation results are output; the arithmetic processing unit further comprises a generation unit that is provided in the foremost arithmetic core and generates the third address indicating the order in which the calculation results stored in the third storage unit are output, and a second bus connecting the preceding arithmetic core and the subsequent arithmetic core; and each arithmetic core outputs the input third address to the subsequent arithmetic core via the second bus.
[Claim 7] The arithmetic processing unit according to claim 1, wherein each arithmetic core further comprises a third storage unit that stores the calculation results of each of the arithmetic units, and a third output unit that outputs the calculation results according to a fourth address indicating the destination to which the calculation results are to be written; the foremost arithmetic core comprises a generation unit that generates the fourth address; the arithmetic processing unit further comprises a second bus connecting the preceding arithmetic core and the subsequent arithmetic core; and each arithmetic core outputs the input fourth address to the subsequent arithmetic core via the second bus.
[Claim 8] An arithmetic processing method for an arithmetic processing unit comprising multiple arithmetic cores, each including multiple arithmetic units, that perform convolution processing using the same trained model, wherein: the multiple arithmetic units included in each arithmetic core output weights input from the preceding arithmetic unit to the subsequent arithmetic unit and perform arithmetic processing using the input weights; the weights to be transferred to each of the arithmetic units are stored in a storage unit provided only in the foremost arithmetic core; and any of the arithmetic units included in a preceding arithmetic core outputs the input weights to the foremost arithmetic unit included in a subsequent arithmetic core via a first bus connecting the preceding arithmetic core and the subsequent arithmetic core.