Inference processing device and inference processing method
By dividing video frames into tiles and adjusting bit width and decimal point positions based on correlation, the method improves CNN inference accuracy and throughput in video data processing.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NTT INC
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-25
AI Technical Summary
Existing CNN inference processing in video data on dedicated hardware faces challenges in balancing inference accuracy and throughput due to fixed bit-width settings, which are inadequate in scenarios with low correlation between frames, leading to reduced calculation accuracy or throughput.
The proposed method divides video frames into tiles, determines high-correlation positions using motion search, and adjusts bit width and decimal point positions based on tile-specific value ranges and correlation values to improve accuracy and throughput.
This approach enhances inference accuracy and throughput by dynamically adjusting bit width and decimal point positions, addressing the limitations of fixed bit-width settings in low-correlation scenarios.
Description
Inference processing device and inference processing method
[0001] The disclosed technologies relate to inference processing devices and inference processing methods.
[0002] In a Convolutional Neural Network (CNN), the network model consists of multiple layers, and the convolutional layer performs convolution. The convolution process takes input feature maps from previous layers and kernels (weight coefficients) as input. For each input channel (ich), the input feature map and kernel are repeatedly multiplied in a two-dimensional plane to perform a convolution operation. The results are then accumulated and added across all input channels to generate a convolution result for a single output channel. A bias value is added to this result, followed by activation function processing. This yields an output feature map for that single output channel. Performing these processes for all output channels generates the output feature map for one convolutional layer. The generated output feature map becomes the input feature map for the next layer. When performing CNN inference on video, the above CNN processing is applied to each image (frame) that makes up the video.
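The per-output-channel computation described above can be sketched as follows. This is a minimal illustration, not the circuit of the disclosure: the function name, the nested-list layout, stride 1 without padding, and the use of ReLU as the activation function are all assumptions made for the example.

```python
def conv2d_single_output(fmap, kernels, bias, activation=lambda v: max(v, 0)):
    """Compute one output channel.

    fmap:    input feature map as [ich][H][W]
    kernels: kernel for this output channel as [ich][kh][kw]
    Assumptions (not from the source): stride 1, no padding, ReLU activation.
    """
    ich = len(fmap)
    h, w = len(fmap[0]), len(fmap[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            acc = 0
            # Multiply in the 2D plane and accumulate across all input channels.
            for c in range(ich):
                for dy in range(kh):
                    for dx in range(kw):
                        acc += fmap[c][y + dy][x + dx] * kernels[c][dy][dx]
            # Add the bias, then apply the activation function.
            row.append(activation(acc + bias))
        out.append(row)
    return out
```

Repeating this for every output channel would give the output feature map of one convolutional layer.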
[0003] When implementing the inference process of CNN on dedicated hardware such as LSI (Large Scale Integration) and FPGA (Field Programmable Gate Array), in order to reduce the circuit scale of the arithmetic circuit, the convolution arithmetic circuit is often composed of fixed-point int-type multiply-accumulate operations. For example, when convolution operations are processed software-wise on a general-purpose CPU (Central Processing Unit) or the like, they are often multiply-accumulated in 32-bit float-type floating-point numbers (fp32), but in dedicated hardware for CNN inference, they may be multiply-accumulated in 8-bit int-type fixed-point numbers (int8). In this case, the conventional convolution arithmetic circuit 110 shown in FIG. 11 includes a set of int8-bit × int8-bit multiply-accumulate arithmetic circuits. In the example of FIG. 11, the multiply-accumulate arithmetic circuit multiplies an int8 input feature map and an int8 kernel, and temporarily stores the result in the accumulation memory 132, for example, in 16 bits. In the next multiplication, multiplication of int8s is performed again. The previous int16 result is read from the accumulation memory 132, added to the current result, and then stored in the accumulation memory 132 again. This is repeated to execute the multiply-accumulate operation. After all the multiply-accumulate operations are completed, the int16 multiply-accumulate result is bit-reduced to an appropriate number of bits, in the example of FIG. 11, 8 bits, and then output.
[0004] In this bit-width reduction, the decimal point position (from which lower bit position to extract) and the reduced bit width (how many bits in total to extract) affect the inference accuracy of the CNN inference process. In the example of FIG. 11, the decimal point position is the fourth bit from the lower position, and the reduced bit width is 8 bits. Adjusting the decimal point position affects the inference accuracy. Also, adjusting the bit width, especially increasing the reduced bit width, improves the inference accuracy while increasing the arithmetic scale of the multiply-accumulate operation of the convolution operation, which leads to a decrease in the inference processing speed (throughput). Therefore, in order to balance the inference accuracy and throughput in CNN hardware, it is necessary to reduce the bit width to an appropriate value.
[0005] A technique has been proposed to adjust the decimal point position to a range close to the representation of the parameter value by controlling the decimal point position for each layer of the CNN (see, for example, Non-Patent Document 1). Furthermore, when processing each frame contained in the video sequentially, a technique has been proposed to dynamically determine the range adjustment for each layer of each frame based on the result of the previous frame, based on the characteristic that the difference in pixel values between the frame to be processed and the previous frame is small (see, for example, Patent Document 1). In the method described in Patent Document 1, an upper limit counter is provided for the data values of the past frame, which counts the number of data values that exceed the maximum value of the quantization step width, and a lower limit counter is provided for the number of data values that fall below the minimum value. In the method described in Patent Document 1, a range of values is selected such that the values of these counters do not exceed a threshold, and the decimal point position is adjusted to the optimal value for each frame.
[0006] In Patent Document 1, if adjusting only the decimal point position is insufficient, the bit width can be adjusted together with the decimal point position using the values of the upper limit counter and the lower limit counter. This allows the bit width to be increased only when necessary, improving inference accuracy while suppressing a significant decrease in throughput.
[0007] Patent Document 1: International Publication No. 2022 / 003855
[0008] Non-Patent Document 1: Zhisheng Li et al., "Laius: An 8-bit Fixed-point CNN Hardware Inference Engine", 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA / IUCC).
[0009] The above method assumes a high correlation between the frame being processed and past frames in video data, and determines the decimal point position and bit width of the frame being processed based on the count of past frames. Therefore, in situations where the correlation between past frames and the frame being processed is low, such as in fast-moving video or before and after scene changes, the decimal point position and bit width determined based on past frames may not provide adequate inference accuracy or throughput. For example, the set bit width may be insufficient, leading to reduced calculation accuracy, or it may be set to an unnecessarily high bit depth, resulting in reduced throughput. Furthermore, because one bit width is set per CNN layer, it is difficult to improve inference accuracy as it cannot reflect the variation due to the two-dimensional position within the layer.
[0010] The technology disclosed herein has been made in view of the above points, and provides an inference processing device and an inference processing method that can improve inference accuracy regardless of the correlation between the frame to be processed in the video data and past frames.
[0011] The inference processing device of the present disclosure includes: a motion search unit that divides each frame of video data, which includes multiple frames, into a plurality of two-dimensional tiles and, for each tile of the frame to be processed, determines a high-correlation position, which is a position in a past frame that has a high correlation with the tile to be processed, based on a correlation value derived by searching for movement between the tile to be processed in the frame to be processed and a past frame that is earlier than the frame to be processed; and a bit width determination unit that obtains information on the value range of the tile in the past frame that corresponds to the tile to be processed, based on the high-correlation position, and determines the bit width of the tile to be processed after a convolution operation of a convolutional neural network that takes the video data as input, based on the information on the value range.
[0012] Furthermore, the inference processing method of this disclosure includes dividing each frame of video data, which includes multiple frames, into multiple two-dimensional tiles; determining a high-correlation position, which is a position in the past frame with a high correlation to the tile being processed, based on a correlation value derived by searching for the movement between the tile being processed in the frame being processed and past frames prior to the frame being processed; obtaining information on the value range of the tile in the past frame corresponding to the tile being processed, based on the high-correlation position; and determining the bit width of the tile being processed after a convolution operation of a convolutional neural network using the video data as input, based on the information on the value range.
[0013] The technology disclosed herein makes it possible to improve inference accuracy regardless of the correlation between the frame being processed in the video data and past frames.
[0014] Figure 1 is a block diagram showing an example of the hardware configuration of the inference processing device of the embodiment. Figure 2 is a functional block diagram showing an example of the configuration of the inference processing device of the embodiment. Figure 3 is a diagram for explaining bit width shift processing. Figure 4 is a diagram for explaining the reference destination of the current frame. Figure 5 is a diagram for explaining the reference destination of the current frame. Figure 6 is a diagram for explaining the relationship between the difference value and the correlation judgment threshold. Figure 7 is a diagram for explaining the expansion of the influence range of motion search / compensation by kernel convolution. Figure 8 is a diagram for explaining the correspondence of the influence by kernel convolution. Figure 9 is a diagram for explaining the case where the accuracy of the motion vector is on a pixel basis. Figure 10 is a diagram for explaining the case where the reference destination spans multiple tiles. Figure 11 is a diagram for explaining the prior art.
[0015] An example of an embodiment of the disclosed technology will be described below with reference to the drawings. In each drawing, identical or equivalent components and parts are given the same reference numerals. Furthermore, the dimensional ratios in the drawings are exaggerated for illustrative purposes and may differ from actual ratios.
[0016] The inference processing device according to this embodiment is a device that performs convolutional neural network inference processing on each frame of video data. Figure 1 is a block diagram showing an example of the hardware configuration of the inference processing device 10 according to this embodiment. As shown in Figure 1, the inference processing device 10 includes a CNN (Convolutional Neural Network) arithmetic circuit 20, an external memory 22, a CPU (Central Processing Unit) 24, and a communication I / F (Interface) 26. The external memory 22 stores kernel data, feature map data, and the like.
[0017] The CNN arithmetic circuit 20 performs CNN processing such as convolution using a kernel and the like transferred from the external memory 22. The CNN arithmetic circuit 20 in this embodiment is hardware designed specifically for CNN. The CNN arithmetic circuit 20 is configured as a dedicated electrical circuit, which is a processor having a circuit configuration specifically designed to execute a particular process, such as a PLD (Programmable Logic Device) such as an FPGA (Field-Programmable Gate Array) whose circuit configuration can be changed after manufacturing, or an ASIC (Application Specific Integrated Circuit).
[0018] The CPU 24 performs processing other than CNN processing by executing a predetermined program. For example, the CPU 24 in this embodiment performs bit width determination and decimal point position determination. The predetermined program is stored in, for example, the external memory 22. Unlike in this embodiment, the inference processing device 10 may be configured without the CPU 24. In this case, for example, the CNN arithmetic circuit 20 may perform all processing, including bit width determination and decimal point position determination, in addition to CNN processing. Alternatively, for example, an external device (not shown) connected via the communication I / F 26 may perform processing including bit width determination and decimal point position determination.
[0019] The communication interface 26 is an interface for communicating with external devices, and standards such as Ethernet®, FDDI (Fiber Distributed Data Interface), and Wi-Fi® are used. The communication interface 26 is used, for example, when initially transferring kernel data to the external memory 22 before inference execution.
[0020] The CNN arithmetic circuit 20, external memory 22, CPU 24, and communication interface 26 are interconnected via a bus 29, such as a system bus or control bus, so that they can communicate with each other.
[0021] The specific configuration of the inference processing device 10 of this embodiment will be described with reference to Figure 2. Figure 2 is a block diagram showing an example of the configuration relating to the CNN arithmetic circuit 20 and to the bit width determination by the CPU 24 of this embodiment.
[0022] As shown in Figure 2, the CNN arithmetic circuit 20 includes a convolution operation unit 30, a convolution operation result generation unit 40, an activation function unit 50, and a motion search unit 60.
[0023] The convolution operation unit 30 takes an input feature map and a kernel as inputs and outputs a primary result of the convolution operation. The convolution operation unit 30 has an accumulation memory 32. It repeatedly performs multiply-accumulate operations on the input feature map and kernel; after each operation it temporarily stores the result in the accumulation memory 32, and during the next multiply-accumulate operation it reads the previous result and adds it. The bit width of the accumulation memory 32 is set to be larger than either of the two bit widths input to the multiply-accumulate operation, specifically at least the sum of the bit widths of the input feature map and the kernel. For example, if the input feature map is 8 bits and the kernel is 8 bits, the bit width of the accumulation memory 32 is set to 16 bits. If a value exceeds the bit width of the accumulation memory 32 during the repeated multiply-accumulate operations, it is clipped so that it fits. For example, 16 bits can represent values from -32768 to 32767, so if the multiply-accumulate result becomes 40000, it is clipped to 32767 and written to the accumulation memory 32. Once all accumulations are complete, the convolution operation unit 30 outputs the data in the accumulation memory 32 as the primary result of the convolution operation.
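The clipped accumulation described above can be illustrated with a short sketch. The function name and the list-based interface are hypothetical, and the code models only the arithmetic behavior, not the circuit itself.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def saturating_mac(inputs, weights, acc_min=INT16_MIN, acc_max=INT16_MAX):
    """Multiply-accumulate with clipping after every step (illustrative model)."""
    acc = 0
    for a, w in zip(inputs, weights):
        acc += a * w                           # add the new product to the stored result
        acc = max(acc_min, min(acc_max, acc))  # clip to the accumulation memory range
    return acc
```

For example, accumulating 127 × 127 three times would exceed 32767, so the stored value stays at 32767 from that step on.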
[0024] The convolution operation result generation unit 40 takes the primary result of the convolution operation output from the convolution operation unit 30 as input and outputs the final result of the convolution operation. The convolution operation result generation unit 40 has a bit width reduction unit 42 and a saturation counter unit 44. The bit width reduction unit 42 performs a shifting process so that the primary result of the convolution operation fits within a predetermined bit width. As an example, Figure 3 shows a case where the data in the accumulation memory 32 is 16 bits, the bit width is 8 bits, and the decimal point position is specified as the 3rd bit from the least significant bit (LSB) of the 16 bits. In this case, the bit width reduction unit 42 right-shifts the 16-bit data by 3 bits to create 8-bit data. At this time, the final result of the convolution operation may exceed the upper limit or fall to the lower limit; these cases are called upper limit saturation and lower limit saturation, respectively. For 8-bit data, upper limit saturation occurs when the result falls outside the range from -128 to 127, and lower limit saturation occurs when the result is 0.
[0025] If neither upper limit saturation nor lower limit saturation occurs, the bit width reduction unit 42 removes the surplus higher-order bits. For example, if the result after >>3 is +101100110, the upper two bits are removed to give +1100110. If upper limit saturation occurs, the value is clipped to the upper limit (-128 or 127 in the case of 8 bits). If lower limit saturation occurs, the value is set to the lower limit (0 in the case of 8 bits).
[0026] The saturation counter unit 44 counts the number of times upper limit saturation and lower limit saturation occur during bit reduction in the bit width reduction unit 42, and outputs these as the upper limit count and the lower limit count, respectively. This count data is used to determine the bit width for the next frame. The upper limit count and lower limit count counted by the saturation counter unit 44 in this embodiment are an example of the information regarding the tile value range of this disclosure.
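The shift, clip, and count behavior of the bit width reduction and saturation counting can be sketched as below. This is a simplified model under stated assumptions: the function names are hypothetical, the shift truncates toward zero (so small negative values also reduce to 0), and every result equal to 0 is counted as lower limit saturation.

```python
def reduce_bit_width(acc, shift=3, width=8):
    """Return (reduced value, saturation kind) for one accumulated value.

    Assumption: the shift truncates toward zero; hardware may round differently.
    """
    upper = (1 << (width - 1)) - 1   # 127 for 8 bits
    lower = -(1 << (width - 1))      # -128 for 8 bits
    magnitude = abs(acc) >> shift
    value = magnitude if acc >= 0 else -magnitude
    if value > upper:
        return upper, "upper"        # clipped to the positive limit
    if value < lower:
        return lower, "upper"        # clipped to the negative limit
    if value == 0:
        return 0, "lower"            # result fell to 0
    return value, None

def count_saturation(accs, shift=3, width=8):
    """Reduce a tile's accumulated values and count both kinds of saturation."""
    outputs, upper_count, lower_count = [], 0, 0
    for acc in accs:
        value, kind = reduce_bit_width(acc, shift, width)
        outputs.append(value)
        if kind == "upper":
            upper_count += 1
        elif kind == "lower":
            lower_count += 1
    return outputs, upper_count, lower_count
```

Calling `count_saturation` once per tile would yield the per-tile counts that feed the bit width decision for the next frame.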
[0027] The activation function unit 50 outputs an output feature map by passing the final result of the convolution operation after bit reduction, output from the convolution operation result generation unit 40, through an activation function.
[0028] The convolution operation unit 30, the convolution operation result generation unit 40, and the activation function unit 50 process each frame of the video data in units of tiles obtained by dividing the frame into a specific two-dimensional size. For example, if one tile is 32x32 and the two-dimensional size of the feature map is 64x64, the feature map is divided into four tiles and processed tile by tile. The upper and lower limit counts of the saturation counter unit 44 are also counted and output on a tile basis. The bit width input to the bit width reduction unit 42 is also set on a tile basis.
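A minimal sketch of the tile division follows. The function name is hypothetical, and numbering the tiles in raster order (left to right, then top to bottom) is an assumption chosen to match the tile numbers used in the figures.

```python
def split_into_tiles(fmap, tile=32):
    """Split a 2D feature map (list of rows) into tile x tile blocks in raster order."""
    h, w = len(fmap), len(fmap[0])
    tiles = []
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            tiles.append([row[tx:tx + tile] for row in fmap[ty:ty + tile]])
    return tiles
```

With a 64x64 feature map and 32x32 tiles this returns four tiles, each of which can then be given its own bit width and its own pair of saturation counts.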
[0029] The motion search unit 60 takes the current frame, which is the frame to be processed, and the previous frame, which is a frame prior to the current frame, as input, searches for movement on a tile-by-tile basis, and performs motion compensation. Figure 4 shows an example in which the frame is divided into 16 tiles of 32 x 32. As shown in Figure 4, the motion search unit 60 of this embodiment determines how each tile in the current frame has been translated from the previous frame and uses that position as the reference destination. In the example shown in Figure 4, the motion search unit 60 determines that tile 0, which is the tile to be processed in the current frame, has been translated from tile 5 in the previous frame, and derives a motion vector (32, 32). The motion search unit 60 derives a correlation value between the current frame and the previous frame and uses it as an indicator to determine the reference destination. As an example of the correlation value, this embodiment uses the difference values between each pixel of the tile to be processed in the current frame and each pixel of a tile-sized region in the previous frame. For example, if the video data is RGB color data, the motion search unit 60 calculates 32 × 32 × 3 = 3072 difference values between tile 0, which is the tile to be processed in the current frame, and the previous frame. The motion search unit 60 then regards the position on the previous frame where the sum of the absolute values of the 3072 difference values (SAD: Sum of Absolute Differences) is smallest as the position with the highest correlation to tile 0 in the current frame (called the high-correlation position). Note that, instead of the difference value, a cost value obtained by adding to the difference value a number of bits uniquely determined by the magnitude of the motion vector may be used as the correlation value. This cost value is frequently used in video encoding processing.
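The search for the high-correlation position can be sketched as an exhaustive comparison of the sum of absolute difference values over candidate positions. This is an illustrative sketch only: the function names are hypothetical, the frames are assumed to be single-channel 2D lists, and the candidates are assumed to be tile-aligned unless a finer `step` is given.

```python
def sum_abs_diff(cur, prev, cx, cy, px, py, size):
    """Sum of absolute differences between a tile in cur and a region in prev."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - prev[py + dy][px + dx])
    return total

def motion_search(cur, prev, cx, cy, size, step=None):
    """Return (motion vector, difference value) for the tile at (cx, cy).

    step is the search granularity in pixels; by default it equals the tile
    size, i.e. tile-precision motion vectors (an assumption for this sketch).
    """
    step = step or size
    h, w = len(prev), len(prev[0])
    best = None
    for py in range(0, h - size + 1, step):
        for px in range(0, w - size + 1, step):
            cost = sum_abs_diff(cur, prev, cx, cy, px, py, size)
            if best is None or cost < best[0]:
                best = (cost, (px - cx, py - cy))
    return best[1], best[0]
```

The returned vector points from the tile in the current frame to the best-matching position in the previous frame, and the returned difference value is what would later be compared with the correlation judgment threshold.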
[0030] The motion search unit 60 outputs, for each tile, a motion vector and a difference value, which is the correlation value between the tile being processed in the current frame and the referenced tile, to the bit width determination unit 70. Note that it suffices for some correlation value to be output to the bit width determination unit 70; for example, the cost value mentioned above may be output instead of the difference value. Alternatively, for example, the motion vector may be determined using the cost value while the difference value is output to the bit width determination unit 70. Conversely, the motion vector may be determined using the difference value while the cost value is output to the bit width determination unit 70.
[0031] The bit width determination unit 70 in the CPU 24 takes as inputs the motion vector and the difference value of the tile to be processed in the current frame, together with the saturation counts for each layer and each tile of the previous frame, and determines the bit width for the tile to be processed in the current layer of the current frame. The bit width determination unit 70 obtains the upper limit saturation count and the lower limit saturation count from the reference destination in the previous frame pointed to by the motion vector of the tile to be processed in the current frame.
[0032] In the example shown in Figure 5, the motion vector is (32, 32), so tile 5 of the previous frame can be considered to have the highest correlation with tile 0, which is the tile to be processed in the current frame. Therefore, the bit width determination unit 70 obtains the upper limit saturation count and the lower limit saturation count calculated by the saturation counter unit 44 for tile 5 of the previous frame. Based on the obtained counts, the bit width determination unit 70 determines the bit width for the tile to be processed in the current layer of the current frame. Specifically, as in International Publication No. 2022 / 003855, the bit width determination unit 70 compares each count with its threshold and increases the bit width by 1 bit on the side with the larger threshold overrun rate. If only one count exceeds its threshold, the bit width on that side is increased. If both exceed their thresholds, the bit width on the side with the larger overrun rate is increased. Conversely, in the same manner as for the bit width increase described above, the bit width determination unit 70 compares the counts with the thresholds and reduces the bit width if the counts have sufficient margin.
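The threshold comparison can be sketched as follows. The source does not define the overrun rate, the margin, or the allowed range of bit widths, so the function name, the definition of the overrun rate as count divided by threshold, the `margin` factor, and the `min_width` / `max_width` bounds are all assumptions introduced for illustration.

```python
def decide_bit_width(width, upper_count, lower_count, upper_thr, lower_thr,
                     margin=0.5, min_width=4, max_width=16):
    """Return (new bit width, side that was extended or None).

    Assumptions: overrun rate = count / threshold (thresholds > 0); a count
    below margin * threshold is treated as having sufficient margin.
    """
    upper_rate = upper_count / upper_thr
    lower_rate = lower_count / lower_thr
    if upper_rate > 1 or lower_rate > 1:
        # Extend by 1 bit on the side with the larger overrun rate.
        side = "upper" if upper_rate >= lower_rate else "lower"
        return min(width + 1, max_width), side
    if upper_rate < margin and lower_rate < margin:
        # Both counts have sufficient margin: reduce the bit width.
        return max(width - 1, min_width), None
    return width, None
```

Under these assumptions, a tile whose referenced counts are far below both thresholds would gradually give back bits, which is where the throughput benefit would come from.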
[0033] As shown in Figure 6, the bit width determination unit 70 of this embodiment, in order to address the problem that an appropriate bit width cannot be determined due to a low correlation between the previous frame and the current frame, compares the difference value of the tiles to be processed in the current frame, calculated by the motion search unit 60, with a correlation judgment threshold. If the difference value is larger, it is considered that the correlation between the previous frame and the current frame is low, and the bit width derived from the count of the previous frame is considered unreliable, so the bit width is reset (initialized) to a default value, for example, 8 bits, without using the bit width derived from the count of the previous frame. On the other hand, if the difference value is less than or equal to the correlation judgment threshold, it is considered that the correlation between the previous frame and the current frame is high, and the bit width derived from the count of the previous frame is considered highly reliable, so the bit width derived from the count of the previous frame is used as described above.
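The reliability check described above amounts to a single comparison. A minimal sketch, in which the function and parameter names are hypothetical and the default of 8 bits follows the example in the text:

```python
def choose_bit_width(difference, correlation_threshold, derived_width,
                     default_width=8):
    """Use the width derived from the previous frame only when correlation is high."""
    if difference > correlation_threshold:
        return default_width   # low correlation: reset to the default value
    return derived_width       # high correlation: trust the previous frame's counts
```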
[0034] The examples shown in Figures 5 and 6 determine the bit width for the first layer of the CNN (layer number 0: L0), i.e., for the input frame (input image). Since layer 0 is identical to the input frame, the motion vector and difference value obtained by the motion search unit 60 on the input frame can be used as is. On the other hand, in layer 1 and beyond of the CNN, it is necessary to consider the expansion of the influence range of motion search / compensation due to kernel convolution. As an example, Figure 7 shows the influence range when layers L0, L1, L3, and L5 use 1x1 kernels and layers L2, L4, and L6 use 3x3 kernels. Each 3x3 kernel enlarges the influence range by 1 pixel in every direction (up, down, left, and right), so a 32x32 tile in layer L6 is affected by a 38x38 region in layer L0, with 3 pixels added in every direction.
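The growth of the influence range can be computed directly from the kernel sizes. The sketch below assumes square, odd-sized kernels with stride 1 and no pooling between layers; the function name is hypothetical.

```python
def influence_size(tile, kernel_sizes):
    """Side length in layer 0 of the region that influences a tile.

    kernel_sizes lists the kernel size of each layer from L0 up to the layer
    in question. Assumes odd, square kernels, stride 1, and no pooling.
    """
    growth_per_side = sum((k - 1) // 2 for k in kernel_sizes)
    return tile + 2 * growth_per_side
```

For the layer arrangement of Figure 7, `influence_size(32, [1, 1, 3, 1, 3, 1, 3])` gives 38, matching the 38x38 region stated in the text.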
[0035] In this embodiment, considering the expansion of the influence range of motion search / compensation by kernel convolution, a weighted average is taken of the upper and lower saturation counts. As an example, Figure 8 shows the case where the motion vector is (32, 32) in layer 6 of Figure 7. The tile with its influence range taken into account is 38 × 38 and spans 9 tiles, so a weighted average is taken using the ratio of the area in which each of the 9 tiles overlaps the expanded tile, yielding a weighted average upper saturation count and a weighted average lower saturation count. These weighted average counts are compared with the thresholds as described above to determine the bit width for the tile to be processed in the current layer of the current frame. If the difference value of the tile to be processed in the current frame is greater than the correlation judgment threshold, the bit width is initialized regardless of the layer, as in layer 0.
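The overlap areas that serve as weights can be computed as in the following sketch. The function name, the raster-order tile numbering, and the 4 × 4 tile grid (taken from the 16-tile example) are assumptions for illustration.

```python
def overlap_areas(x0, y0, width, height, tile=32, tiles_per_row=4, tiles_per_col=4):
    """Map tile index -> overlap area (in pixels) with the given rectangle.

    (x0, y0) is the top-left corner of the rectangle in the previous frame.
    Tiles are numbered in raster order (an assumption of this sketch).
    """
    areas = {}
    for ty in range(tiles_per_col):
        for tx in range(tiles_per_row):
            ox = min(x0 + width, (tx + 1) * tile) - max(x0, tx * tile)
            oy = min(y0 + height, (ty + 1) * tile) - max(y0, ty * tile)
            if ox > 0 and oy > 0:
                areas[ty * tiles_per_row + tx] = ox * oy
    return areas
```

For the layer 6 example, the 32x32 reference at (32, 32) expanded by 3 pixels on every side is the rectangle `overlap_areas(29, 29, 38, 38)`, which overlaps 9 tiles: 1024 pixels of tile 5, 96 pixels of each edge neighbor, and 9 pixels of each corner neighbor.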
[0036] As described above, the inference processing device 10 of this embodiment determines the bit width for each tile of each frame of the video data. The determined bit width is used when the bit width reduction unit 42 of the convolution operation result generation unit 40 reduces the bit width.
[0037] In the above, the precision of the motion vector was set to tile units, but the precision of the motion vector is not limited to tile units; for example, it may be set to pixel units. It may also be set to fractional pixel precision units. In such cases, as shown in Figure 9, the reference point in past frames may span multiple tiles rather than a single specific tile. In the example shown in Figure 9, the area corresponding to the tile to be processed overlaps with four tiles: tile 0, tile 1, tile 4, and tile 5.
[0038] Figure 10 shows a case where the motion vector is (28, 20), the precision of the motion vector is pixel-level, and the reference destination spans multiple tiles. In such cases, a weighted average is taken based on the ratio of the areas in which the reference destination overlaps each tile. In Figure 10, of the 32x32 area corresponding to the tile to be processed, 12x4 overlaps with tile 0, 20x4 with tile 1, 12x28 with tile 4, and 20x28 with tile 5. Therefore, the upper and lower saturation counts of tile 0, tile 1, tile 4, and tile 5 are weighted and averaged in the ratio 12x4 : 20x4 : 12x28 : 20x28. Using the weighted average upper and lower saturation counts, a threshold comparison is performed as described above to determine the bit width for the tile to be processed in the current layer of the current frame.
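The area-weighted averaging of the counts can be sketched as below, using the overlap areas quoted for Figure 10 as weights. The function name is hypothetical, and the saturation counts in the usage note are made-up values for illustration only.

```python
def weighted_counts(areas, counts):
    """Area-weighted average of (upper, lower) saturation counts.

    areas:  tile index -> overlap area with the reference destination
    counts: tile index -> (upper limit count, lower limit count)
    """
    total = sum(areas.values())
    upper = sum(areas[t] * counts[t][0] for t in areas) / total
    lower = sum(areas[t] * counts[t][1] for t in areas) / total
    return upper, lower

# Overlap areas taken from the Figure 10 example in the text.
FIGURE_10_AREAS = {0: 12 * 4, 1: 20 * 4, 4: 12 * 28, 5: 20 * 28}
```

As a usage example with invented counts, if only tile 5 had saturated, with counts (1024, 512), the weighted result would be (560.0, 280.0), since tile 5 covers 560 of the 1024 pixels.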
[0039] Although the above process only describes the determination of tile-level bit width using motion vectors and difference values, similar to International Publication No. 2022 / 003855, it is possible to simultaneously determine the decimal point position and bit width on a tile-level basis using motion vectors and difference values.
[0040] Alternatively, the magnitude of the motion vector may be determined by calculating its L1 norm or L2 norm. If the magnitude of the motion vector is greater than or equal to a motion vector threshold, the correlation with the previous frame is likely to be low, and the bit width may be initialized accordingly.
[0041] Alternatively, if the magnitude of the motion vector is greater than or equal to the motion vector threshold, the bit width may be left unchanged instead of being initialized. In this case, the bit width will only be changed when the magnitude of the motion vector is less than the motion vector threshold.
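The two variants based on the motion vector magnitude (initializing the bit width, or leaving it unchanged) can be sketched together. The function name, the `norm` and `policy` parameters, and the default of 8 bits are assumptions introduced for this illustration.

```python
import math

def bit_width_after_vector_check(mv, mv_threshold, derived, previous,
                                 default=8, norm="l2", policy="reset"):
    """Pick the bit width based on the magnitude of the motion vector.

    derived:  width derived from the referenced saturation counts
    previous: width currently in use for this tile
    policy:   "reset" initializes the width, "hold" leaves it unchanged
    """
    if norm == "l1":
        magnitude = abs(mv[0]) + abs(mv[1])
    else:
        magnitude = math.hypot(mv[0], mv[1])
    if magnitude >= mv_threshold:
        return default if policy == "reset" else previous
    return derived
```

Which norm and which policy suit a given workload is not stated in the source; the sketch only shows that both variants reduce to the same comparison with a different fallback.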
[0042] Alternatively, instead of taking a weighted average of the upper and lower saturation counts based on the overlap ratio with the referenced tiles, the upper and lower saturation counts of the tile with the largest overlapping area may simply be used.
[0043] Further, as a determination of initialization or non-change of the bit width, instead of using the motion search / compensation result, an image analysis unit prepared separately may be used. For example, the image analysis unit calculates the average and variance of pixel values (luminance values) for each frame, and if the values of the average and variance change significantly between the previous frame and the current frame, and the change in the average and the change in the variance exceed the threshold value, it is highly likely that the correlation is low, and the bit width may be initialized or the change in the bit width may be set to no change.
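A minimal sketch of such an image analysis check follows. The function name and the two separate thresholds are assumptions; the frames are assumed to be given as flat lists of luminance values, and the population variance is used.

```python
from statistics import fmean, pvariance

def is_low_correlation(prev_pixels, cur_pixels, mean_threshold, var_threshold):
    """Flag a frame pair whose mean and variance both change by more than the thresholds."""
    mean_change = abs(fmean(cur_pixels) - fmean(prev_pixels))
    var_change = abs(pvariance(cur_pixels) - pvariance(prev_pixels))
    return mean_change > mean_threshold and var_change > var_threshold
```

When this returns true, the bit width would be initialized or left unchanged, in the same way as for the motion-based checks.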
[0044] As described above, the inference processing device 10 of each of the above embodiments includes a motion search unit 60 and a bit width determination unit 70. The motion search unit 60 divides each frame of video data including a plurality of frames into a plurality of two-dimensional tiles. Further, for each tile of the frame to be processed, the motion search unit 60 searches for the motion between the tile to be processed in the frame to be processed and a past frame that is earlier than the frame to be processed, and derives a correlation value. Furthermore, based on the correlation value, the motion search unit 60 determines a high-correlation position, which is the position in the past frame with a high correlation with the tile to be processed. The bit width determination unit 70 acquires information regarding the value range of the tile of the past frame corresponding to the tile to be processed based on the high-correlation position. Further, based on the information regarding the value range, the bit width determination unit 70 determines, for the tile to be processed, the bit width after the convolution operation of the convolutional neural network that takes the video data as input.
[0045] As described above, according to the inference processing device 10 of the present embodiment, since the bit width is determined in units of tiles obtained by dividing the frame, the bit width can be controlled with fine granularity. Therefore, according to the inference processing device 10 of the present embodiment, the inference accuracy can be improved regardless of the correlation between the frame to be processed and the past frame of the video data.
[0046] Note that, in each of the above embodiments, the processing executed by the CNN arithmetic circuit 20 may be executed by another processor such as the CPU 24. Further, the processing that the CPU 24 executes by loading a predetermined program (software) in each of the above embodiments may be executed by various processors other than a CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, such as an ASIC (Application Specific Integrated Circuit), which is a processor having a circuit configuration designed specifically to execute particular processing. Further, each of the processes executed by the CNN arithmetic circuit 20 and the CPU 24 may be executed by one of these various processors, or by a combination of two or more processors of the same type or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). Further, the hardware structure of these various processors is, more specifically, an electric circuit combining circuit elements such as semiconductor elements.
[0047] Also, all documents, patent applications, and technical standards described in this specification are incorporated herein by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually stated to be incorporated by reference.
[0048] Regarding the above embodiments, the following supplementary notes are further disclosed.
[0049] (Note 1) An inference processing device comprising: a motion search unit that divides each frame of video data containing multiple frames into multiple two-dimensional tiles, and for each tile of the frame to be processed, determines a high-correlation position, which is a position in the past frame that has a high correlation with the tile to be processed, based on a correlation value derived by searching for the movement between the tile to be processed in the frame to be processed and past frames that are earlier than the frame to be processed; and a bit width determination unit that obtains information on the value range of the tile in the past frame that corresponds to the tile to be processed, based on the high-correlation position, and determines the bit width of the tile to be processed after a convolution operation of a convolutional neural network using the video data as input, based on the information on the value range.
[0050] (Note 2) The inference processing device according to Note 1, wherein, when the area corresponding to the tile to be processed at the high-correlation position overlaps a plurality of tiles of the past frame, the bit width determination unit uses, as a weight coefficient for each of the plurality of tiles, the ratio of the overlapping area between that area and the tile, and determines the bit width based on the weight coefficients and the information regarding the value range of each of the plurality of tiles of the past frame.
[0051] (Note 3) The inference processing device according to Note 1 or Note 2, wherein the bit width determination unit determines the bit width based on the value range information obtained based on the high-correlation position when the correlation value is less than or equal to the correlation threshold, and, when the correlation value is greater than the correlation threshold, sets the bit width to a predetermined bit width instead of determining it based on that value range information.
[0052] (Note 4) An inference processing method that includes the following steps: dividing each frame of video data containing multiple frames into multiple two-dimensional tiles; determining a high-correlation position in the past frame that has a high correlation with the tile to be processed, based on a correlation value derived by searching for the movement between the tile to be processed in the frame to be processed and past frames that are earlier than the frame to be processed; obtaining information about the value range of the tile in the past frame that corresponds to the tile to be processed based on the high-correlation position; and determining the bit width of the tile to be processed after a convolution operation of a convolutional neural network using the video data as input, based on the information about the value range.
[0053] (Note 5) A program for causing a computer to function as: a motion search unit that divides each frame of video data containing multiple frames into multiple two-dimensional tiles, and for each tile of the frame to be processed, searches for movement between the tile to be processed in the frame to be processed and past frames prior to the frame to be processed, and determines, based on a correlation value derived from that search, a high-correlation position, which is a position in the past frame that has a high correlation with the tile to be processed; and a bit width determination unit that obtains information on the value range of the tile in the past frame that corresponds to the tile to be processed, based on the high-correlation position, and determines, for the tile to be processed, the bit width after a convolution operation of a convolutional neural network using the video data as input.
[0054] (Note 6) An inference processing device comprising at least one processor, wherein the processor divides each frame of video data, which includes multiple frames, into a plurality of two-dimensional tiles, and for each tile of the frame to be processed, determines a high-correlation position, which is a position in the past frame that has a high correlation with the tile to be processed, based on a correlation value derived by searching for the movement between the tile to be processed in the frame to be processed and past frames that are earlier than the frame to be processed, and obtains information regarding the value range of the tile in the past frame that corresponds to the tile to be processed, based on the high-correlation position, and performs a process to determine the bit width of the tile to be processed after a convolution operation of a convolutional neural network using the video data as input, based on the information regarding the value range.
[0055] (Note 7) A non-transitory storage medium storing a program executable by a computer to perform inference processing, wherein the inference processing divides each frame of video data, which includes multiple frames, into a plurality of two-dimensional tiles, and for each tile of the frame to be processed, determines a high-correlation position, which is a position in the past frame that has a high correlation with the tile to be processed, based on a correlation value derived by searching for movement between the tile to be processed in the frame to be processed and past frames that are earlier than the frame to be processed, and obtains information regarding the value range of the tile in the past frame that corresponds to the tile to be processed, and performs processing to determine, for the tile to be processed, the bit width after a convolution operation of a convolutional neural network using the video data as input, based on the information regarding the value range.
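The overlap weighting and the threshold fallback set out in the second and third of the notes above can be sketched together in Python as follows. This is an illustration under stated assumptions, not the disclosed implementation: the value range information is assumed to be the maximum absolute convolution output recorded per past-frame tile (`past_max_abs`), the weighted combination is assumed to be a weighted average, the bit width is assumed to consist of one sign bit, enough integer bits for that value, and a fixed number of fractional bits, and the correlation value is assumed to be a cost such as SAD for which smaller means more similar. All names and default values are hypothetical.

```python
def overlap_weights(pos_y, pos_x, tile):
    """Map each past-frame tile index (row, col) to the fraction of the
    tile-sized region at (pos_y, pos_x) that lies inside that tile."""
    weights = {}
    row0, col0 = pos_y // tile, pos_x // tile
    for row in (row0, row0 + 1):
        for col in (col0, col0 + 1):
            h = min(pos_y + tile, (row + 1) * tile) - max(pos_y, row * tile)
            w = min(pos_x + tile, (col + 1) * tile) - max(pos_x, col * tile)
            if h > 0 and w > 0:
                weights[(row, col)] = (h * w) / (tile * tile)
    return weights


def required_bit_width(max_abs, frac_bits=4):
    """Sign bit + integer bits needed for max_abs + fixed fractional bits
    (an assumed fixed-point layout)."""
    int_bits = int(max_abs).bit_length()
    return 1 + int_bits + frac_bits


def decide_bit_width(pos_y, pos_x, corr_value, past_max_abs, tile,
                     corr_threshold, default_bits=16):
    """Return the bit width for the tile to be processed."""
    # Poor match (cost above threshold): fall back to the predetermined width.
    if corr_value > corr_threshold:
        return default_bits
    # Otherwise blend the past tiles' value ranges by overlap area.
    weights = overlap_weights(pos_y, pos_x, tile)
    blended = sum(w * past_max_abs[idx] for idx, w in weights.items())
    return required_bit_width(blended)
```

For a region that lies exactly on one past-frame tile, `overlap_weights` returns a single weight of 1.0, so the same code path covers the non-overlapping case.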
[0056]
10 Inference processing apparatus
20 CNN arithmetic circuit
24 CPU
30 Convolution calculation unit
40 Convolution calculation result generation unit
42 Bit width reduction unit
44 Saturation counter unit
60 Motion search unit
70 Bit width determination unit
Claims
1. An inference processing device comprising:
a motion search unit that divides each frame of video data containing multiple frames into multiple two-dimensional tiles, and for each tile of the frame to be processed, determines a high-correlation position, which is a position in the past frame that has a high correlation with the tile to be processed, based on a correlation value derived by searching for the movement between the tile to be processed in the frame to be processed and past frames that are earlier than the frame to be processed; and
a bit width determination unit that obtains information on the value range of the tile in the past frame that corresponds to the tile to be processed, based on the high-correlation position, and determines the bit width of the tile to be processed after a convolution operation of a convolutional neural network using the video data as input, based on the information on the value range.
2. The inference processing device according to claim 1, wherein, when the area corresponding to the tile to be processed at the high-correlation position overlaps a plurality of tiles of the past frame, the bit width determination unit uses, as a weight coefficient for each of the plurality of tiles, the ratio of the overlapping area between that area and the tile, and determines the bit width based on the weight coefficients and the information regarding the value range of each of the plurality of tiles of the past frame.
3. The inference processing device according to claim 1, wherein the bit width determination unit determines the bit width based on the value range information obtained based on the high-correlation position if the correlation value is less than or equal to the correlation determination threshold, and, if the correlation value is greater than the correlation determination threshold, sets the bit width to a predetermined bit width instead of determining it based on that value range information.
4. An inference processing method that includes: dividing each frame of video data containing multiple frames into multiple two-dimensional tiles; determining a high-correlation position in the past frame which has a high correlation with the tile to be processed, based on a correlation value derived by searching for the movement between the tile to be processed in the frame to be processed and past frames prior to the frame to be processed; obtaining information on the value range of the tile in the past frame corresponding to the tile to be processed based on the high-correlation position; and determining the bit width of the tile to be processed after a convolution operation of a convolutional neural network using the video data as input, based on the information on the value range.