Scheduling method, scheduling device, electronic device, and storage medium

The scheduling method optimizes AI chip architectures by replicating valid data rows between computing units, addressing inefficiencies in data parallelism and model parallelism to enhance computing power utilization in multilayer neural networks.

JP7876064B2 · Active · Publication Date: 2026-06-18 · BEIJING YOUZHUJU NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: BEIJING YOUZHUJU NETWORK TECH CO LTD
Filing Date: 2023-09-18
Publication Date: 2026-06-18

Application Information

Patent Timeline

18 Sep 2023: Application
18 Jun 2026: Publication (JP7876064B2)

IPC: G06N3/0464; G06N3/10; G06N3/063

CPC: G06F9/4843; G06N3/063; G06F9/48; G06N3/0464; G06F9/5066; G06N3/098; G06F9/4881; G06F2209/5017

AI Tagging

Application Domain

Program initiation/switching; Neural architectures

Technology Topics

Computer network; Engineering

AI Technical Summary

Technical Problem

Current AI chip architectures face challenges in efficiently processing dynamic computational graphs due to limitations in data parallelism and model parallelism, leading to repetitive data calculations and suboptimal utilization of computing power, especially in multilayer convolutional neural networks.

Method used

A scheduling method that involves multiple computing units performing convolution operations, determining data replication transmission modes based on placement rules, and replicating valid data rows to optimize computing power utilization, reducing repetitive calculations.

Benefits of technology

The method enhances computing power utilization by minimizing redundant operations and improving efficiency in processing multilayer convolutional neural networks.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

The present invention provides a scheduling method, a scheduling device, an electronic device, and a storage medium. The scheduling method includes: a plurality of computing units each performing a first convolution calculation on a corresponding plurality of data sets to obtain a corresponding plurality of first calculation result sets, the plurality of first calculation result sets being for constituting a first convolution layer obtained by the first convolution calculation; determining a data duplication and transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units according to an arrangement rule for the plurality of computing units of a second convolution layer obtained by the plurality of computing units performing a second convolution calculation on the first convolution layer; and obtaining a first intermediate data row required for padding in the second convolution calculation process by the first computing unit from the first calculation result set in the second computing unit based on the data duplication and transmission mode corresponding to the first computing unit among the plurality of computing units that is to pad a valid data row. The scheduling method can effectively reduce repeated data calculations and improve the utilization rate of chip computing capacity.

Description

Technical Field

[0001] This application claims the priority of Chinese Patent Application No. 202211188739.0 filed on September 27, 2022, and the entire content disclosed in the above Chinese patent application is incorporated herein by reference as part of this application.

[0002] Embodiments of the present disclosure relate to a scheduling method, a scheduling device, an electronic device, and a storage medium.

Background Art

[0003] An artificial intelligence (AI) chip is a chip specifically designed to accelerate the execution of neural network operations. With the development of AI, the number of parameters in algorithm models has increased exponentially, and the demand for computing power keeps growing.

[0004] Conventional hardware architectures (e.g., a central processing unit (CPU)) must balance different service needs at the architecture design stage, so the computing power they can offer to AI applications is limited. To obtain high computing power, current AI chips widely use general-purpose graphics processing units (GPGPUs) as well as domain-specific accelerators (DSAs) built on homogeneous multi-core architectures.

Summary of the Invention

Means for Solving the Problems

[0005] At least one embodiment of the present disclosure provides a scheduling method for a multilayer convolutional neural network, wherein a plurality of computing units each perform a first convolution operation on a plurality of corresponding datasets to obtain a plurality of corresponding first result sets, the plurality of first result sets for constituting a first convolutional layer obtained by the first convolution operation, wherein the plurality of computing units include a first computing unit and a second computing unit, and the plurality of computing units determine a data replication transmission mode corresponding to the plurality of first result sets in the plurality of computing units according to placement rules in the plurality of computing units for a second convolutional layer obtained by performing a second convolution operation on the first convolutional layer, and the first computing unit to pad the valid data rows among the plurality of computing units obtains first intermediate data rows necessary for padding in the second convolution operation process by the first computing unit from the first result set in the second computing unit based on the corresponding data replication transmission mode.

[0006] At least one embodiment of the present disclosure further provides a scheduling device comprising: a computing control module configured such that a plurality of computing units each perform a first convolution calculation on a plurality of corresponding datasets to obtain a plurality of corresponding first calculation result sets, wherein the plurality of first calculation result sets constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units include a first computing unit and a second computing unit; an allocation scheduling module configured to determine a data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units according to placement rules in the plurality of computing units for a second convolutional layer obtained by the plurality of computing units performing a second convolution calculation on the first convolutional layer; and a data transmission module configured so that the first computing unit, which among the plurality of computing units is to pad valid data rows, obtains from the first calculation result set in the second computing unit, based on the corresponding data replication transmission mode, first intermediate data rows necessary for padding in the second convolution calculation performed by the first computing unit.

[0007] At least one embodiment of the present disclosure further provides an electronic device comprising a scheduling device according to any one embodiment of the present disclosure.

[0008] At least one embodiment of the present disclosure further provides an electronic device comprising a processor and a memory including at least one computer program module, wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is for implementing a scheduling method described in any one embodiment of the present disclosure.

[0009] At least one embodiment of the present disclosure further provides a storage medium on which non-transitory computer-readable instructions are stored, and when the non-transitory computer-readable instructions are executed by a computer, the scheduling method described in any one embodiment of the present disclosure is implemented.

[0010] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the drawings of the embodiments are briefly described below. The drawings described below relate only to some embodiments of this disclosure and do not limit this disclosure.

Brief Description of the Drawings

[0011]
[Figure 1A] Figure 1A is a schematic diagram of model parallelism.
[Figure 1B] Figure 1B is a schematic diagram of data parallelism.
[Figure 2] Figure 2 is a schematic diagram of the convolution computation process based on data parallelism, as shown in the computation graph.
[Figure 3] Figure 3 is a schematic flowchart of a scheduling method according to at least one embodiment of the present disclosure.
[Figure 4] Figure 4 is a schematic flowchart of step S20 in Figure 3.
[Figure 5] Figure 5 is a schematic flowchart of step S30 in Figure 3.
[Figure 6A] Figure 6A is a schematic diagram of a data replication transmission process according to at least one embodiment of the present disclosure.
[Figure 6B] Figure 6B is a schematic diagram of a data replication transmission process according to at least one embodiment of the present disclosure.
[Figure 6C] Figure 6C is a schematic diagram of a data replication transmission process according to at least one embodiment of the present disclosure.
[Figure 6D] Figure 6D is a schematic diagram of a data replication transmission process according to at least one embodiment of the present disclosure.
[Figure 7] Figure 7 is a schematic block diagram of a scheduling device according to at least one embodiment of the present disclosure.
[Figure 8] Figure 8 is a schematic block diagram of an electronic device according to at least one embodiment of the present disclosure.
[Figure 9] Figure 9 is a schematic block diagram of another electronic device according to at least one embodiment of the present disclosure.
[Figure 10] Figure 10 is a schematic block diagram of another electronic device according to at least one embodiment of the present disclosure.
[Figure 11] Figure 11 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure.

Modes for Carrying Out the Invention

[0012] To further clarify the objectives, technical solutions, and advantages of the embodiments of this disclosure, the technical solutions of the embodiments of this disclosure will be described clearly and completely below with reference to the drawings of the embodiments of this disclosure. Clearly, the embodiments described are a part of the embodiments of this disclosure, but not all of them. Any other embodiments that a person skilled in the art can obtain based on the embodiments of this disclosure without requiring inventive work are all within the scope of protection of this disclosure.

[0013] Unless otherwise defined, technical or scientific terms used in this disclosure should have the ordinary meaning understood by those skilled in the art to which this disclosure belongs. The terms “first,” “second,” and similar terms used in this disclosure do not indicate any order, number, or importance, but are merely used to distinguish different components. Terms such as “comprises” or “includes” mean that the element or component stated before the term covers the elements or components listed after the term and their equivalents, but do not exclude other elements or components. Terms such as “connection” or “linking” are not limited to physical or mechanical connections, but also include electrical connections, whether direct or indirect. “Up,” “down,” “left,” “right,” etc., refer only to relative positional relationships, and such relative positional relationships may change if the absolute position of the described object changes.

[0014] The following describes the present disclosure by way of several specific examples. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known members (elements) may be omitted. When any one member (element) of the embodiments of the present disclosure appears in one or more drawings, the member (element) is denoted by the same or similar reference numerals in each drawing.

[0015] The core idea of a domain-specific accelerator (DSA) is to perform specialized tasks with dedicated hardware. A DSA serves a class of applications in a domain rather than one specific application, so it can balance flexibility and specialization. A DSA accelerator realizes the expansion of computing power and the acceleration of calculations by interconnecting multiple computing units (for example, processing elements (PEs) or processing cores) on a single chip. The multiple processing units are connected to each other via a communication line, which may be, for example, a bus or an on-chip network. How such an AI accelerator with multiple interconnected PEs can effectively schedule hardware resources to efficiently complete the inference of a deep neural network model is the main challenge faced by an AI compiler.

[0016] Currently, the multi-core compilation scheduling scheme may have the following two implementation methods.

[0017] One approach is model parallelism, in which different devices are responsible for calculating different parts of the computational graph. For example, in some domain-specific accelerators, each operator in the computational graph is first statically assigned to a different PE inside the chip, and the input data is then input to the PE where the first operator is located. After that PE completes its calculation, the calculated data is transmitted to the next PE, and so on until the calculations of all operators are completed. In model parallelism, each PE executes only the operators held in its local internal memory, and a PE cannot start calculating until the calculation of the preceding operator has ended. Operations are thus executed continuously in a pipelined manner.

[0018] For example, as shown in Figure 1A, three different types of operators are loaded into PE0, PE1, and PE2 respectively. For example, these three types of operators are convolution operators corresponding to different layers of the neural network. For example, PE0 has a convolution operator A, PE1 has a convolution operator B, and PE2 has a convolution operator C. During calculation, the server first loads the input data into PE0 where the convolution operator A is located. After PE0 completes the first convolution calculation on the input data, the result of the first convolution calculation is transmitted to PE1, and the second convolution calculation corresponding to the convolution operator B is executed in PE1. By analogy, after the execution of the convolution operator C in PE2 is completed, the final result obtained in PE2 is returned to the host.
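The pipeline of Figure 1A can be pictured with a minimal Python sketch. It is only an illustration: the three lambdas are hypothetical stand-ins for convolution operators A, B, and C, and the function name run_model_parallel is not from the patent.

```python
from typing import Callable, List

Operator = Callable[[List[float]], List[float]]

def run_model_parallel(pipeline: List[Operator], data: List[float]) -> List[float]:
    """Pass the data through one operator per PE, in order (PE0 -> PE1 -> PE2)."""
    for op in pipeline:
        # The next PE can only start once the previous PE has delivered its result.
        data = op(data)
    return data

# Hypothetical stand-ins for convolution operators A, B and C (not real convolutions).
op_a = lambda xs: [x + 1.0 for x in xs]
op_b = lambda xs: [x * 2.0 for x in xs]
op_c = lambda xs: [x - 3.0 for x in xs]

result = run_model_parallel([op_a, op_b, op_c], [1.0, 2.0, 3.0])
print(result)  # [1.0, 3.0, 5.0]
```

Because each stage consumes the previous stage's output, the assignment of operators to PEs is fixed before any data arrives.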

[0019] However, model parallelism must be statically compiled, with the operators in the computational graph assigned in advance to a fixed number of PEs. Therefore, it cannot process a computational graph containing dynamic shapes. A dynamic shape means that the shape of a tensor depends on specific operations and cannot be obtained by pre-calculation; that is, the dynamic computational graph is constructed step by step as the calculation proceeds. Therefore, the model parallelism method, in which each PE holds only part of the computational graph, cannot process a dynamic computational graph.

[0020] Another method is data parallelism, where each device has its own complete computation graph. For example, in some GPUs, each operator in the computation graph is first loaded sequentially into the GPU device, each operator is placed in parallel on multiple PEs, then the input data is decomposed into multiple parts and loaded into multiple PEs, the multiple PEs process the input data in parallel, and after all operators have completed their calculations, the calculation results from the multiple PEs are aggregated and sent back to the host.

[0021] For example, as shown in Figure 1B, PE0, PE1, and PE2 are each loaded with three different operators, which are convolution operators corresponding to different layers of a neural network. For example, PE0, PE1, and PE2 each have convolution operators A, B, and C. During computation, the server first decomposes the input data into multiple parts, for example, uniformly into three parts of input data, then loads the three parts of input data into PE0, PE1, and PE2 respectively, and PE0, PE1, and PE2 sequentially execute multiple convolution calculations corresponding to convolution operators A, B, and C on the input data in parallel. After the computation is complete, the system collects the final results from PE0, PE1, and PE2 and sends them back to the host.
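For comparison, the data-parallel arrangement of Figure 1B can be sketched as follows. This is a simplified illustration, not the patented method: the operators are hypothetical elementwise stand-ins (so the padding issue of real convolutions does not arise here), the loop over parts models PEs that would run concurrently on hardware, and the names split_evenly and run_data_parallel are assumptions.

```python
def split_evenly(data, n_parts):
    """Split a list into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks

def run_data_parallel(operators, data, n_pes):
    """Every PE holds all operators and applies them to its own share of the data."""
    outputs = []
    for part in split_evenly(data, n_pes):  # one loop iteration models one PE
        for op in operators:
            part = op(part)
        outputs.append(part)
    # Gather the per-PE results in order, as when sending them back to the host.
    return [x for part in outputs for x in part]

# Hypothetical elementwise stand-ins for operators A, B and C.
ops = [
    lambda xs: [x + 1.0 for x in xs],
    lambda xs: [x * 2.0 for x in xs],
    lambda xs: [x - 3.0 for x in xs],
]
gathered = run_data_parallel(ops, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)
print(gathered)  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
```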

[0022] There is still no uniform, workable scheme for implementing data parallelism, and scheduling must be optimized for different hardware characteristics. When actual models are deployed, some models have large amounts of operator data, while others have small amounts. Therefore, different deployment schemes must be applied to the AI chip depending on the model to obtain optimal inference performance.

[0023] In current data parallelism schemes, for large amounts of input data, it is not possible to place all the input data necessary for the calculation within a single PE at once. Therefore, it is generally necessary to decompose the input data into multiple groups, place each group of data in a different PE, have multiple PEs perform parallel calculations, and then, after the calculations are completed, collect the calculation results from the multiple PEs to obtain the final output result corresponding to the input data.

[0024] However, in the inference computation of a multilayer neural network, if the padding operation of the convolution computation is taken into account, directly decomposing a large picture or a large amount of data into multiple sets easily introduces the problem of repeated computation of data.

[0025] For example, one example computation graph includes the following three convolution operators (conv):
%5:[(1 224 224 64),F,FP32]=conv(%0:[(1 224 224 64),F,FP32], %1:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1),
%10:[(1 224 224 64),F,FP32]=conv(%5:[(1 224 224 64),F,FP32], %6:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1), and
%15:[(1 224 224 64),F,FP32]=conv(%10:[(1 224 224 64),F,FP32], %11:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1).

[0026] %0, %5, %10, and %15 represent tensor data. For example, the input image of a convolution layer may be represented as [B,H,W,C], meaning that the dimensions of the input data are B×H×W×C. For example, %0 indicates that the size of the input image for the first convolution layer of the neural network is [1,224,224,64]: the batch size is 1, the input image has 64 channels, and each channel has image dimensions (height H × width W) of 224×224. The size of the convolution kernel used by each convolution operator conv is represented as kh×kw; kh=3 and kw=3 indicate that the convolution kernel is 3×3. The padding operation of each convolution operator conv is indicated by pad: pad_h_top=1 pads one row (for example, of zeros) at the top of the picture in the height h direction; pad_h_bottom=1 pads one row at the bottom of the picture in the height h direction; pad_w_left=1 pads one column at the left side of the picture in the width w direction; and pad_w_right=1 pads one column at the right side of the picture in the width w direction. The sliding step of the convolution kernel in each convolution operator conv is denoted by stride: stride_h=1 indicates that the kernel moves one pixel at a time in the height h direction of the picture, and stride_w=1 indicates that it moves one pixel at a time in the width w direction. The calculation is performed in floating-point arithmetic with FP32 precision.
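As a quick check on these parameters, the standard output-size formula for a convolution shows why each layer keeps the 224×224 size. The helper name conv_output_size is an assumption made for this sketch; the formula itself is the usual one for padded, strided convolution.

```python
def conv_output_size(size, kernel, pad_before, pad_after, stride):
    """Output length along one axis of a padded, strided convolution."""
    return (size + pad_before + pad_after - kernel) // stride + 1

# Height axis: H=224, kh=3, pad_h_top=1, pad_h_bottom=1, stride_h=1.
out_h = conv_output_size(224, kernel=3, pad_before=1, pad_after=1, stride=1)
# Width axis: W=224, kw=3, pad_w_left=1, pad_w_right=1, stride_w=1.
out_w = conv_output_size(224, kernel=3, pad_before=1, pad_after=1, stride=1)
print(out_h, out_w)  # 224 224
```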

[0027] For example, Figure 2 is a schematic diagram showing a convolution calculation process using data parallelism as shown in the computation graph, and this computation graph includes the three convolution operators described above. For example, if it is not possible to place all the data necessary for the calculation in the internal memory of one PE at once, it is necessary to decompose the input data into multiple datasets based on the internal memory and number of PEs. For example, in Figure 2, there are three PEs and the size of the input data is 224 × 224, so the input data is decomposed into three datasets, each loaded into one of the three PEs, and the convolution calculation is performed.

[0028] The decomposition of the input data, and the size and range of the decomposed datasets, may be chosen based on the output data. For example, in the computation graph shown in Figure 2, the output data %15 has a size of 224 × 224; if it is decomposed into three approximately uniform parts along the height dimension H, the three output datasets, which correspond to the three PEs and to the rounds round1, round2, and round3, have heights of 75 rows, 75 rows, and 74 rows, respectively. To split the input data, the range of input data needed by each part is obtained by forward recursion from that part's output data, traversing the computation graph depth-first.
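The near-uniform split of the 224 output rows can be reproduced with a short sketch. The function name split_output_rows and the choice to give the extra rows to the first parts are assumptions; they happen to yield the 75/75/74 split described above.

```python
def split_output_rows(height, n_parts):
    """Return inclusive (first_row, last_row) ranges that tile `height` rows."""
    base, extra = divmod(height, n_parts)
    ranges, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)  # earlier parts absorb the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges

rounds = split_output_rows(224, 3)
print(rounds)  # [(0, 74), (75, 149), (150, 223)]
```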

[0029] It should be noted that all the data shown in the computation graph in Figure 2 is "valid data". For simplicity, "valid data" here refers to the actually meaningful data among the data obtained each time a convolution calculation is performed. In the actual calculation, padding operations are necessary to keep the image size unchanged after the convolution calculation. When the convolution kernel covers a region consisting of the data at the "edges" of a decomposed dataset and the padded "0s", that region is an invalid region, and the result obtained by performing a convolution calculation on it is invalid data and must be discarded. That is, because the output data obtained by convolving the complete input data is valid data, for a decomposed dataset only the data obtained by convolving the data in the valid region is valid data. For clarity of illustration and conciseness of explanation, all the data shown in the computation graphs in the drawings of this application is valid data.

[0030] For example, in Figure 2, round1 shows that the output data %15 has 75 rows, i.e., [0,74] rows of valid data. Since the padding operation is pad_h_top=1 and pad_h_bottom=1, the intermediate data %10 obtained by forward recursion from %15 must have 76 rows, i.e., [0,75] rows of valid data. The intermediate data %5 obtained by forward recursion from intermediate data %10 must have 77 rows, i.e., [0,76] rows of valid data. By sequential analogy, the final input data %0 obtained must have 78 rows of valid data. Thus, the range of read data in the height direction of the input data %0 obtained in round1 is [0,77] (row numbers start from 0), totaling 78 rows.

[0031] Similarly, in round2 of Figure 2, it is shown that the output data %15 has 75 rows, i.e., [75,149] of valid data. Since the padding operation is pad_h_top=1 and pad_h_bottom=1, the intermediate data %10 obtained by forward recursion from %15 must have 77 rows, i.e., [74,150] of valid data, and the intermediate data %5 obtained by forward recursion from intermediate data %10 must have 79 rows, i.e., [73,151] of valid data, and by sequential analogy, the final input data %0 obtained must have 81 rows of valid data. Thus, the range of read data in the height direction of the input data %0 obtained in round2 is [72,152], totaling 81 rows.

[0032] Similarly, round3 in Figure 2 shows that the output data %15 has 74 rows, i.e., [150,223] rows of valid data. Since the padding operation is pad_h_top=1 and pad_h_bottom=1, the intermediate data %10 obtained by forward recursion from %15 must have 75 rows, i.e., [149,223] rows of valid data, and the intermediate data %5 obtained by forward recursion from intermediate data %10 must have 76 rows, i.e., [148,223] rows of valid data, and by sequential analogy, the final input data %0 obtained must have 77 rows of valid data. Thus, the range of read data in the height direction of the input data %0 obtained in round3 is [147,223], totaling 77 rows.
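The row ranges quoted for round1, round2, and round3 can be re-derived with a small sketch of the recursion from output rows back to input rows. The function names are assumptions, and the sketch assumes every layer has the same height (224), which holds here because each convolution uses a 3×3 kernel, stride 1, and one row of padding on each side.

```python
def input_rows(out_range, height, kernel=3, pad_top=1, stride=1):
    """Valid input rows needed to produce the given inclusive range of output rows."""
    first, last = out_range
    lo = max(0, first * stride - pad_top)                       # clamp at the top edge
    hi = min(height - 1, last * stride - pad_top + kernel - 1)  # clamp at the bottom edge
    return lo, hi

def trace_back(out_range, layers, height=224):
    """Row ranges from the final output (%15) back to the input (%0)."""
    ranges = [out_range]
    for _ in range(layers):
        ranges.append(input_rows(ranges[-1], height))
    return ranges

round1 = trace_back((0, 74), layers=3)
round2 = trace_back((75, 149), layers=3)
round3 = trace_back((150, 223), layers=3)
print(round1)  # [(0, 74), (0, 75), (0, 76), (0, 77)]
print(round2)  # [(75, 149), (74, 150), (73, 151), (72, 152)]
print(round3)  # [(150, 223), (149, 223), (148, 223), (147, 223)]
```

Each list is ordered %15, %10, %5, %0, so the last entry is the range of input rows that the round must read.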

[0033] As is clearly visible from the computation graph in Figure 2, the repeated computation area of round1 and round2 is rows [72,77], i.e., 6 repeated rows, and the repeated computation area of round2 and round3 is rows [147,152], i.e., 6 repeated rows. As can be seen, the deeper the convolutional network, the more steps of forward recursion are needed and the more valid data each layer requires from the preceding layer at the edges of a dataset because of padding; that is, the area that must be computed repeatedly grows, which wastes a large amount of computing power.
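How the repeated area grows with network depth can be illustrated with a sketch that widens the round1 and round2 output ranges layer by layer and counts the input rows they share. The helper names are assumptions, and the sketch covers only the configuration used in this example (3×3 kernel, stride 1, one padded row per side, height 224).

```python
def widen(row_range, height=224):
    """Input rows needed for one 3x3, stride-1, pad-1 layer, clamped to the image."""
    lo, hi = row_range
    return max(0, lo - 1), min(height - 1, hi + 1)

def overlap_rows(a, b):
    """Number of rows shared by two inclusive row ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def repeated_rows(depth):
    """Input rows read by both round1 and round2 for a network of `depth` layers."""
    r1, r2 = (0, 74), (75, 149)
    for _ in range(depth):
        r1, r2 = widen(r1), widen(r2)
    return overlap_rows(r1, r2)

growth = [repeated_rows(d) for d in range(1, 6)]
print(growth)  # [2, 4, 6, 8, 10]
```

For the three-layer graph of Figure 2 this gives the 6 repeated rows noted above, and in this configuration each extra layer adds two more.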

[0034] At least one embodiment of the present disclosure provides a scheduling method for a multilayer convolutional neural network. The scheduling method comprises a plurality of computing units each performing a first convolution operation on a plurality of corresponding datasets to obtain a plurality of corresponding first result sets, the plurality of first result sets for constituting a first convolutional layer obtained by the first convolution operation, wherein the plurality of computing units include a first computing unit and a second computing unit; determining a data replication transmission mode for the plurality of first result sets in the plurality of computing units according to a placement rule in the plurality of computing units for the second convolutional layer obtained by the plurality of computing units performing a second convolutional operation on the first convolutional layer; and obtaining a first intermediate data row from the first result set in the second computing unit necessary for padding in the second convolution operation process by the first computing unit, based on the data replication transmission mode of the first computing unit to be padded with valid data rows among the plurality of computing units. This scheduling method enables balanced utilization of the computing power of computing units by replicating and transmitting valid data from one computing unit to another, thereby reducing repetitive data calculations and improving the utilization rate of computing power.

[0035] At least one embodiment of the present disclosure further provides a scheduling device, electronic equipment, and a storage medium. The scheduling device, electronic equipment, and storage medium similarly enable balanced utilization of the computing power of computing units by replicating and transmitting valid data in one computing unit to another computing unit, thereby reducing repetitive data calculations and improving the utilization rate of computing power.

[0036] The embodiments of this disclosure will be described in detail below with reference to the drawings. It should be noted that the same reference numerals in different drawings indicate the same element described.

[0037] Figure 3 is a schematic flowchart of a scheduling method according to at least one embodiment of the present disclosure. As shown in Figure 3, the scheduling method includes steps S10 to S30.

[0038]
Step S10: Multiple computing units each perform a first convolution calculation on multiple corresponding datasets to obtain multiple corresponding first calculation result sets. These multiple first calculation result sets constitute a first convolution layer obtained by the first convolution calculation, and the multiple computing units include a first computing unit and a second computing unit.
Step S20: The data replication transmission mode corresponding to the multiple first calculation result sets in the multiple computing units is determined according to the arrangement rules, in the multiple computing units, for the second convolutional layer obtained by the multiple computing units performing the second convolution calculation on the first convolutional layer.
Step S30: The first computing unit, which is to pad the valid data rows among the multiple computing units, obtains the first intermediate data rows necessary for padding in the second convolution calculation process by the first computing unit from the first calculation result set in the second computing unit, based on the corresponding data replication transmission mode.
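The general idea of steps S10 to S30 can be pictured with a deliberately simplified sketch; it is not the patented implementation. It assumes a one-dimensional "image" of 224 rows with one scalar per row, a hypothetical 3-tap kernel, three PEs whose second-layer rows are placed with the same split as the first layer, and a replication mode in which each PE copies exactly one boundary row from each neighbouring PE. All function names are assumptions.

```python
KERNEL = (1.0, 2.0, 1.0)  # hypothetical 3-tap kernel along the height axis

def split_rows(rows, n_pes):
    """Contiguous near-uniform split of the rows across the PEs."""
    base, extra = divmod(len(rows), n_pes)
    parts, start = [], 0
    for p in range(n_pes):
        size = base + (1 if p < extra else 0)
        parts.append(rows[start:start + size])
        start += size
    return parts

def with_halo(datasets, p):
    """Copy one boundary row from each neighbouring PE; zero-pad at the image edge."""
    top = datasets[p - 1][-1] if p > 0 else 0.0
    bottom = datasets[p + 1][0] if p < len(datasets) - 1 else 0.0
    return [top] + datasets[p] + [bottom]

def conv3(rows):
    """Valid 3-tap convolution: the output has two rows fewer than the input."""
    return [KERNEL[0] * rows[i] + KERNEL[1] * rows[i + 1] + KERNEL[2] * rows[i + 2]
            for i in range(len(rows) - 2)]

def conv_layer(datasets):
    """One layer on every PE, each PE using only its own rows plus copied halo rows."""
    return [conv3(with_halo(datasets, p)) for p in range(len(datasets))]

image = [float(i % 7) for i in range(224)]
inputs = split_rows(image, 3)
first_result_sets = conv_layer(inputs)               # S10 (input halo read from the original image)
second_result_sets = conv_layer(first_result_sets)   # S20/S30: copy boundary rows, then convolve
distributed = [v for part in second_result_sets for v in part]

# Reference: the same two layers computed on the undecomposed image.
reference = conv3([0.0] + conv3([0.0] + image + [0.0]) + [0.0])
print(distributed == reference)  # True
```

In this sketch no first-layer row is computed twice: the rows a PE lacks for padding are copied from its neighbour's first result set rather than recomputed from an enlarged input range.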

[0039] For example, the scheduling method may be used in computing devices that perform convolution operations on multilayer convolutional neural networks, such as AI chips and general-purpose graphics processing units (GPGPUs). For example, the AI chip may be an AI chip that uses a DSA accelerator or an AI chip having multiple PE units. The method may also be used, for example, for data-parallel distributed training. The embodiments of this disclosure are not limited thereto.

[0040] It should be explained that in at least one embodiment of this disclosure, “first convolution calculation” may refer to a single convolution calculation performed on a complete input image, or to a single convolution calculation performed on the current input dataset in the computation unit. “First convolution calculation” may not be a convolution calculation performed only on the first layer, but may be any single convolution calculation in a multi-layer convolution calculation. For example, “first convolution calculation” may be a single convolution calculation performed on the first layer, or a single convolution calculation performed on the second layer, third layer, etc.

[0041] In at least one embodiment of this disclosure, “second convolution calculation” refers to the convolution calculation that follows the “first convolution calculation,” i.e., the convolution calculation performed on the “first convolution layer” obtained by the first convolution calculation. The “first convolution layer” is a convolution layer consisting of the calculation results obtained by the “first convolution calculation”; it refers to the actual convolution layer that would be obtained by performing the first convolution calculation on the complete, undecomposed input image. For example, the first convolution layer may be the input data for the second convolution calculation. Similarly, the “second convolution layer” is a convolution layer consisting of the calculation results obtained by the “second convolution calculation”; it refers to the actual convolution layer that would be obtained by performing the second convolution calculation on the complete, undecomposed first convolution layer.

[0042] The scheduling method according to the embodiment of this disclosure will be described in detail below with reference to the calculation graphs shown in Figures 6A to 6D. In the following exemplary description, three PEs (Figures 6A to 6C) will be described as examples, but the embodiment of this disclosure is not limited to these, and the computing device may include, for example, two PEs, four PEs (Figure 6D), or more PEs.

[0043] It should be explained that the strip-shaped regions shown in Figures 6A to 6D abstractly represent the extent of at least some of the valid data in a certain dimension (e.g., height H or width W) of the input image that is convolved each time. The diagonal regions, grid lines, and diamond lines abstractly represent the datasets in the calculation units (PE0, PE1, PE2, etc.), and their positions in these strip-shaped regions merely schematically represent the corresponding positions in the input image of those datasets and do not limit the embodiments of this disclosure.

[0044] For example, in step S10, the “multiple datasets” corresponding to the multiple computing units may be multiple initial input datasets obtained by decomposing an initial input matrix (these initial input datasets are input to the computing device), or they may be multiple input datasets that are the objects of any single convolution calculation in the computing device. For example, the multiple datasets may be the datasets 10, 20, and 30 of %0 on which PE0, PE1, and PE2 perform convolution calculations in Figure 6A, or they may be the datasets of %5 on which convolution calculations are performed. For example, the dataset of %5 on which each computing unit performs a convolution calculation includes a first calculation result set and intermediate data rows obtained from another computing unit. That is, in at least one embodiment of the present disclosure, the first calculation result set in the first computing unit and the first intermediate data rows obtained from the second computing unit constitute the dataset necessary for the first computing unit to perform a second convolution calculation.

[0045] For example, in step S10, multiple computing units each perform a first convolution calculation on their corresponding datasets to obtain a corresponding first calculation result set. For example, the first calculation result sets may be used to construct the first convolutional layer obtained by the first convolution calculation. For example, as shown in Figure 6A, the computing units PE0, PE1, and PE2 each perform a first convolution calculation on their corresponding datasets 10, 20, and 30 to obtain first calculation result sets 11, 21, and 31, and these first calculation result sets 11, 21, and 31 may be used to construct the first convolutional layer obtained by the first convolution calculation.

[0046] For example, in the embodiments of this disclosure, the multiple first calculation result sets in the multiple computing units may directly constitute the first convolutional layer obtained by the first convolution calculation, or they may constitute only a part of that first convolutional layer. For example, if the input image is relatively small, the multiple computing units can load all the input data at once, and in this case the first convolutional layer can be formed by collecting the first calculation result sets from the multiple computing units. If the input image is relatively large, the multiple computing units cannot load all the input data at once, and the input data must be loaded into the computing units in multiple batches. If the input data loaded in each batch is decomposed into multiple sets and each set is loaded into one of the multiple computing units, then the first calculation result sets in the multiple computing units constitute only the part of the first convolutional layer corresponding to the input data loaded in that batch.

[0047] For example, in step S10, the data rows in the multiple first calculation result sets are consecutive in the first convolutional layer without overlap. For example, as shown in Figure 6A, the data rows in the first calculation result sets 11, 21, and 31 are consecutive in the first convolutional layer without overlap; that is, there is no overlap among the data in the first calculation result sets 11, 21, and 31 obtained by performing the first convolution calculation on %0 in the calculation units PE0, PE1, and PE2.

[0048] For example, in the embodiments of this disclosure, “data row” may represent a row of data or a column of data in the input image. For example, in the dataset decomposed in height dimension H shown in Figure 6A, “data row” may represent a row of data in the height direction of the input image, and of course, “data row” may represent a column of data in the width direction of the input image, and the embodiments of this disclosure are not limited thereto.

[0049] For example, in step S20, the data replication transmission mode corresponding to multiple sets of first calculation results in multiple computing units is determined according to the arrangement rules for the second convolutional layer in multiple computing units obtained by multiple computing units performing a second convolutional calculation on the first convolutional layer. It should be explained that "performing a second convolutional calculation on the first convolutional layer" here does not refer to the actual convolutional calculation in the computing unit, but rather to the rules based on the convolutional calculation. Based on the size of the first convolutional layer, the size of the second convolutional layer to be obtained by performing the second convolutional calculation can be determined, and therefore, "the second convolutional layer to be obtained" does not mean that it is necessary to obtain the specific calculation results of the data in the second convolutional layer. For example, the decomposition mode for the second convolutional layer may be determined based on the size of the second convolutional layer, that is, the arrangement rules for the second convolutional layer in multiple computing units may be determined.
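As a minimal sketch of how the size of a convolutional layer can be determined from the size of its input without computing any data, the standard output-length formula for one dimension can be used. The function name `conv_out_size` is hypothetical; the parameter values (height 224, 3 × 3 kernel, padding 1, stride 1) are those of the conv operators given later in this description.

```python
def conv_out_size(in_size: int, kernel: int, pad_before: int,
                  pad_after: int, stride: int) -> int:
    """Output length of a convolution along one dimension (standard formula)."""
    return (in_size + pad_before + pad_after - kernel) // stride + 1


# Height of the first and second convolutional layers for a 224-row input,
# kh=3, pad_h_top=1, pad_h_bottom=1, stride_h=1.
first_layer_h = conv_out_size(224, kernel=3, pad_before=1, pad_after=1, stride=1)
second_layer_h = conv_out_size(first_layer_h, kernel=3, pad_before=1, pad_after=1, stride=1)
print(first_layer_h, second_layer_h)  # 224 224
```

Only sizes are involved here, which is why the arrangement rules can be determined before any second convolution calculation is actually performed.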

[0050] Figure 4 is a schematic flowchart of step S20 in Figure 3. For example, in some cases, step S20 may further include steps S21 to S23, as shown in Figure 4.

[0051] Step S21: Based on the size of the internal memory in multiple computing units, determine the placement rules for the second convolutional layer in multiple computing units.
Step S22: Determine the distribution of the second convolutional layer in multiple computing units according to the placement rules.
Step S23: Depending on the distribution, determine the data replication transmission mode corresponding to the first calculation result set in multiple computing units.

[0052] For example, in step S21, the size of the memory (e.g., internal memory) in multiple computing units is obtained, and the placement rules for the second convolutional layer in the multiple computing units are determined based on the size of the internal memory in each computing unit. For example, the second convolutional layer is decomposed into multiple data parts, and the size of the data in each part must not exceed the size of the available capacity of the internal memory allocated to its corresponding computing unit; this is a prerequisite for determining the placement rules.
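The memory prerequisite of step S21 can be illustrated with a small sketch. It assumes one possible placement rule (an even split of rows, capped by each unit's capacity); the function name `split_rows_by_memory`, the 8 MiB capacity, and the simplification that only the rows of the layer occupy memory (weights and padding rows are ignored) are assumptions for illustration, not part of the disclosure.

```python
def split_rows_by_memory(total_rows, bytes_per_row, capacities):
    """Assign contiguous, inclusive row ranges to units in order.

    Each unit receives an even share of the remaining rows, reduced if
    necessary so that its part does not exceed its available capacity.
    """
    ranges = []
    next_row = 0
    units_left = len(capacities)
    for capacity in capacities:
        remaining = total_rows - next_row
        even_share = -(-remaining // units_left)  # ceiling division
        max_rows = capacity // bytes_per_row      # rows that fit in memory
        rows = min(even_share, max_rows)
        ranges.append((next_row, next_row + rows - 1))
        next_row += rows
        units_left -= 1
    if next_row != total_rows:
        raise ValueError("layer does not fit in the available internal memory")
    return ranges


# One row of a (1 224 224 64) FP32 tensor: 224 * 64 values of 4 bytes each.
BYTES_PER_ROW = 224 * 64 * 4
MIB = 1024 * 1024  # hypothetical capacities below are given in MiB

print(split_rows_by_memory(224, BYTES_PER_ROW, [8 * MIB] * 3))
# [(0, 74), (75, 149), (150, 223)]
```

With equal capacities this reproduces the row ranges used for Figure 6A; a smaller capacity on one unit shifts rows to the units that follow it.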

[0053] For example, in step S22, it is determined how to decompose the second convolutional layer according to the determined arrangement rule, that is, the distribution of the multiple datasets decomposed from the second convolutional layer in multiple computing units is determined. For example, the distribution of the multiple datasets decomposed from the second convolutional layer in the corresponding multiple computing units may be determined in accordance with the distribution of the multiple first computation result sets in multiple computing units and the determined arrangement rule.

[0054] For example, in step S23, a data replication transmission mode corresponding to the first computation result set in a plurality of computing units is determined according to the distribution of the second convolutional layer in a plurality of computing units determined according to the placement rules. For example, the data replication transmission mode may include a one-way data transmission mode and a two-way data transmission mode. For example, a one-way data transmission mode indicates that a computing unit can only retrieve data replicated from another unit, and a two-way data transmission mode indicates that a computing unit can not only retrieve data replicated from another computing unit but also replicate its own data and transmit it to the other computing unit. For example, a plurality of computing units may be configured to have the same data replication transmission mode, or a plurality of computing units may be configured to have different data replication transmission modes, and the data replication transmission mode is not limited to the two modes described above, but may be any other feasible implementation, and the embodiments of this disclosure are not limited thereto.
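The distinction between the two modes can be sketched as follows. This is one possible interpretation, in which the mode is classified per pair of computing units from a table of planned row transfers; the function name `link_modes` and the table layout are hypothetical. The transfer counts are those stated in this description for Figures 6A and 6B.

```python
def link_modes(transfers):
    """Classify each pair of units as 'one-way' or 'two-way'.

    transfers maps (source_unit, destination_unit) to the number of rows
    that the destination copies from the source.
    """
    modes = {}
    for (src, dst), rows in transfers.items():
        if rows <= 0:
            continue
        pair = (min(src, dst), max(src, dst))
        reverse_rows = transfers.get((dst, src), 0)
        modes[pair] = "two-way" if reverse_rows > 0 else "one-way"
    return modes


# Figure 6A: neighbouring units exchange one row in each direction.
fig_6a = {(1, 0): 1, (0, 1): 1, (2, 1): 1, (1, 2): 1}
# Figure 6B: PE1 copies two rows from PE0, PE2 copies two rows from PE1.
fig_6b = {(0, 1): 2, (1, 2): 2}

print(link_modes(fig_6a))  # {(0, 1): 'two-way', (1, 2): 'two-way'}
print(link_modes(fig_6b))  # {(0, 1): 'one-way', (1, 2): 'one-way'}
```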

[0055] For example, in one example, the arrangement rule may be such that the distribution of the data of the second convolutional layer in multiple computing units is the same as the data range of the first calculation result set in all of the multiple computing units. For example, as shown in Figure 6A, multiple computing units PE0, PE1, and PE2 perform a first convolution calculation on the %0 dataset to obtain a first convolutional layer consisting of multiple first calculation result sets 11, 21, and 31. From this, the size of the second convolutional layer obtained by performing a second convolution calculation on the first convolutional layer can be calculated and obtained. Therefore, it is planned to decompose the data of the second convolutional layer and allocate it to multiple computing units PE0, PE1, and PE2 according to the distribution method of %10 in Figure 6A, and according to the arrangement rule, the data replication transmission mode when each computing unit performs a convolution calculation on %5 can be determined. For example, according to the arrangement rule, it can be determined that the data replication transmission mode of the multiple computing units PE0, PE1, and PE2 is all a bidirectional data transmission mode.

[0056] For example, in another example, the arrangement rule may be such that the distribution of the data of the second convolutional layer in some of the multiple computing units is the same as the data range of the first calculation result set in those some computing units. For example, as shown in Figure 6B, after obtaining the size of the second convolutional layer, the data of the second convolutional layer is to be allocated to multiple computing units PE0, PE1, and PE2 according to the distribution method of %10 in Figure 6B, i.e., the data of the second convolutional layer is to be decomposed into three parts, and the data range allocated to PE1 is equal to the size of the data range of the first calculation result set in PE1. This makes it possible to determine the data replication transmission mode when each computing unit performs a convolution calculation on %5 according to the arrangement rule, for example, determining that the data replication transmission mode of the multiple computing units PE0, PE1, and PE2 is all a one-way data transmission mode.

[0057] In the embodiments of this disclosure, the placement rules may be set not only based on the size of the internal memory of the computing units, but may also be flexibly set based on factors such as the allocation of computing power of multiple computing units and data transmission overhead, and the embodiments of this disclosure do not limit the specific content of the placement rules.

[0058] For example, in step S30, the first calculation unit among the multiple calculation units, i.e., the unit that should be padded with valid data rows, is determined. For example, all of the multiple calculation units may need to be padded with valid data rows, or only some of them may. For example, in Figure 6A, PE0, PE1, and PE2 are all first calculation units that should be padded with valid data rows; in Figure 6B, PE1 and PE2 are first calculation units that should be padded with valid data rows; and in Figure 6C, only PE1 is a first calculation unit that should be padded with valid data rows.

[0059] For example, in step S30, after the first computing unit to be padded with valid data rows and the data replication transmission mode corresponding to that unit have been determined, the first computing unit acquires at least one portion of first intermediate row data from the first calculation result set of another computing unit. For example, the first computing unit may acquire two portions of first intermediate row data from the first calculation result sets of two different computing units, both of which are necessary for padding in the second convolution calculation performed by the first computing unit. For example, in Figure 6A, PE0, as the first computing unit, acquires one or more rows of first intermediate row data from the first calculation result set 21 of PE1; PE1, as the first computing unit, acquires one or more rows of first intermediate row data from the first calculation result set 11 of PE0 and one or more rows from the first calculation result set 31 of PE2; and PE2, as the first computing unit, acquires one or more rows of first intermediate row data from the first calculation result set 21 of PE1.

[0060] Figure 5 is a schematic flowchart of step S30 in Figure 3. For example, in some cases, step S30 may further include steps S31 and S32, as shown in Figure 5.

[0061] Step S31: The size of the first intermediate data row is determined according to the relationship between the second calculation result set obtained by the first calculation unit performing the second convolution calculation and the first calculation result set obtained by the first calculation unit performing the first convolution calculation.
Step S32: Based on the data replication transmission mode and the size of the first intermediate data row, the first intermediate data row is obtained from the first calculation result set in the second calculation unit.

[0062] For example, in step S31, the size of the first intermediate data row obtained by the first computing unit from another computing unit is determined by the relationship between the second calculation result set and the first calculation result set, and the relationship between the second calculation result set and the first calculation result set is determined by factors such as the distribution in the computing unit of the second convolutional layer, the size of the convolution kernel, and padding operations.
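The relationship referred to in step S31 can be written out for the common case of a convolution along the height dimension. The symbols below (a, b, k, p, s, H, o) are introduced here for illustration only and are not used elsewhere in this disclosure.

```latex
% Rows of the first convolutional layer needed to produce rows [a, b] of the
% second convolutional layer (kernel height k, top padding p, stride s,
% first-layer height H):
\[
  r_{\min} = \max(a s - p,\; 0), \qquad
  r_{\max} = \min(b s - p + k - 1,\; H - 1).
\]
% If the unit already holds rows [o_{\min}, o_{\max}] of the first layer, the
% numbers of rows to copy from the preceding and following units are
\[
  n_{\mathrm{top}} = \max(o_{\min} - r_{\min},\; 0), \qquad
  n_{\mathrm{bottom}} = \max(r_{\max} - o_{\max},\; 0).
\]
```

For PE1 in Figure 6A, with a = 75, b = 149, k = 3, p = 1, s = 1, H = 224 and held rows [75, 149], this gives rows [74, 150] and therefore one row from each neighbouring unit.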

[0063] For example, in step S32, the acquisition of the first intermediate data row from the first calculation result set in the second calculation unit may be achieved by inter-kernel transmission between the calculation units. For example, the address of the first intermediate data row in the first calculation result set of the second calculation unit is obtained, the first intermediate data row is read and copied to the first calculation unit, and the copied first intermediate data row is written to the corresponding address in the first calculation unit. For example, in Figure 6A, PE1 holds rows [74,150] of valid data for %0, and the storage addresses corresponding to these 77 rows of data are address_0 to address_76. After the first convolution calculation is performed, the first calculation result set in PE1 is rows [75,149] of the first convolutional layer, totaling 75 rows, which are assigned to addresses address_1 to address_75, respectively. The first intermediate data row obtained from the first calculation result set 11 of PE0 may then be written to address_0, and the first intermediate data row obtained from the first calculation result set 31 of PE2 may be written to address_76. The embodiments of this disclosure do not limit the specific implementation method of the data replication transmission process between calculation units.
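The read, copy, and write sequence for PE1 can be sketched with plain Python containers standing in for the internal memories. The names `copy_rows`, `pe0_results`, `pe1_results`, `pe2_results`, and `pe1_buffer` are hypothetical, each string stands in for a full row of data, and a slot index relative to the first needed row stands in for a storage address.

```python
# First calculation result sets, keyed by row index in the first convolutional layer.
pe0_results = {r: f"row{r}" for r in range(0, 75)}     # rows [0, 74]
pe1_results = {r: f"row{r}" for r in range(75, 150)}   # rows [75, 149]
pe2_results = {r: f"row{r}" for r in range(150, 224)}  # rows [150, 223]


def copy_rows(src_results, rows, dst_buffer, dst_first_row):
    """Read the given rows from the source and write each to its slot in dst_buffer."""
    for r in rows:
        dst_buffer[r - dst_first_row] = src_results[r]


# PE1 needs rows [74, 150] for its second convolution calculation.
need_lo, need_hi = 74, 150
pe1_buffer = [None] * (need_hi - need_lo + 1)

copy_rows(pe1_results, range(75, 150), pe1_buffer, need_lo)  # PE1's own results
copy_rows(pe0_results, [74], pe1_buffer, need_lo)            # row copied from PE0
copy_rows(pe2_results, [150], pe1_buffer, need_lo)           # row copied from PE2
```

After these copies the buffer holds the rows in layer order, so the second convolution calculation can read them as one contiguous dataset.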

[0064] For example, the scheduling method further includes step S40 (not shown), which involves decomposing an initial input matrix to obtain multiple initial input datasets, and then transmitting each of the multiple initial input datasets to multiple computing units to perform a first convolution calculation.

[0065] For example, the decomposition mode of the initial input matrix may be determined based on the size of the first convolutional layer obtained by performing a first convolution calculation on the initial input matrix, the number of computation units, the size of the internal memory, etc. For example, the decomposition mode may involve uniformly decomposing the initial input matrix into multiple initial input datasets, or decomposing the initial input matrix to correspond to multiple initial input datasets based on the size of the internal memory of the computation units, and then placing each into multiple computation units. For example, the initial input dataset may be a dataset of objects on which the first convolution calculation is performed.

[0066] For example, the scheduling method further includes step S50 (not shown) of obtaining at least a portion of the computational output of a multilayer convolutional neural network by collecting multiple sets of second computational results from multiple computing units, in response to the second convolutional layer being the output layer in the multilayer convolutional neural network. For example, if the second convolutional layer is the final convolutional layer after the computation of the last convolutional operator has been completed, the second convolutional layer is the output result. That is, the second convolutional layer consisting of sets of second computational results from multiple computing units is the computational output or at least a portion of the computational output of the multilayer convolutional neural network.

[0067] In embodiments of this disclosure, the plurality of computing units include at least two computing units, for example, a first computing unit and a second computing unit, and the multilayer convolutional neural network includes at least two convolutional layers. For example, the example shown in Figures 6A-6C includes three computing units PE0, PE1, and PE2 and two convolution operators (conv): %5:[(1 224 224 64),F,FP32]=conv(%0:[(1 224 224 64),F,FP32], %1:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1), and %10:[(1 224 224 64),F,FP32]=conv(%5:[(1 224 224 64),F,FP32], %6:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1). For example, as shown in Figure 6A, the chip uses three computing units PE0, PE1, and PE2 to perform data-parallel convolution on 224 × 224 input data. For example, step S40 is executed to decompose the initial input matrix approximately uniformly along the height (H) dimension into three initial input datasets 10, 20, and 30, which are loaded into the three computing units PE0, PE1, and PE2, respectively, as shown in %0. For example, initial input dataset 10 contains rows [0,75], totaling 76 rows of data; initial input dataset 20 contains rows [74,150], totaling 77 rows of data; and initial input dataset 30 contains rows [149,223], totaling 75 rows of data. For example, step S10 is executed, and the computing units PE0, PE1, and PE2 each perform a first convolution calculation on the corresponding initial input datasets 10, 20, and 30 to obtain the first calculation result sets 11 ([0,74] rows), 21 ([75,149] rows), and 31 ([150,223] rows) shown in %5. These first calculation result sets constitute the first convolution layer obtained by performing a first convolution calculation on the initial input matrix.
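The row ranges in this example follow from the operator parameters, as the sketch below shows. The function name `input_rows_for` is hypothetical; the parameters kh=3, pad_h_top=1, stride_h=1 and the height 224 are taken from the conv operators above.

```python
def input_rows_for(out_lo, out_hi, kernel=3, pad=1, stride=1, in_size=224):
    """Inclusive range of input rows needed to compute output rows [out_lo, out_hi]."""
    lo = max(out_lo * stride - pad, 0)
    hi = min(out_hi * stride - pad + kernel - 1, in_size - 1)
    return lo, hi


# First calculation result sets 11, 21, and 31 (rows of %5).
result_sets = [(0, 74), (75, 149), (150, 223)]

# Initial input datasets 10, 20, and 30 (rows of %0) that produce them.
datasets = [input_rows_for(lo, hi) for lo, hi in result_sets]
row_counts = [hi - lo + 1 for lo, hi in datasets]

print(datasets)    # [(0, 75), (74, 150), (149, 223)]
print(row_counts)  # [76, 77, 75]
```

The one-row overlaps between neighbouring datasets are the rows that zero padding would otherwise have to stand in for at the boundaries between units.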

[0068] For example, step S20 is performed to determine that the three computing units PE0, PE1, and PE2 all have the same data replication transmission mode, i.e., bidirectional data transmission mode, according to an arrangement rule that ensures the distribution of data for the second convolutional layer in multiple computing units is the same as the data range of the first calculation result set in all of the multiple computing units. Next, step S30 is performed to determine that the three computing units PE0, PE1, and PE2 are all computing units that should perform data replication transmission, and that each computing unit requires one first intermediate data row. Then, each computing unit obtains the first intermediate data row required for padding when performing the second convolution calculation from another computing unit.

[0069] For example, PE0 is the first calculation unit and PE1 is the second calculation unit. As shown by arrow 1, the first calculation unit PE0 obtains the first intermediate data row [75], which is necessary for padding in the second convolution calculation by PE0, from the first calculation result set 21 of the second calculation unit PE1. As shown by arrow 2, the second calculation unit PE1 obtains the second intermediate data row [74], which is necessary for padding in the second convolution calculation by PE1, from the first calculation result set 11 of the first calculation unit PE0. The first and second intermediate data rows that PE0 and PE1 obtain from each other have the same size: each is a single data row.

[0070] For example, PE1 is the first calculation unit and PE2 is the second calculation unit. As shown by arrow 3, the first calculation unit PE1 obtains the first intermediate data row [150], which is necessary for padding in the second convolution calculation by PE1, from the first calculation result set 31 of the second calculation unit PE2. As shown by arrow 4, the second calculation unit PE2 obtains the second intermediate data row [149], which is necessary for padding in the second convolution calculation by PE2, from the first calculation result set 21 of the first calculation unit PE1. The first and second intermediate data rows that PE1 and PE2 obtain from each other have the same size: each is a single data row.

[0071] For example, the first calculation result set [0,74] in PE0 and the first intermediate data row [75] obtained from PE1 constitute the dataset [0,75] required for the second convolution calculation by PE0.

[0072] For example, the first calculation result set [75,149] in PE1, the second intermediate data row [74] obtained from PE0, and the first intermediate data row [150] obtained from PE2 constitute the dataset [74,150] required for the second convolution calculation by PE1.

[0073] For example, the first calculation result set [150,223] in PE2 and the second intermediate data row [149] obtained from PE1 constitute the dataset [149,223] required for the second convolution calculation by PE2.

[0074] The three computing units PE0, PE1, and PE2 perform a second convolution calculation on the datasets [0,75], [74,150], and [149,223] after data replication transmission, obtaining the second calculation result sets [0,74], [75,149], and [150,223] shown in %10. Finally, step S50 is executed, and the combined second calculation result sets [0,74], [75,149], and [150,223] from PE0, PE1, and PE2 constitute the calculation result of the output layer.
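The flow of Figure 6A can be checked end to end with a simplified sketch. It assumes a one-dimensional stand-in in which each row is a single number and each convolution is a 3-tap filter along the height with one row of zero padding at the top and bottom of the full layer; the helper names, kernels, and input values are hypothetical. The sketch compares the partitioned computation with row replication against an unpartitioned reference.

```python
H = 224
OWNED = [(0, 74), (75, 149), (150, 223)]  # result rows held by PE0, PE1, PE2


def conv_rows(values, kernel, pad_top, pad_bottom):
    """3-tap filter over a list of rows, with optional zero rows at either end."""
    padded = [0] * pad_top + list(values) + [0] * pad_bottom
    k = len(kernel)
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(padded) - k + 1)]


def needed(lo, hi):
    """Input rows needed for output rows [lo, hi] (kernel 3, padding 1, stride 1)."""
    return max(lo - 1, 0), min(hi + 1, H - 1)


def run_layer(unit_inputs, kernel):
    """unit_inputs: one (first_row, values) pair per unit. Zero padding is applied
    only where a unit's data touches the edge of the full layer."""
    results = []
    for first_row, values in unit_inputs:
        last_row = first_row + len(values) - 1
        results.append(conv_rows(values, kernel,
                                 pad_top=1 if first_row == 0 else 0,
                                 pad_bottom=1 if last_row == H - 1 else 0))
    return results


x = [(7 * r) % 11 for r in range(H)]  # arbitrary input rows
k1, k2 = [1, 2, 1], [1, 0, -1]        # arbitrary kernels

# Reference: both convolutions on the complete, undecomposed data.
reference = conv_rows(conv_rows(x, k1, 1, 1), k2, 1, 1)

# Step S10: first convolution on the overlapping initial input datasets.
initial = [(lo, x[lo:hi + 1]) for lo, hi in (needed(*r) for r in OWNED)]
first = run_layer(initial, k1)

# Steps S20/S30: each unit copies one boundary row from each neighbour.
copied_rows = 0
second_inputs = []
for i, (lo, hi) in enumerate(OWNED):
    rows = list(first[i])
    if i > 0:
        rows.insert(0, first[i - 1][-1])
        copied_rows += 1
    if i < len(OWNED) - 1:
        rows.append(first[i + 1][0])
        copied_rows += 1
    second_inputs.append((needed(lo, hi)[0], rows))

# Second convolution, then step S50: collect the second calculation result sets.
second = run_layer(second_inputs, k2)
output = [v for part in second for v in part]

print(output == reference, copied_rows)  # True 4
```

In this sketch the four copied rows take the place of first-layer rows that each unit would otherwise have to compute again from a wider slice of the input.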

[0075] Therefore, the scheduling method according to at least one embodiment of the present disclosure can improve the computational efficiency of a multilayer convolutional neural network: valid data is replicated and transmitted between computing units and used as padding rows, which reduces the amount of repeated data computation in the convolutional neural network.

[0076] In another example shown in Figure 6B, in step S20, it can be determined that the three computing units PE0, PE1, and PE2 all have the same one-way data transmission mode. In step S30, it is determined that computing units PE1 and PE2 are the computing units that should perform data replication transmission, and that each computing unit requires two first intermediate data rows. That is, PE1 obtains two rows of valid data from PE0 for padding, and PE2 obtains two rows of valid data from PE1 for padding. For example, in Figure 6C, compared to Figure 6B, it may be that only computing unit PE1 is the computing unit that should perform data replication transmission. That is, PE1 obtains two rows of valid data from PE0 and PE2 respectively for padding. By using the scheduling method in the examples shown in Figures 6B and 6C, it is possible to not only reduce the number of repeated data calculations and improve computing efficiency, but also to reduce the overhead required for data transmission by reducing the number of data transmissions between computing units.
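The difference in the number of transmissions can be sketched by deriving the required copies from a planned distribution of the second convolutional layer. The function name `plan_transfers` is hypothetical, and the row ranges used below for Figures 6B and 6C are assumed values chosen to be consistent with the description above (two rows per transfer); they are not taken from the figures.

```python
def plan_transfers(owned, planned, kernel=3, pad=1, stride=1):
    """Return {(source_unit, destination_unit): [rows]} of first-layer rows that
    each unit must copy in order to compute its planned second-layer rows."""
    in_size = owned[-1][1] + 1
    moves = {}
    for dst, (out_lo, out_hi) in enumerate(planned):
        need_lo = max(out_lo * stride - pad, 0)
        need_hi = min(out_hi * stride - pad + kernel - 1, in_size - 1)
        for row in range(need_lo, need_hi + 1):
            if owned[dst][0] <= row <= owned[dst][1]:
                continue  # the unit already holds this row
            src = next(i for i, (lo, hi) in enumerate(owned) if lo <= row <= hi)
            moves.setdefault((src, dst), []).append(row)
    return moves


owned = [(0, 74), (75, 149), (150, 223)]          # first calculation result sets

fig_6a = plan_transfers(owned, [(0, 74), (75, 149), (150, 223)])
fig_6b = plan_transfers(owned, [(0, 73), (74, 148), (149, 223)])  # assumed ranges
fig_6c = plan_transfers(owned, [(0, 73), (74, 150), (151, 223)])  # assumed ranges

print(len(fig_6a), fig_6a)  # 4 {(1, 0): [75], (0, 1): [74], (2, 1): [150], (1, 2): [149]}
print(len(fig_6b), fig_6b)  # 2 {(0, 1): [73, 74], (1, 2): [148, 149]}
print(len(fig_6c), fig_6c)  # 2 {(0, 1): [73, 74], (2, 1): [150, 151]}
```

Under these assumed ranges the same four rows are moved in every case, but in two transmissions instead of four.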

[0077] For example, a computing device may include more computing units, and a convolutional network may include more convolution operators, and the embodiments of this disclosure are not limited thereto. For example, the example shown in Figure 6D includes four computing units PE0, PE1, PE2, and PE3 and three convolution operators (conv): %5:[(1 224 224 64),F,FP32]=conv(%0:[(1 224 224 64),F,FP32], %1:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1), %10:[(1 224 224 64),F,FP32]=conv(%5:[(1 224 224 64),F,FP32], %6:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1), and %15:[(1 224 224 64),F,FP32]=conv(%10:[(1 224 224 64),F,FP32], %11:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1).

[0078] For example, as shown in Figure 6D, the multiple computing units PE0, PE1, PE2, and PE3 may have different data replication transmission modes. For example, the size of the first intermediate data row acquired by each computing unit from another computing unit may differ, and the embodiments of this disclosure are not limited thereto. For example, PE2 may acquire one first intermediate data row from PE1, while PE3 acquires two first intermediate data rows. The steps of the scheduling method in the example shown in Figure 6D may be described in reference to the explanation in Figure 6A, and a detailed explanation is omitted here. The scheduling method in the example shown in Figure 6D can not only reduce the repetitive calculation of data and improve the utilization rate of computing power, but can also achieve a balanced allocation of internal memory, computing power, and data transmission overhead between different computing units, taking into account the overall performance of the system.

[0079] Figure 7 is a schematic block diagram of a scheduling device according to some embodiments of the present disclosure. As shown in Figure 7, the scheduling device 100 comprises a computation control module 110, an allocation scheduling module 120, and a data transmission module 130. These components may be interconnected via a bus and / or other form of connection mechanism (not shown).

[0080] The computation control module 110 is configured such that multiple computation units each perform a first convolution computation on multiple corresponding datasets to obtain multiple corresponding first computation result sets, where the multiple first computation result sets constitute a first convolution layer obtained by the first convolution computation, and the multiple computation units include a first computation unit and a second computation unit. The allocation scheduling module 120 is configured to determine the data replication transmission mode corresponding to the multiple first calculation result sets in the multiple computing units, according to the placement rules in the multiple computing units for the second convolutional layer obtained by the multiple computing units performing a second convolutional calculation on the first convolutional layer. The data transmission module 130 is configured such that the first computing unit, which among the multiple computing units is to be padded with valid data rows, obtains the first intermediate data rows necessary for padding in the second convolution calculation by the first computing unit from the first calculation result set in the second computing unit, based on the corresponding data replication transmission mode.

[0081] It should be noted that in the embodiments of this disclosure, each module of the scheduling device 100 corresponds to each step of the scheduling method described above, and the specific functions of the scheduling device 100 may be described in the relevant description of the scheduling method described above, and a detailed description is omitted here. The components and structure of the scheduling device 100 shown in Figure 7 are illustrative and not limiting, and the scheduling device 100 may further comprise other components and structures as needed.

[0082] Figure 8 is a schematic block diagram of an electronic device according to some embodiments of the present disclosure. As shown in Figure 8, the electronic device 200 includes a scheduling device 210, which may be a scheduling device according to any one embodiment of the present disclosure, for example, the scheduling device 100. The electronic device 200 may be any device having a computing function, for example, a server, terminal device, personal computer, etc., and the embodiments of the present disclosure are not limited thereto.

[0083] Figure 9 is a schematic block diagram of another electronic device according to some embodiments of the present disclosure. As shown in Figure 9, the electronic device 300 comprises a processor 310 and a memory 320 and may be used to implement a client terminal or a server. The memory 320 is used for non-transitory storage of computer-executable instructions (e.g., one or more computer program modules). The processor 310 is used to execute the computer-executable instructions; when the computer-executable instructions are executed by the processor 310, one or more steps of the scheduling method described above can be performed, thereby realizing the scheduling method described above. The memory 320 and the processor 310 may be interconnected via a bus system and / or other forms of connection mechanisms (not shown).

[0084] For example, the processor 310 may be a central processing unit (CPU), a graphics processing unit (GPU), or another type of processing unit having data processing and / or program execution capabilities. For example, the central processing unit (CPU) may be an X86 or ARM architecture, etc. The processor 310 may be a general-purpose processor or a dedicated processor and can control other components in the electronic device 300 to perform desired functions.

[0085] For example, the memory 320 may include any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and / or cache memory (cache). Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, etc. One or more computer program modules may be stored in the computer-readable storage media, and the processor 310 can realize various functions of the electronic device 300 by executing the one or more computer program modules. Various application programs and various data, including data used and / or generated by the application programs, may also be stored in the computer-readable storage media.

[0086] It should be noted that, in the embodiments of the present disclosure, for the specific functions and technical effects of the electronic device 300, reference may be made to the above description of the scheduling method; a detailed explanation is omitted here.

[0087] Figure 10 is a schematic block diagram of another electronic device relating to some embodiments of the present disclosure. The electronic device 400 is suitable, for example, for implementing the scheduling method relating to the embodiments of the present disclosure. The electronic device 400 may be a terminal device, and may be for realizing a client terminal or a server. The electronic device 400 may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and wearable electronic devices, as well as fixed terminals such as digital TVs, desktop computers, and smart home devices. It should be noted that the electronic device 400 shown in Figure 10 is merely an example and does not limit the functions and scope of use of the embodiments of the present disclosure.

[0088] As shown in Figure 10, the electronic device 400 may include a processing unit (e.g., a central processing unit, a graphics processing unit, etc.) 410, which can perform various appropriate operations and processes based on a program stored in a read-only memory (ROM) 420 or a program loaded from a storage device 480 into a random access memory (RAM) 430. Various programs and data necessary for operation by the electronic device 400 are further stored in the RAM 430. The processing unit 410, ROM 420, and RAM 430 are connected to each other via a bus 440. An input / output (I / O) interface 450 is also connected to the bus 440.

[0089] Generally, input devices 460, including, for example, touch panels, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc., output devices 470, including, for example, liquid crystal displays (LCDs), loudspeakers, vibrators, etc., storage devices 480, including, for example, magnetic tapes, hard disks, etc., and communication devices 490 may be connected to the I / O interface 450. The communication device 490 can allow the electronic device 400 and other electronic devices to exchange data by wireless or wired communication. Figure 10 shows an electronic device 400 with various devices, but it should be understood that the electronic device 400 is not required to implement or have all of the illustrated devices, and may implement or have more or fewer devices instead.

[0090] For example, according to embodiments of the present disclosure, the scheduling method may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product which includes a computer program carried on a non-transitory computer-readable medium, and the computer program includes program code for executing the scheduling method. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 490, installed from the storage device 480, or installed from the ROM 420. When the computer program is executed by the processing unit 410, the functions defined in the scheduling method according to embodiments of the present disclosure can be realized.

[0091] At least one embodiment of the present disclosure further provides a storage medium. By using the storage medium, the utilization rate of the matrix operation unit can be improved, the computational power of the matrix operation unit can be effectively utilized, the time of convolution operations can be reduced, computational efficiency can be improved, and data transmission time can be saved.

[0092] Figure 11 is a schematic diagram of a storage medium according to some embodiments of the present disclosure. For example, as shown in Figure 11, the storage medium 500 may be a non-transitory computer-readable storage medium in which non-transitory computer-readable instructions 510 are stored. When the non-transitory computer-readable instructions 510 are executed by a processor, the scheduling method described in the embodiments of the present disclosure can be realized; for example, one or more steps of the scheduling method described above can be performed.

[0093] For example, the storage medium 500 may be applied to the electronic device described above; for example, the storage medium 500 may include the memory 320 in the electronic device 300.

[0094] For example, the storage medium may include a smartphone memory card, a tablet computer memory component, a personal computer hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), flash memory, any combination of the above storage media, or other applicable storage media.

[0095] For example, the description of the storage medium 500 may refer to the description of memory in the embodiment of the electronic device, and any overlapping explanations will be omitted. The specific functions and technical effects of the storage medium 500 may refer to the description of the scheduling method above, and a detailed explanation will be omitted here.

[0096] The above describes the scheduling method, scheduling device, electronic device, and storage medium according to embodiments of the present disclosure with reference to Figures 1 to 11. The scheduling method according to embodiments of the present disclosure replicates and transmits valid data between computing units so that the valid data serve as padding rows, thereby reducing the amount of repeated data computation in a convolutional neural network and improving the computational efficiency of a multilayer convolutional neural network.
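The idea summarized above can be illustrated with a minimal Python sketch for two computing units. Everything in it is an assumption made for illustration and is not taken from the disclosure: the function names, the use of NumPy, the restriction to a convolution along the row axis with odd kernel lengths of at least 3, and zero padding at the outer border of each layer. Each unit computes its own consecutive, non-overlapping rows of the first layer, then copies the neighbouring unit's boundary rows (valid data) to use as padding for the second layer, so no first-layer row is computed twice.

```python
import numpy as np

def conv_rows(x, w):
    """'Valid' cross-correlation along the row axis of a 2-D array."""
    k = len(w)
    n_out = x.shape[0] - k + 1
    return sum(w[j] * x[j:j + n_out] for j in range(k))

def reference_two_layers(x, w1, w2):
    """Both layers on a single unit, zero-padding the outer border of each layer."""
    h1, h2 = len(w1) // 2, len(w2) // 2
    layer1 = conv_rows(np.pad(x, ((h1, h1), (0, 0))), w1)
    return conv_rows(np.pad(layer1, ((h2, h2), (0, 0))), w2)

def two_unit_schedule(x, w1, w2):
    """Two layers on two units; layer-1 boundary rows are copied between units."""
    h1, h2 = len(w1) // 2, len(w2) // 2   # assumes odd kernel lengths >= 3
    mid = x.shape[0] // 2
    xp = np.pad(x, ((h1, h1), (0, 0)))    # zero rows only at the outer border

    # First convolution: each unit produces only its own rows of layer 1.
    r0 = conv_rows(xp[:mid + 2 * h1], w1)  # rows 0 .. mid-1 of layer 1
    r1 = conv_rows(xp[mid:], w1)           # rows mid .. end of layer 1

    # Replicate valid boundary rows between the units (hypothetical names).
    rows_for_unit0 = r1[:h2].copy()        # taken from unit 1's result set
    rows_for_unit1 = r0[-h2:].copy()       # taken from unit 0's result set
    zeros = np.zeros((h2, x.shape[1]))     # outer border still uses zeros

    # Second convolution: own rows plus the copied rows as padding.
    out0 = conv_rows(np.vstack([zeros, r0, rows_for_unit0]), w2)
    out1 = conv_rows(np.vstack([rows_for_unit1, r1, zeros]), w2)
    return np.vstack([out0, out1])

x = np.arange(24, dtype=float).reshape(6, 4)
same = np.allclose(two_unit_schedule(x, [1.0, 2.0, 1.0], [1.0, 0.0, -1.0]),
                   reference_two_layers(x, [1.0, 2.0, 1.0], [1.0, 0.0, -1.0]))
print(same)
```

In this sketch the partitioned result matches the single-unit reference, while each unit's second-layer output has the same number of rows as its first-layer result.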

[0097] It should be noted that, in the context of the present disclosure, a computer-readable medium may be a tangible medium that contains or stores a program used by or in combination with an instruction execution system, apparatus, or device. A computer-readable medium may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to, an electrical connection having one or more wires, a portable computer magnetic disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in combination with an instruction execution system, apparatus, or device. Furthermore, in the present disclosure, a computer-readable signal medium may include a data signal propagating in baseband or as part of a carrier wave, and such a data signal carries computer-readable program code. Such a propagating data signal may take various forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and such a computer-readable signal medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained in the computer-readable medium may be transmitted via any suitable medium, including, but not limited to, wires, optical cables, RF (radio frequency), or any suitable combination thereof.

[0098] In some embodiments, client terminals and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected by digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.

[0099] The computer-readable medium described above may be included in the electronic device described above, or may exist separately without being assembled into the electronic device.

[0100] Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof. Such programming languages include, but are not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. If a remote computer is involved, the remote computer may be connected to the user's computer via any type of network (including a local area network (LAN) or a wide area network (WAN)), or it may be connected to an external computer (for example, via the Internet using an Internet service provider).

[0101] The flowcharts and block diagrams in the drawings illustrate the implementable system architectures, functions, and operations of systems, methods, and computer program products relating to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagram may represent a module, program segment, or portion of code, which includes one or more executable instructions for implementing a defined logical function. It should be noted that in some substitute implementations, the functions represented in the blocks may be implemented in a different order than those indicated in the drawings. For example, two consecutively shown blocks may actually be executed essentially simultaneously, or they may be executed in reverse order, depending on the functions involved. It should be noted that each block in the block diagram and / or flowchart, and combinations of blocks in the block diagram and / or flowchart, may be implemented by a dedicated hardware-based system that performs the defined function or operation, or by a combination of dedicated hardware and computer instructions.

[0102] The units relating to the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.

[0103] The functions described above in this specification may be performed at least partially by one or more hardware logic components. For example, exemplary hardware logic components that may be used include, but are not limited to, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), and the like.

[0104] According to one or more embodiments of the present disclosure, Example 1 provides a scheduling method for a multilayer convolutional neural network, comprising: a plurality of computing units each performing a first convolution calculation on a plurality of corresponding datasets to obtain a plurality of corresponding first calculation result sets, wherein the plurality of first calculation result sets constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units include a first computing unit and a second computing unit; determining a data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units, according to an arrangement rule in the plurality of computing units for a second convolutional layer that the plurality of computing units are to obtain by performing a second convolution calculation on the first convolutional layer; and the first computing unit, which among the plurality of computing units is to be padded with valid data rows, obtaining, from the first calculation result set in the second computing unit and based on the corresponding data replication transmission mode, a first intermediate data row necessary for padding in the second convolution calculation performed by the first computing unit.

[0105] In Example 2, based on the scheduling method described in Example 1, the data rows in the multiple first calculation result sets are consecutive in the first convolutional layer without overlapping.

[0106] In Example 3, based on the scheduling method described in Example 1, the first calculation result set in the first computing unit and the first intermediate data row obtained from the second computing unit constitute the dataset necessary for the first computing unit to perform the second convolution calculation.

[0107] In Example 4, based on the scheduling method described in Example 1, the corresponding datasets used by each of the multiple computing units when performing the first convolution calculation are multiple initial input datasets.

[0108] In Example 5, based on the scheduling method described in Example 4, the method further includes decomposing an initial input matrix to obtain the plurality of initial input datasets, and transmitting each of the plurality of initial input datasets to the plurality of computing units to perform the first convolution calculation.
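As a minimal illustration of the decomposition in Example 5, the following Python sketch splits an initial input matrix into consecutive, non-overlapping row blocks, one per computing unit. The function name `decompose_rows` and the use of NumPy's `array_split` are assumptions made for illustration; the disclosure does not prescribe a particular split, and any border context rows that the first convolution itself may need are ignored here.

```python
import numpy as np

def decompose_rows(initial_input, num_units):
    """Split a 2-D input into consecutive, non-overlapping row blocks.

    Hypothetical helper: block i would be sent to computing unit i.
    """
    return np.array_split(initial_input, num_units, axis=0)

initial_input = np.arange(40).reshape(10, 4)
blocks = decompose_rows(initial_input, 3)
print([block.shape[0] for block in blocks])  # row counts per unit: [4, 3, 3]
```

Stacking the blocks in unit order reproduces the original matrix, which corresponds to the requirement that the data rows be consecutive without overlapping.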

[0109] In Example 6, based on the scheduling method described in Example 1, determining the data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units, according to the arrangement rule in the plurality of computing units for the second convolutional layer that the plurality of computing units are to obtain by performing the second convolution calculation on the first convolutional layer, includes: determining, based on the size of the internal memory in the plurality of computing units, the arrangement rule for the second convolutional layer in the plurality of computing units; determining the distribution of the plurality of computing units in the second convolutional layer according to the arrangement rule; and determining the data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units according to the distribution.
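The three determinations in Example 6 could be sketched as follows. This is only one possible reading under loudly stated assumptions: the disclosure does not give a formula for the arrangement rule, so the rule used here (rows per unit limited by memory after reserving room for padding rows), the names `arrangement_rule`, `distribution`, `replication_mode`, and `Transfer`, and the assumption that every block holds at least `halo` rows are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transfer:
    """One copy operation: rows of the first layer sent from src to dst."""
    src_unit: int
    dst_unit: int
    rows: range  # global row indices in the first convolutional layer

def arrangement_rule(row_bytes, unit_memory_bytes, halo):
    """Assumed rule: own rows plus 2 * halo padding rows must fit in memory."""
    rows_per_unit = unit_memory_bytes // row_bytes - 2 * halo
    if rows_per_unit <= 0:
        raise ValueError("unit memory too small for one row plus padding")
    return rows_per_unit

def distribution(total_rows, rows_per_unit):
    """Consecutive, non-overlapping row ranges, one per computing unit."""
    return [range(start, min(start + rows_per_unit, total_rows))
            for start in range(0, total_rows, rows_per_unit)]

def replication_mode(dist, halo):
    """Neighbouring units exchange `halo` boundary rows in both directions."""
    transfers = []
    for unit in range(len(dist) - 1):
        lower, upper = dist[unit], dist[unit + 1]
        # The lower unit receives the first rows of the upper unit's block.
        transfers.append(Transfer(unit + 1, unit,
                                  range(upper.start, upper.start + halo)))
        # The upper unit receives the last rows of the lower unit's block.
        transfers.append(Transfer(unit, unit + 1,
                                  range(lower.stop - halo, lower.stop)))
    return transfers

rows_per_unit = arrangement_rule(row_bytes=64, unit_memory_bytes=384, halo=1)
dist = distribution(total_rows=8, rows_per_unit=rows_per_unit)
for transfer in replication_mode(dist, halo=1):
    print(transfer)
```

Keeping the three steps as separate functions mirrors the order in the text: memory size first, then the distribution, then the transfers derived from that distribution.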

[0110] In Example 7, based on the scheduling method described in Example 1, obtaining, from the first calculation result set in the second computing unit, the first intermediate data row necessary for padding in the second convolution calculation performed by the first computing unit includes: determining the size of the first intermediate data row according to the relationship between the second calculation result set that the first computing unit is to obtain by performing the second convolution calculation and the first calculation result set obtained by the first computing unit by performing the first convolution calculation; and obtaining the first intermediate data row from the first calculation result set in the second computing unit based on the data replication transmission mode and the size of the first intermediate data row.

[0111] In Example 8, based on the scheduling method described in Example 7, the size of the second calculation result set that the first computing unit is to obtain is the same as the size of the first calculation result set in the first computing unit.
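Examples 7 and 8 can be read as simple row arithmetic. The Python sketch below assumes a convolution with kernel height `kernel` and stride `stride` along the row axis and no other padding; the function names and these parameters are hypothetical and do not appear in the disclosure. Under that assumption, producing `out_rows` output rows requires `(out_rows - 1) * stride + kernel` input rows, and the shortfall relative to the rows a unit already holds is the number of intermediate rows it must obtain from other units.

```python
def rows_needed(out_rows, kernel, stride=1):
    """Input rows required to produce `out_rows` rows without extra padding."""
    return (out_rows - 1) * stride + kernel

def intermediate_rows(own_rows, out_rows, kernel, stride=1):
    """Rows a unit must obtain from other units, in total over both sides."""
    return max(rows_needed(out_rows, kernel, stride) - own_rows, 0)

# Example 8 case: the second result set has the same size as the first.
print(intermediate_rows(own_rows=4, out_rows=4, kernel=3))  # 2
print(intermediate_rows(own_rows=4, out_rows=4, kernel=5))  # 4
```

With stride 1 and equal input and output sizes, the total is `kernel - 1` rows; for a unit with neighbours on both sides and an odd kernel, that would be half from each neighbour.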

[0112] In Example 9, based on the scheduling method described in Example 7, the method further includes, in response to the second convolutional layer being the output layer of the multilayer convolutional neural network, obtaining the computational output of at least a portion of the multilayer convolutional neural network by collecting a plurality of second calculation result sets from the plurality of computing units.

[0113] In Example 10, based on the scheduling method described in any one of Examples 1 to 9 above, the method further includes the second computing unit obtaining, from the first calculation result set in the first computing unit and based on the corresponding data replication transmission mode, a second intermediate data row necessary for padding in the second convolution calculation performed by the second computing unit.

[0114] In Example 11, based on the scheduling method described in Example 10, the first intermediate data row and the second intermediate data row that the first computing unit and the second computing unit acquire from each other have the same size.

[0115] According to one or more embodiments of the present disclosure, Example 12 provides a scheduling device for a multilayer convolutional neural network, comprising: a computation control module configured such that a plurality of computing units each perform a first convolution calculation on a plurality of corresponding datasets to obtain a plurality of corresponding first calculation result sets, wherein the plurality of first calculation result sets constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units include a first computing unit and a second computing unit; an allocation scheduling module configured to determine a data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units, according to an arrangement rule in the plurality of computing units for a second convolutional layer that the plurality of computing units are to obtain by performing a second convolution calculation on the first convolutional layer; and a data transmission module configured such that the first computing unit, which among the plurality of computing units is to be padded with valid data rows, obtains, from the first calculation result set in the second computing unit and based on the corresponding data replication transmission mode, a first intermediate data row necessary for padding in the second convolution calculation performed by the first computing unit.

[0116] In Example 13, based on the scheduling device described in Example 12, the data rows in the plurality of first calculation result sets are consecutive in the first convolutional layer without overlapping.

[0117] In Example 14, based on the scheduling device described in Example 12, the first calculation result set in the first computing unit and the first intermediate data row obtained from the second computing unit constitute a dataset necessary for the first computing unit to perform the second convolution calculation.

[0118] In Example 15, based on the scheduling device described in Example 12, the corresponding datasets used by each of the multiple computing units when performing the first convolution calculation are multiple initial input datasets.

[0119] In Example 16, based on the scheduling device described in Example 15, the scheduling device further includes a data decomposition module configured to decompose an initial input matrix to obtain the plurality of initial input datasets, and to transmit each of the plurality of initial input datasets to the plurality of computing units to perform the first convolution calculation.

[0120] In Example 17, based on the scheduling device described in Example 12, determining the data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units, according to the arrangement rule in the plurality of computing units for the second convolutional layer that the plurality of computing units are to obtain by performing the second convolution calculation on the first convolutional layer, includes: determining, based on the size of the internal memory in the plurality of computing units, the arrangement rule for the second convolutional layer in the plurality of computing units; determining the distribution of the plurality of computing units in the second convolutional layer according to the arrangement rule; and determining the data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units according to the distribution.

[0121] In Example 18, based on the scheduling device described in Example 12, obtaining, from the first calculation result set in the second computing unit, the first intermediate data row necessary for padding in the second convolution calculation performed by the first computing unit includes: determining the size of the first intermediate data row according to the relationship between the second calculation result set that the first computing unit is to obtain by performing the second convolution calculation and the first calculation result set obtained by the first computing unit by performing the first convolution calculation; and obtaining the first intermediate data row from the first calculation result set in the second computing unit based on the data replication transmission mode and the size of the first intermediate data row.

[0122] In Example 19, based on the scheduling device described in Example 18, the size of the second calculation result set that the first computing unit is to obtain is the same as the size of the first calculation result set in the first computing unit.

[0123] In Example 20, based on the scheduling device described in Example 18, the scheduling device further includes a data output module configured to, in response to the second convolutional layer being the output layer of the multilayer convolutional neural network, obtain the computational output of at least a portion of the multilayer convolutional neural network by collecting a plurality of second calculation result sets from the plurality of computing units.

[0124] In Example 21, based on the scheduling device described in any one of Examples 12 to 20, the data transmission module is further configured such that the second computing unit obtains, from the first calculation result set in the first computing unit and based on the corresponding data replication transmission mode, a second intermediate data row necessary for padding in the second convolution calculation performed by the second computing unit.

[0125] In Example 22, based on the scheduling device described in Example 21, the first intermediate data row and the second intermediate data row that the first computing unit and the second computing unit acquire from each other have the same size.

[0126] According to one or more embodiments of the present disclosure, Example 23 provides an electronic device comprising a scheduling device as described in any one of Examples 12 to 22.

[0127] According to one or more embodiments of the present disclosure, Example 24 provides an electronic device comprising a processor and a memory containing at least one computer program module, wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is for implementing the scheduling method described in any one of Examples 1 to 11.

[0128] According to one or more embodiments of the present disclosure, Example 25 provides a storage medium on which non-transitory computer-readable instructions are stored, and when the non-transitory computer-readable instructions are executed by a computer, the scheduling method described in any one of Examples 1 to 11 is realized.

[0129] Although the present disclosure has already been described in detail through general descriptions and specific embodiments, it will be obvious to those skilled in the art that several modifications or improvements can be made based on the embodiments of the present disclosure. Accordingly, any such modifications or improvements made without departing from the spirit of the present disclosure will fall within the scope of the claims of the present disclosure.

[0130] The following points should be noted regarding the present disclosure.

[0131] (1) The drawings of the embodiments of the present disclosure relate only to the structures relating to the embodiments of the present disclosure, and other structures may refer to conventional designs.

[0132] (2) For clarity, in the drawings illustrating embodiments of the present disclosure, the thickness of layers or regions is enlarged or reduced, i.e., these drawings are not drawn to actual proportions.

[0133] (3) As long as there is no conflict, new embodiments can be obtained by combining the embodiments and features of the embodiments of this disclosure with each other.

[0134] The above description is merely a specific embodiment of the present disclosure, but the scope of protection of the present disclosure is not limited thereto, and the scope of protection of the present disclosure should be in accordance with the scope of protection of the claims described above.

Claims

1. A scheduling method for a multilayer convolutional neural network, comprising: a plurality of computing units each performing a first convolution calculation on a plurality of corresponding datasets to obtain a plurality of corresponding first calculation result sets, wherein the plurality of first calculation result sets constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units include a first computing unit and a second computing unit; determining a data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units, according to an arrangement rule in the plurality of computing units for a second convolutional layer that the plurality of computing units intend to obtain by performing a second convolution calculation on the first convolutional layer; and the first computing unit, which among the plurality of computing units is to be padded with valid data rows, obtaining, from the first calculation result set in the second computing unit and based on the corresponding data replication transmission mode, a first intermediate data row necessary for padding in the second convolution calculation performed by the first computing unit.

2. The scheduling method according to claim 1, wherein the data rows in the plurality of first calculation result sets are consecutive in the first convolutional layer without overlapping.

3. The scheduling method according to claim 1, wherein the first calculation result set in the first computing unit and the first intermediate data row obtained from the second computing unit constitute a dataset necessary for the first computing unit to perform the second convolution calculation.

4. The scheduling method according to claim 1, wherein the corresponding datasets used by each of the plurality of computing units when performing the first convolution calculation are a plurality of initial input datasets.

5. The scheduling method according to claim 4, further comprising decomposing an initial input matrix to obtain a plurality of initial input datasets, and transmitting each of the plurality of initial input datasets to the plurality of computing units to perform the first convolution calculation.

6. The scheduling method according to claim 1, wherein determining the data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units, according to the arrangement rule in the plurality of computing units for the second convolutional layer that the plurality of computing units intend to obtain by performing the second convolution calculation on the first convolutional layer, comprises: determining, based on the size of the internal memory in the plurality of computing units, the arrangement rule for the second convolutional layer in the plurality of computing units; determining the distribution of the plurality of computing units in the second convolutional layer according to the arrangement rule; and determining the data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units according to the distribution.

7. The scheduling method according to claim 1, wherein obtaining, from the first calculation result set in the second computing unit, the first intermediate data row necessary for padding in the second convolution calculation performed by the first computing unit comprises: determining the size of the first intermediate data row according to the relationship between the second calculation result set that the first computing unit intends to obtain by performing the second convolution calculation and the first calculation result set obtained by the first computing unit by performing the first convolution calculation; and obtaining the first intermediate data row from the first calculation result set in the second computing unit based on the data replication transmission mode and the size of the first intermediate data row.

8. The scheduling method according to claim 7, wherein the size of the second calculation result set that the first computing unit intends to obtain is the same as the size of the first calculation result set in the first computing unit.

9. The scheduling method according to claim 7, further comprising, in response to the second convolutional layer being the output layer of the multilayer convolutional neural network, obtaining the computational output of at least a portion of the multilayer convolutional neural network by collecting a plurality of second calculation result sets from the plurality of computing units.

10. The scheduling method according to claim 1, further comprising the second computing unit obtaining a second intermediate data row necessary for padding in the second convolution calculation process by the second computing unit from the first calculation result set in the first computing unit based on the corresponding data replication transmission mode.

11. The scheduling method according to claim 10, wherein the first intermediate data row and the second intermediate data row that the first computing unit and the second computing unit acquire from each other have the same size.

12. A scheduling device for a multilayer convolutional neural network, comprising: a computation control module configured such that a plurality of computing units each perform a first convolution calculation on a plurality of corresponding datasets to obtain a plurality of corresponding first calculation result sets, wherein the plurality of first calculation result sets constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units include a first computing unit and a second computing unit; an allocation scheduling module configured to determine a data replication transmission mode corresponding to the plurality of first calculation result sets in the plurality of computing units, according to an arrangement rule in the plurality of computing units for a second convolutional layer that the plurality of computing units intend to obtain by performing a second convolution calculation on the first convolutional layer; and a data transmission module configured such that the first computing unit, which among the plurality of computing units is to be padded with valid data rows, obtains, from the first calculation result set in the second computing unit and based on the corresponding data replication transmission mode, a first intermediate data row necessary for padding in the second convolution calculation performed by the first computing unit.

13. An electronic device comprising the scheduling device described in claim 12.

14. An electronic device comprising: a processor; and a memory containing at least one computer program module, wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is for realizing the scheduling method according to any one of claims 1 to 11.

15. A storage medium storing non-transitory computer-readable instructions, wherein, when the non-transitory computer-readable instructions are executed by a computer, the scheduling method according to any one of claims 1 to 11 is realized.