Data processing method for neural networks
By evaluating and selecting appropriate candidate batch sizes, the interlayer relationships of the neural network are optimized, solving the problem of low overall computational density in the neural network model and improving computational performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- MONTAGE TECHNOLOGY CO LTD
- Filing Date
- 2023-02-10
- Publication Date
- 2026-06-16
AI Technical Summary
Existing neural network models have low overall computational density, and existing roofline models fail to effectively optimize inter-layer relationships, thus limiting the improvement of computational performance.
The method determines the actual storage locations of the input and output data of each layer of the neural network, evaluates the memory access corresponding to different batch size candidate values, and selects an appropriate batch size, so that batch processing optimizes computational density and memory access and improves computational performance.
It increases the overall computational density of the neural network, improves computational performance, optimizes inter-layer relationships, and enhances the computational performance of the computing platform.
[Figure: abstract drawing, CN118485117B_ABST]
Description
Technical Field
[0001] This application relates to neural network technology, and more specifically, to a data processing method for neural networks.
Background Technology
[0002] In artificial intelligence applications, neural networks are a commonly used method. The computational performance of a computing platform is typically determined by the data, the neural network model, and the hardware conditions. Theoretically, the maximum computational performance of a computing platform can be evaluated using the roofline model. According to the roofline model, when the computing platform is in the memory-bounded (communication-bounded) region, its performance can be improved by increasing the computational density (CCR), i.e., the computational load divided by the memory access load; conversely, when the computing platform is in the computation-bounded region, its performance can be improved by increasing its computing power. Generally, computing power resources are more abundant than storage resources; therefore, computing platforms are more likely to be in the memory-bounded region, and the computational density (CCR) must be increased to improve performance.
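The roofline bound described above can be written as a one-line function. The following is a minimal sketch and not part of the application: the function name, the parameter names, and the platform figures (100 GFLOP/s peak compute, 10 GB/s memory bandwidth) are hypothetical values chosen only for illustration.

```python
def attainable_performance(peak_flops: float,
                           bandwidth_bytes_per_s: float,
                           density_flops_per_byte: float) -> float:
    """Roofline bound: the lower of peak compute and bandwidth * computational density."""
    return min(peak_flops, bandwidth_bytes_per_s * density_flops_per_byte)


# Hypothetical platform figures (assumptions, not taken from the application).
PEAK = 100e9  # FLOP/s
BW = 10e9     # bytes/s

# Low computational density: the bandwidth term is the limit (memory-bounded).
low_density = attainable_performance(PEAK, BW, 2.0)
# High computational density: peak compute is the limit (computation-bounded).
high_density = attainable_performance(PEAK, BW, 20.0)
```

In the memory-bounded case the result scales with computational density, which is why the rest of the description concentrates on reducing memory access per unit of computation.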
[0003] Different neural network models have different computational densities in each layer, and uneven computational density can significantly impact the overall computational performance of the neural network model. Current roofline models typically focus only on modeling and optimizing each individual layer, neglecting the overall inter-layer relationships. However, optimizing individual layers does not necessarily lead to improved overall computational performance.
[0004] Therefore, an improved data processing method suitable for neural networks is needed.
Summary of the Invention
[0005] One objective of this application is to provide a data processing method for neural networks to address the problem of low overall computational density in existing neural network models.
[0006] In one aspect of this application, a data processing method for a neural network is provided. The neural network is implemented by a computing device and includes multiple layers. The computing device includes a first memory and a second memory. The weight data of each layer is stored in the second memory. The input data and output data of each layer each have a preset storage location, wherein the preset storage location is either the first memory or the second memory. The method includes: batch processing the input data and weight data of each layer with a predetermined batch size; wherein the predetermined batch size is determined by: determining, based on the preset storage locations corresponding to the input data and output data of each layer of the multiple layers and the relationship between the required storage space and the storage space of the preset storage locations, the actual storage locations of the input data and output data of each layer when processing data in batches using different batch size candidate values; determining, based on the determined actual storage locations, the memory access status of each layer of the multiple layers to the second memory when processing data in batches using different batch size candidate values; determining, based on the memory access status corresponding to the different batch size candidate values, the total memory access amount of the neural network corresponding to the different batch size candidate values; and selecting, based on the total memory access amount of the neural network corresponding to the different batch size candidate values and a predetermined selection rule, the predetermined batch size from the different batch size candidate values.
[0007] In another aspect of this application, a method is provided for determining the batch size of a neural network for batch processing data. The neural network is implemented by a computing device and includes multiple layers. The computing device includes a first memory and a second memory. The weight data of each layer is stored in the second memory. The input data and output data of each layer have separate preset storage locations, where the preset storage locations are either the first memory or the second memory. The method includes: determining the actual storage locations of the input data and output data of each layer when batch processing data with different batch size candidate values, based on the preset storage locations corresponding to the input data and output data of each layer and the relationship between the required storage space and the storage space of the preset storage locations; determining the memory access status of each layer of the multiple layers to the second memory when batch processing data with different batch size candidate values, based on the determined actual storage locations; determining the total memory access of the neural network corresponding to different batch size candidate values based on the memory access status corresponding to the different batch size candidate values; and selecting a predetermined batch size from the different batch size candidate values based on the total memory access of the neural network corresponding to the different batch size candidate values and a predetermined selection rule.
[0008] The above is an overview of this application, and there may be simplifications, generalizations, and omissions of details. Therefore, those skilled in the art should recognize that this section is merely illustrative and not intended to limit the scope of this application in any way. This overview section is neither intended to identify the key or essential features of the claimed subject matter nor to serve as an aid in determining the scope of the claimed subject matter.
Attached Figure Description
[0009] The above and other features of this application will become more fully clear through the following description and appended claims, in conjunction with the accompanying drawings. It is understood that these drawings depict only a few embodiments of the application and should not be construed as limiting the scope of the application. The application will be described more clearly and in more detail through the use of the drawings.
[0010] Figure 1 shows a schematic diagram of the data flow of a computing device.
[0011] Figure 2 shows a flowchart of a method for determining the batch size with which a neural network processes data in batches, according to an embodiment of this application.
[0012] Figures 3a to 6b show schematic diagrams of a neural network accessing data with and without batch processing, according to several embodiments of this application.
Detailed Implementation
[0013] In the following detailed description, reference is made to the accompanying drawings, which form a part thereof. In the drawings, similar symbols generally denote similar components unless the context otherwise requires. The illustrative embodiments described in the detailed description, drawings, and claims are not intended to be limiting. Other embodiments and variations may be employed without departing from the spirit or scope of the subject matter of this application. It will be understood that various different configurations, substitutions, combinations, and designs can be made to the various aspects of the general description and illustrated in the drawings of this application, all of which explicitly form part of the subject matter of this application.
[0014] In the embodiments of this application, the data to be processed by the neural network refers to the raw data processed by the computing device. The data to be processed may include multiple batches of data samples, and each batch may include multiple data samples (the case where each batch includes one data sample is equivalent to processing individual data samples one by one). A scenario for using a neural network for data processing is, for example, that the first layer of the neural network uses the weight parameters of the first layer to process a batch of data samples of the data to be processed, which constitutes the input data of the first layer, to obtain the output data of the first layer; thereafter, all or part of the output data of the first layer is used as the input data of the second layer, and processed according to the weight parameters corresponding to the second layer, to obtain the output data of the second layer; and so on for other layers. Furthermore, this application does not limit whether the neural network has a branching structure; for example, ResNet, Inception v3, etc., which have branching structures, or VGG (Visual Geometry Group) models without branching structures, are all within the scope of protection of this application.
[0015] Figure 1 shows a schematic diagram of a computing device 100 according to one embodiment of this application. In some embodiments, the computing device 100 can be used to implement data processing by a neural network model; specifically, it can implement training of a neural network model configured on the computing device 100, and it can also implement inference on data by a neural network algorithm configured on the computing device 100. As shown in Figure 1, the computing device 100 may include a computing module 102 for performing data operations, a first memory 104 connected to the computing module 102, and a second memory 106. The first memory 104 may be integrated with the computing module 102, and the first memory 104 and the second memory 106 are connected to each other and each has a certain amount of storage space. Depending on the specific circumstances, at least a portion of the first memory 104 and/or the second memory 106 may be used to store input data or output data. At least a portion of the first memory 104 and/or the second memory 106 may also be used to store other data, such as weight parameters required for neural network operations. It is understood that because the first memory 104 is integrated with the computing module 102, the data interaction rate between the first memory 104 and the computing module 102 is higher than that between the second memory 106 and the computing module 102. Therefore, data can be preferentially stored in the first memory 104, and stored in the second memory 106 only when the first memory 104 does not have sufficient storage space. It is understood that in some embodiments the first memory 104 may not be integrated with the computing module 102; even so, the first memory 104 still has a higher priority than the second memory 106.
[0016] Continuing to refer to Figure 1, the computing device 100 may further include a control module 108, which can control the computing module 102, the first memory 104, and the second memory 106. Specifically, the control module 108 can control the data interaction among the computing module 102, the first memory 104, and the second memory 106, or between any two of them. For example, the control module 108 can use one control signal to control the integrated computing module 102 and first memory 104, and use another control signal to control the second memory 106.
[0017] It should be noted that the above description illustrates only one exemplary architecture of the computing device 100. Depending on the specific embodiment, the computing device may employ other structures. In some embodiments, the computing module 102 and the first memory 104 are also integrated with other modules. In other embodiments, the control module 108 controls the computing module 102, the first memory 104, and the second memory 106 using different control signals. In yet other embodiments, the computing module 102, the first memory 104, and the second memory 106 may be integrated together, or all three may directly transmit data to one another; however, the memory access volume of the computing system still refers to the amount of data interaction resulting from accesses to the second memory.
[0018] In some embodiments, the neural network performs operations on the data to be processed in the computing device 100. Specifically, the neural network includes multiple layers; for example, a convolutional neural network may include convolutional layers, pooling layers, activation layers, fully connected layers, etc. Some layers of the neural network have weight parameters, such as convolutional layers and fully connected layers. Taking a convolutional layer as an example, each convolutional layer processes the input data with a set of weight parameters to obtain output data. The weight parameters involved in each convolutional layer are generally stored in a second memory, while the storage locations of the input and output data may vary.
[0019] In one embodiment, when the computing module 102 of the computing device 100 performs calculations, the input data of the computing module 102 can come from the first memory 104 or the second memory 106, and the output data can also be stored in the first memory 104 or the second memory 106. The weight parameters are stored in the second memory 106, and when calculating at each layer, the weight data of that layer can be loaded from the second memory 106 to the first memory 104 so that the computing module can obtain the weight data from the first memory 104 for calculation.
[0020] The source of the input data can be described as follows. In some embodiments, the input data may all come from the first memory 104. In other embodiments, the input data may come from the second memory 106. When the input data needs to be processed, it can be loaded from the second memory 106 to the first memory 104 for the computing module 102 to perform calculations. During loading, the data can be loaded from the second memory 106 to the first memory 104 all at once, or it can be loaded in portions.
[0021] The storage of output data is similar to that of input data, but the path is reversed. For example, in some embodiments, all output data can be stored in the first memory 104, while in other embodiments, the output data produced by the computing module 102 is output to the first memory 104 and then stored from the first memory 104 to the second memory 106.
[0022] It is understood that the input data of the first layer of the neural network may come from the second memory 106, and the input data of other layers may come from the first memory 104 or the second memory 106, as described above.
[0023] In some embodiments, the computing module 102 may be a processing element (PE) array in an artificial intelligence (AI) accelerator, the first memory 104 may be on-chip memory of the computing module 102, and the second memory 106 may be off-chip memory. In some embodiments, the first memory 104 is static random access memory (SRAM), and the second memory 106 is double data rate synchronous dynamic random access memory (DDR). Generally speaking, on-chip memory has a smaller storage space but a faster speed; off-chip memory has a larger storage space but a slower speed. For ease of explanation, on-chip memory and off-chip memory will be used below to represent the first memory 104 and the second memory 106, respectively, but this is not intended to limit the scope of this application.
[0024] It should be noted that, due to the relatively slow access speed of off-chip memory, accessing off-chip memory takes a considerable amount of time, significantly longer than accessing on-chip memory. Therefore, in some embodiments of this application, the calculation of memory access only considers accesses to off-chip memory, and not accesses to on-chip memory.
[0025] The inventors of this application discovered that, considering the overall inter-layer relationships, the overall computational density (the number of computations that can be performed per byte of memory exchange) of a neural network model is related to the batch size. Increasing the batch size can improve computational density, thereby achieving higher overall computational performance. However, in actual computation, on-chip memory space is limited. Excessively increasing the batch size may require input and output data to be stored in off-chip memory, leading to increased access to off-chip memory. This increased access to off-chip memory may, in turn, reduce computational density. Therefore, selecting a suitable batch size becomes crucial. To address this issue, the inventors of this application propose a method for evaluating and selecting candidate batch size values that balances the increased computational density and increased access to memory brought about by increasing the batch size.
[0026] Figure 2 shows a flowchart for evaluating different batch size candidate values according to one embodiment of this application.
[0027] As shown in Figure 2, in step 202, the actual storage locations of the input and output data of each layer when processing data in batches using different batch size candidate values are determined, based on the preset storage locations corresponding to the input and output data of each layer of the neural network and on the relationship between the required storage space and the storage space of the preset storage locations. In step 204, based on the determined actual storage locations, the memory access situation of each layer of the neural network when processing data in batches using different batch size candidate values is determined. In step 206, based on the memory access situation corresponding to the different batch size candidate values, the total memory access amount of the neural network corresponding to the different batch size candidate values is determined. In step 208, based on the total memory access amount of the neural network corresponding to the different batch size candidate values and a predetermined selection rule, a predetermined batch size is selected from the different batch size candidate values. In some embodiments, the selected predetermined batch size maximizes the computational density of the neural network for data processing. It can be understood that a batch size corresponding to a relatively good (rather than the maximum) computational density can also be selected. For example, other properties associated with different batch sizes, such as computational latency, can also be considered when selecting a batch size.
[0028] As mentioned above, each layer's input and output data has a preset storage location, which represents the target location for data storage. However, when processing data with different batch sizes, the storage space required for input and output data varies. Generally speaking, the larger the batch size, the larger the storage space required for input and output data. The storage space of the preset storage location may not be sufficient to meet the storage space required for input and output data under the current batch size. In this case, it is necessary to adjust the storage location of input and output data. That is, the preset storage location of input and output data may not be the actual storage location of input and output data.
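The adjustment from preset to actual storage location can be sketched as a capacity check. This is a simplified illustration under stated assumptions, not the application's implementation: the names `actual_location`, `ON_CHIP`, and `OFF_CHIP`, the byte sizes, and the idea of a single on-chip capacity budget per tensor are all hypothetical.

```python
ON_CHIP, OFF_CHIP = "on_chip", "off_chip"


def actual_location(preset: str, bytes_per_sample: int,
                    batch_size: int, on_chip_capacity: int) -> str:
    """Return where a layer's input or output is treated as living for a batch size.

    Assumption: data preset to off-chip memory stays there; data preset to
    on-chip memory stays on chip only if the whole batch fits in the budget.
    """
    if preset == OFF_CHIP:
        return OFF_CHIP
    required = bytes_per_sample * batch_size
    return ON_CHIP if required <= on_chip_capacity else OFF_CHIP


# Hypothetical sizes: 1 KiB per sample, 8 KiB of on-chip space for this tensor.
small_batch = actual_location(ON_CHIP, 1024, 4, 8192)
large_batch = actual_location(ON_CHIP, 1024, 16, 8192)
```

With these numbers the batch of 4 stays on chip, while the batch of 16 exceeds the budget and is treated as off-chip data.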
[0029] Based on the preset storage locations of the input and output data corresponding to a layer in a neural network, as well as the storage space sizes of on-chip and off-chip memory, the memory access patterns of that layer to the off-chip memory can be determined, thereby calculating the memory access volume of that layer to the off-chip memory. The memory access patterns of a layer to the off-chip memory refer to the data interaction between that layer and the off-chip memory when processing data. Specifically, the calculation of memory access volume can be broadly divided into the following four scenarios.
[0030] Case 1: Both the input data and the output data are preset to be stored in off-chip memory.
[0031] Figures 3a and 3b illustrate, with and without batch processing respectively, a layer whose input data 301a (or 301b) and output data 302a (or 302b) are both preset to be stored in off-chip memory. Figure 3a shows the memory access of one layer when N data samples, corresponding to a batch size candidate value N, are processed through the neural network. The memory access volume of this layer comprises the memory access volume of the weight parameter 303, the memory access volume of the input data 301a corresponding to the N data samples, and the memory access volume of the output data 302a, as expressed by equation (1):
[0032] $D_{\text{batch\_layer}} = N \cdot (D_i + D_o) + D_W$    Equation (1)
[0033] Here, $D_{\text{batch\_layer}}$ is the memory access volume of one layer when processing data with a batch size of N, where N is a positive integer greater than 1; $D_i$ is the memory access volume for reading the input data corresponding to one data sample in this layer; $D_o$ is the memory access volume for storing the output data corresponding to one data sample in this layer; and $D_W$ is the memory access volume for reading the weight parameter 303 of this layer. In equation (1), since the data is processed in batches, the weight parameter needs to be read only once for a batch of size N, with a memory access volume of $D_W$; reading the input data corresponding to the N data samples requires $N \cdot D_i$; and storing the output data corresponding to the N data samples requires $N \cdot D_o$.
[0034] Correspondingly, without batch processing, the N data samples require N separate passes through the neural network. Referring to Figure 3b, the memory access volume of this layer without batch processing is the off-chip memory access volume of this layer when the N data samples are processed one by one. Specifically, it comprises N times the memory access volume of the weight parameter 303, N times the memory access volume of the input data 301b, and N times the memory access volume of the output data 302b, as expressed by equation (2):
[0035] $D_{\text{non\_batch\_layer}} = N \cdot (D_i + D_o + D_W)$    Equation (2)
[0036] Here, $D_{\text{non\_batch\_layer}}$ is the memory access volume of one layer when processing N data samples without batch processing; $D_i$ is the memory access volume for reading the input data corresponding to one data sample in this layer; $D_o$ is the memory access volume for storing the output data corresponding to one data sample in this layer; and $D_W$ is the memory access volume for reading the weight data of this layer when processing one data sample.
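A short numeric comparison of equations (1) and (2) makes the benefit of batching in Case 1 concrete. This is a sketch only: the function names and the per-sample volumes (100, 50, and 1000 bytes) are hypothetical values, not figures from the application.

```python
def batch_access_case1(n: int, d_i: int, d_o: int, d_w: int) -> int:
    """Equation (1): weights are read once for a batch of n samples."""
    return n * (d_i + d_o) + d_w


def non_batch_access_case1(n: int, d_i: int, d_o: int, d_w: int) -> int:
    """Equation (2): weights are read again for each of the n samples."""
    return n * (d_i + d_o + d_w)


# Hypothetical volumes in bytes (assumptions for illustration).
N, D_I, D_O, D_W = 8, 100, 50, 1000
batched = batch_access_case1(N, D_I, D_O, D_W)
unbatched = non_batch_access_case1(N, D_I, D_O, D_W)
saving = unbatched - batched  # algebraically (N - 1) * D_W
```

Subtracting equation (1) from equation (2) leaves (N - 1) weight reads, so in Case 1 the saving grows with both the batch size and the size of the layer's weights.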
[0037] Case 2: The input data is preset to be stored in off-chip memory, and the output data is preset to be stored in on-chip memory.
[0038] Figure 4a-1 illustrates a layer in batch processing whose input data 401a is stored in off-chip memory and whose output data 402a is stored in on-chip memory. In this case, the memory access volume of this layer comprises a single memory access of the weight parameter 403, the memory access volume of the input data 401a, and the memory access volume of the output data 402a. When the output data 402a is stored entirely in on-chip memory, its memory access volume is minimal and negligible.
[0039] Furthermore, it should be noted that, for a given batch size N, the on-chip memory storage space may be smaller than the storage space required for the output data 402a of that layer. For example, suppose the maximum batch size value corresponding to the on-chip memory storage space is $M_{\text{output\_on}}$ (a positive integer) and $N > M_{\text{output\_on}}$; in other words, the on-chip memory cannot hold all of the output data 402a. In that situation, as shown in Figure 4a-2, the output data 402a still needs to be stored in off-chip memory. Figure 4a-2 illustrates a layer in batch processing whose input data 401a is stored in off-chip memory and whose output data 402a is stored in both on-chip and off-chip memory (i.e., the output data is transferred part by part to on-chip memory and then to off-chip memory until all output data is stored). In this case, the memory access volume is the same as in Case 1: it comprises the memory access volume of the weight parameter 403, of the input data 401a of this layer, and of the output data 402a of this layer.
[0040] The size of the memory access in the two cases above is shown in equation (3):
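[0041] $D_{\text{batch\_layer}} = \begin{cases} N \cdot D_i + D_W, & N \le M_{\text{output\_on}} \\ N \cdot (D_i + D_o) + D_W, & N > M_{\text{output\_on}} \end{cases}$    Equation (3)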
[0042] Here, $M_{\text{output\_on}}$ is the maximum batch size value for which the on-chip memory can store the output data (limited by hardware resources or by the storage space the system can allocate). If the currently selected batch size N is less than or equal to $M_{\text{output\_on}}$, all output data can be stored in on-chip memory; otherwise, all output data must be stored in off-chip memory. The maximum batch size value may differ between layers. In some embodiments, each layer has its own maximum batch size value. In other embodiments, the same maximum batch size value M is used for the entire neural network.
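The threshold behaviour described for Case 2 can be sketched as a piecewise function. The code below is illustrative only: the function name, the per-sample volumes, and the threshold of 8 are hypothetical, and the per-sample figure is computed merely to show what happens when the batch size crosses the threshold.

```python
def batch_access_case2(n: int, d_i: int, d_o: int, d_w: int,
                       m_output_on: int) -> int:
    """Off-chip access of a Case 2 layer for a batch of n samples."""
    if n <= m_output_on:
        # Output stays on chip: only input reads and one weight read count.
        return n * d_i + d_w
    # Output spills to off-chip memory: same volume as Case 1.
    return n * (d_i + d_o) + d_w


# Hypothetical values (assumptions): D_i = 100, D_o = 50, D_W = 1000 bytes.
M_OUTPUT_ON = 8
per_sample = {
    n: batch_access_case2(n, 100, 50, 1000, M_OUTPUT_ON) / n
    for n in (4, 8, 9)
}
```

In this example the off-chip access per sample falls as the batch grows from 4 to 8, then rises again at 9 once the output no longer fits on chip, which is the trade-off the batch size selection is meant to balance.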
[0043] In contrast, without batch processing, the N data samples are processed one by one through the neural network. Figure 4b illustrates a layer without batch processing whose input data 401b is stored in off-chip memory and whose output data 402b is stored in on-chip memory. Referring to Figure 4b, without batch processing the memory access volume of this layer comprises N memory accesses of the weight parameter 403 and N times the memory access volume of the input data 401b of this layer. The memory access volume of the output data 402b is negligible because it is stored in on-chip memory. This is expressed by equation (4):
[0044] $D_{\text{non\_batch\_layer}} = N \cdot (D_i + D_W)$    Equation (4)
[0045] Here, $D_{\text{non\_batch\_layer}}$ is the memory access volume of one layer when processing N data samples without batch processing; $D_i$ is the memory access volume for reading the input data corresponding to one data sample in this layer; and $D_W$ is the memory access volume for reading the weight data of this layer when processing one data sample.
[0046] Case 3: The input data is preset to be stored in on-chip memory, and the output data is preset to be stored in off-chip memory.
[0047] Figure 5a-1 illustrates a layer in batch processing whose input data 501a is stored in on-chip memory and whose output data 502a is stored in off-chip memory. In this case, the memory access volume of this layer comprises the memory access volume of the weight parameter 503, the memory access volume of the output data 502a, and the memory access volume of the input data 501a. When the input data 501a is read entirely from on-chip memory, its memory access volume is minimal and negligible.
[0048] Furthermore, it should be noted that, for a given batch size N, the on-chip memory storage space may be smaller than the storage space required for the input data 501a of that layer. For example, suppose the maximum batch size value corresponding to the on-chip memory storage space is $M_{\text{input\_on}}$ (a positive integer) and $N > M_{\text{input\_on}}$; in other words, the on-chip memory cannot hold all of the input data 501a. In that situation, as shown in Figure 5a-2, the input data 501a still needs to be read from off-chip memory. Figure 5a-2 illustrates a layer in batch processing whose output data 502a is stored in off-chip memory and whose input data 501a is read from both on-chip and off-chip memory (i.e., the input data is transferred part by part from off-chip memory to on-chip memory until all input data is processed). In this case, the memory access volume is the same as in Case 1: it comprises a single memory access of the weight parameter 503, the memory access volume of the input data 501a of this layer, and the memory access volume of the output data 502a of this layer.
[0049] The size of the memory access in the two cases above is shown in equation (5):
[0050] $D_{\text{batch\_layer}} = \begin{cases} N \cdot D_o + D_W, & N \le M_{\text{input\_on}} \\ N \cdot (D_i + D_o) + D_W, & N > M_{\text{input\_on}} \end{cases}$    Equation (5)
[0051] Here, $M_{\text{input\_on}}$ is the maximum batch size value for which the on-chip memory can store the input data (limited by hardware resources or by the storage space the system can allocate). If the currently selected batch size N is less than or equal to $M_{\text{input\_on}}$, all input data can be read from on-chip memory; otherwise, all input data must be read from off-chip memory. The maximum batch size value may differ between layers. In some embodiments, each layer has its own maximum batch size value. In other embodiments, the same maximum batch size value M is used for the entire neural network.
[0052] In contrast, without batch processing, the N data samples are processed one by one through the neural network. Figure 5b illustrates a layer without batch processing whose input data 501b is stored in on-chip memory and whose output data 502b is stored in off-chip memory. Referring to Figure 5b, without batch processing the memory access volume of this layer comprises N memory accesses of the weight parameter 503 and N times the memory access volume of the output data 502b of this layer. The memory access volume of the input data 501b, which is read from on-chip memory, is negligible. This is expressed by equation (6):
[0053] $D_{\text{non\_batch\_layer}} = N \cdot (D_o + D_W)$    Equation (6)
[0054] Here, $D_{\text{non\_batch\_layer}}$ is the memory access volume of one layer when processing N data samples without batch processing; $D_o$ is the memory access volume for storing the output data corresponding to one data sample in this layer; and $D_W$ is the memory access volume for reading the weight data of this layer when processing one data sample.
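As a worked illustration, the Case 3 batch volumes described above (input traffic negligible when N does not exceed the on-chip maximum, otherwise counted as in Case 1) can be subtracted from equation (6). This derivation is not stated in the application; it follows only from the quantities already defined.

```latex
\begin{aligned}
N \le M_{\text{input\_on}}:\quad
  & D_{\text{non\_batch\_layer}} - D_{\text{batch\_layer}}
    = N (D_o + D_W) - (N D_o + D_W) = (N - 1)\, D_W \\
N > M_{\text{input\_on}}:\quad
  & D_{\text{non\_batch\_layer}} - D_{\text{batch\_layer}}
    = N (D_o + D_W) - \bigl(N (D_i + D_o) + D_W\bigr) = (N - 1)\, D_W - N D_i
\end{aligned}
```

Under these expressions, batching above the threshold reduces off-chip access for this layer only while the saved weight reads, $(N - 1) D_W$, outweigh the added input reads, $N D_i$.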
[0055] Case 4: The input data is preset to be stored in on-chip memory, and the output data is preset to be stored in on-chip memory.
[0056] Figure 6a-1 illustrates a layer in batch processing whose input data 601a and output data 602a are both stored in on-chip memory. In this case, the memory access volume of this layer comprises a single memory access of the weight parameter 603 and the memory access volumes of the input data 601a and the output data 602a of that layer. When both the input data 601a and the output data 602a are stored in on-chip memory, their memory access volume is minimal and can be ignored.
[0057] Furthermore, it should be noted that, for a given batch size N, the on-chip memory storage space may be smaller than the storage space required for the input data 601a or the output data 602a of that layer. For example, suppose the maximum batch size value corresponding to the on-chip memory storage space is $M_{\text{on}}$ (a positive integer) and $N > M_{\text{on}}$; in other words, the on-chip memory space is less than the larger of the storage space occupied by the input data 601a and the storage space occupied by the output data 602a. In that situation, as shown in Figure 6a-2, the technical solution of this application stores both the input data 601a and the output data 602a in off-chip memory and accesses them there. Figure 6a-2 illustrates a layer in batch processing whose input data 601a and output data 602a are stored in both on-chip and off-chip memory (i.e., the input data is transferred part by part from off-chip memory to on-chip memory until all input data is processed, and the output data is transferred part by part to on-chip memory and then to off-chip memory until all output data is stored). In this case, the memory access volume is the same as in Case 1: it comprises the memory access volume of the weight parameter 603, of the input data 601a of this layer, and of the output data 602a of this layer.
[0058] The size of the memory access in the two cases above is shown in equation (7):
[0059]
[0060] Here, $M_{on}$ is the smaller of the two maximum batch sizes for which the on-chip memory can hold the input data and the output data, respectively (as limited by hardware resources or by the storage space the system can allocate). If the currently selected batch size N is less than or equal to $M_{on}$, all input data is read from on-chip memory and all output data is stored in on-chip memory; otherwise, all input data must be read from off-chip memory and all output data must be stored in off-chip memory. The maximum batch size may differ between layers. In some embodiments, each layer has its own maximum batch size. In other embodiments, the same maximum batch size M is used for the entire neural network.
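The Case 4 rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of this application: the function name and the parameters `d_in` and `d_out` (off-chip traffic per data sample for the layer's input and output) are hypothetical names introduced here, and the sketch assumes that input and output traffic grows linearly with the batch size. `d_w` stands for the one-time weight traffic of the layer and `m_on` for $M_{on}$.

```python
def case4_batch_layer_access(n, m_on, d_w, d_in, d_out):
    """Off-chip memory access of one Case 4 layer when n samples are batched.

    d_in and d_out are hypothetical per-sample traffic figures for the
    layer's input and output; they are not symbols taken from the source.
    """
    if n < 1 or m_on < 1:
        raise ValueError("n and m_on must be positive integers")
    if n <= m_on:
        # Input and output both fit on chip, so only the single weight
        # read contributes to the memory access amount.
        return d_w
    # On-chip space is insufficient: input and output live off chip,
    # as in Case 1. Assumption: their traffic scales linearly with n.
    return d_w + n * (d_in + d_out)


print(case4_batch_layer_access(4, 6, 100, 10, 20))  # fits on chip -> 100
print(case4_batch_layer_access(8, 6, 100, 10, 20))  # spills -> 100 + 8 * 30 = 340
```

The boundary case `n == m_on` is treated as fitting on chip, matching the "less than or equal to" condition in the text.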
[0061] In contrast, without batch processing, the N data samples must be processed through the neural network one by one. Figure 6b illustrates how the input data 601b and the output data 602b of a layer are accessed in on-chip memory without batch processing. Referring to Figure 6b, without batch processing the memory access of this layer includes N memory accesses of the weight parameter 603. Since the input data 601b is read from on-chip memory and the output data 602b is stored in on-chip memory, their memory access is very small and can be ignored, as expressed by equation (8):
[0062] $D_{non\_batch\_layer} = N \cdot D_W$   Equation (8)
[0063] Here, $D_{non\_batch\_layer}$ is the memory access amount of one layer when N data samples are processed without batch processing, and $D_W$ is the memory access amount for reading the weight data of this layer for one data sample.
[0064] As described with reference to Figure 2, after the memory access amount of each layer has been calculated according to the cases above, the total memory access amount of the neural network can be determined. Specifically, the total memory access $D_{batch\_N}$ for batch processing with a batch size of N is obtained by accumulating the per-layer memory access $D_{batch\_layer}$. Correspondingly, the total memory access $D_{non\_batch\_N}$ for processing N data samples one by one without batch processing is obtained by accumulating the per-layer memory access $D_{non\_batch\_layer}$.
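The accumulation described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the function names and the example figures are hypothetical, and the example assumes every layer keeps its input and output in on-chip memory, so that only weight traffic is counted.

```python
def non_batch_layer_access(n, d_w):
    """Equation (8): the layer's weights are read once for each of n samples."""
    return n * d_w


def total_non_batch_access(n, layer_d_w):
    """D_non_batch_N: accumulate equation (8) over all layers."""
    return sum(non_batch_layer_access(n, d_w) for d_w in layer_d_w)


def total_batch_access(layer_batch_access):
    """D_batch_N: accumulate the per-layer D_batch_layer values."""
    return sum(layer_batch_access)


layer_d_w = [100, 200, 50]  # hypothetical per-layer weight traffic D_W
n = 4

# Without batching, every layer re-reads its weights for each sample.
print(total_non_batch_access(n, layer_d_w))  # 4 * (100 + 200 + 50) = 1400

# With batching and all input/output on chip, each layer reads its weights once.
print(total_batch_access(layer_d_w))  # 100 + 200 + 50 = 350
```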
[0065] Once the total memory access has been obtained, an evaluation criterion can be established that relates memory access to the improvement in computing performance. Based on the roofline model, the difference in computational density CCR (the ratio of computational load to memory access) between batch processing with batch size N and processing without batching can be calculated, and this difference can be used to measure the improvement in computing performance. In some embodiments, the computational density improvement index Δ is this difference in computational density CCR, as expressed by equation (9):
[0066] $\Delta = CCR_{batch\_N} - CCR_{non\_batch\_N} = \frac{N \sum_{i=1}^{n} O_i}{D_{batch\_N}} - \frac{N \sum_{i=1}^{n} O_i}{D_{non\_batch\_N}}$   Equation (9)
[0067] Here, $CCR_{batch\_N}$ and $CCR_{non\_batch\_N}$ are the computational density CCR with batch processing at batch size N and without batch processing, respectively. In equation (9), the numerator of each term is the computational load of processing N data samples, where $O_i$ is the time complexity of processing a single data sample in the i-th layer and n is the number of network layers; the denominator of each term is the corresponding total memory access.
[0068] It can be seen that the total time complexity for a single data sample, $\sum_{i=1}^{n} O_i$, is independent of the batch size. In some embodiments, the computational density improvement index Δ can therefore be calculated using only part of equation (9), as expressed by equation (10):
[0069] $\Delta = N \left( \frac{1}{D_{batch\_N}} - \frac{1}{D_{non\_batch\_N}} \right)$   Equation (10)
[0070] It can be seen that equation (10) involves the total memory access without batch processing, $D_{non\_batch\_N}$, whose value is a summation over the layers. According to Cases 1-4 above, each term of this summation contains the common factor N. This common factor in the denominator therefore cancels the N outside the parentheses, and the term that remains is independent of the batch size N. In some embodiments, the computational density improvement index Δ can thus be calculated using only part of equation (10), as expressed by equation (11):
[0071] $\Delta = \frac{N}{D_{batch\_N}}$   Equation (11)
[0072] The evaluation is then performed using this index.
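The simplifications described above can be checked numerically. The following is a minimal sketch, assuming equation (9) is the CCR difference, equation (10) drops the batch-independent factor $\sum O_i$, and equation (11) additionally drops the constant term $N / D_{non\_batch\_N}$. The two-layer network, its figures, and all function names are hypothetical; the point is only that the three indices pick the same candidate.

```python
def delta_eq9(n, d_batch, d_non_batch, ops_per_sample):
    """CCR difference: N*sum(O_i)/D_batch_N - N*sum(O_i)/D_non_batch_N."""
    work = n * ops_per_sample
    return work / d_batch - work / d_non_batch


def delta_eq10(n, d_batch, d_non_batch):
    """Equation (9) with the batch-independent factor sum(O_i) dropped."""
    return n * (1.0 / d_batch - 1.0 / d_non_batch)


def delta_eq11(n, d_batch):
    """Equation (10) with the constant term N / D_non_batch_N dropped."""
    return n / d_batch


# Hypothetical layers: (D_W, M_on, off-chip input+output traffic per sample).
LAYERS = [(600.0, 4, 50.0), (400.0, 6, 80.0)]


def d_batch_total(n):
    """Toy D_batch_N: weights read once; I/O goes off chip when n > M_on."""
    return sum(d_w + (n * io if n > m_on else 0.0) for d_w, m_on, io in LAYERS)


def d_non_batch_total(n):
    """Toy D_non_batch_N: every layer re-reads its weights for each sample."""
    return sum(n * d_w for d_w, _, _ in LAYERS)


candidates = range(2, 9)
ops = 5.0e6  # hypothetical sum of O_i for one data sample

best9 = max(candidates, key=lambda n: delta_eq9(
    n, d_batch_total(n), d_non_batch_total(n), ops))
best10 = max(candidates, key=lambda n: delta_eq10(
    n, d_batch_total(n), d_non_batch_total(n)))
best11 = max(candidates, key=lambda n: delta_eq11(n, d_batch_total(n)))
print(best9, best10, best11)  # prints: 6 6 6
```

In this toy setup $N / D_{non\_batch\_N}$ equals 1/1000 for every N, which is why removing it does not change the ranking of the candidates.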
[0073] After the computational density improvement index Δ has been calculated for each batch size candidate value, the candidate value with the best index Δ can be selected as the batch size for neural network data processing. That is, the predetermined selection rule used to choose among the batch size candidate values selects the one that maximizes the computational density of the neural network data processing. In some embodiments, other predetermined selection rules may be used depending on the actual application; for example, one or more additional conditions may be added to the predetermined selection rule.
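The selection step can be sketched as below. This is an illustrative sketch rather than the method as claimed: it scores each candidate with the simplified index N / D_batch_N (the form that remains once batch-independent terms are dropped), and the function name, the `extra_condition` hook, and the toy access model are hypothetical.

```python
def select_batch_size(candidates, total_batch_access, extra_condition=None):
    """Return the candidate N with the largest index N / D_batch_N.

    total_batch_access: callable mapping N to the total memory access D_batch_N.
    extra_condition: optional predicate standing in for the "additional
    conditions" that a selection rule may include (hypothetical hook).
    """
    best_n, best_delta = None, float("-inf")
    for n in candidates:
        if extra_condition is not None and not extra_condition(n):
            continue
        delta = n / total_batch_access(n)
        if delta > best_delta:
            best_n, best_delta = n, delta
    if best_n is None:
        raise ValueError("no candidate satisfies the selection rule")
    return best_n


def toy_access(n):
    """Hypothetical D_batch_N: on-chip buffers overflow above a batch of 4."""
    return 1000.0 + (300.0 * n if n > 4 else 0.0)


print(select_batch_size(range(2, 9), toy_access))                   # prints: 4
print(select_batch_size(range(2, 9), toy_access, lambda n: n >= 5))  # prints: 8
```

When several candidates tie, this sketch keeps the first one encountered; a real selection rule might break ties differently.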
[0074] By using the batch size selection method of this application, a predetermined batch size can be selected, and multiple layers of the neural network can process data in batches according to the predetermined batch size.
[0075] The method proposed in this application can be used in the training phase of a model to determine the batch size used during training. It can also be used in the inference phase, so that one batch of data samples of the selected batch size is inferred at a time.
[0076] The computational performance of the Inception v3 model was verified using the method proposed in this application. The candidate batch size was a natural number in the interval [2, 8], and the computational density enhancement index Δ shown in equation (10) was used for evaluation. The evaluation results obtained using this method are shown in Table 1. According to this method, a batch size of 6 is the optimal value within the interval [2, 8].
[0077]
[0078]
[0079] Table 1
[0080] The actual verification results are shown in Table 2. In Table 2, Cycle is the number of instruction cycles required for neural network model inference at the given batch size candidate value, and Cycle / batch is that number divided by the batch size, i.e., the average number of instruction cycles per data sample.
[0081] According to Table 2, in actual testing, the model exhibits the best computational performance when the batch size is 6, which is consistent with the method proposed in this application.
[0082]
Batch size candidate value | Cycle | Cycle / batch
2 | 17,790,121 | 8,895,060
3 | 22,574,918 | 7,524,972
4 | 28,584,960 | 7,146,240
5 | 34,536,531 | 6,907,306
6 | 40,456,642 | 6,742,773
7 | 47,602,127 | 6,800,303
8 | 54,533,058 | 6,816,632
[0083] Table 2
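The Cycle / batch column of Table 2 can be reproduced from the Cycle column with a few lines of Python. The cycle counts below are copied from Table 2; the variable names are illustrative only.

```python
# Total instruction cycles per batch size candidate value, copied from Table 2.
cycles = {
    2: 17790121,
    3: 22574918,
    4: 28584960,
    5: 34536531,
    6: 40456642,
    7: 47602127,
    8: 54533058,
}

# Average cycles per data sample, truncated to an integer as in the table.
cycles_per_sample = {n: c // n for n, c in cycles.items()}

# The best-performing batch size has the fewest cycles per data sample.
best = min(cycles_per_sample, key=cycles_per_sample.get)
print(best, cycles_per_sample[best])  # prints: 6 6742773
```

The truncated quotients match every entry in the Cycle / batch column, and the minimum falls at a batch size of 6.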
[0084] It is understood that the on-chip memory and the off-chip memory may be GPU memory hardware, although this application is not limited thereto. As mentioned above, the on-chip memory and the off-chip memory may respectively be a first memory whose accesses contribute little to the memory access amount and a second memory whose accesses contribute significantly to it. For example, the access latency of the on-chip memory is on the order of nanoseconds, while that of the off-chip memory is on the order of microseconds.
[0085] It is understood that the total memory access of a neural network can include the memory access of each layer of the neural network. In some embodiments, the computational cost of the neural network may mainly exist in the convolutional layers, and determining the total memory access of the neural network may at least include determining the sum of the memory access of the convolutional layers. In still other embodiments, the total memory access of the neural network may include the total memory access obtained by weighting and summing the memory access of different layers with different weights, and the different weights may be determined manually based on experience.
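The weighted variant mentioned above can be sketched as follows. The function name, the example layer figures, and the weights are hypothetical; setting a weight to zero simply excludes that layer, which also covers the embodiment that counts only convolutional layers.

```python
def weighted_total_access(layer_access, layer_weights=None):
    """Weighted sum of per-layer memory access amounts.

    With no weights given, every layer counts equally (a plain sum).
    """
    if layer_weights is None:
        layer_weights = [1.0] * len(layer_access)
    if len(layer_weights) != len(layer_access):
        raise ValueError("one weight is required per layer")
    return sum(w * d for w, d in zip(layer_weights, layer_access))


# Hypothetical per-layer access amounts: conv, pooling, conv, fully connected.
access = [400.0, 20.0, 600.0, 100.0]

print(weighted_total_access(access))                        # plain sum: 1120.0
print(weighted_total_access(access, [1.0, 0.0, 1.0, 0.0]))  # conv layers only: 1000.0
print(weighted_total_access(access, [1.0, 0.5, 1.0, 0.5]))  # manual weights: 1060.0
```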
[0086] This application considers the overall interlayer relationships and increases the overall computational density of the model by finding a suitable batch size value, thereby improving the overall computational performance. Furthermore, this application considers the relationships between upper and lower layers and the space limitations of on-chip memory during performance analysis, which has significant practical implications.
[0087] Another aspect of this application provides a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform any of the data processing methods described above. The computer-readable medium referred to in this application includes various types of computer storage media, and can be any available medium accessible to a general-purpose or special-purpose computer. For example, a computer-readable medium may include RAM, ROM, EPROM, E2PROM, registers, a hard disk, a removable disk, a CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other transitory or non-transitory medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. As used herein, disks typically reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media. An exemplary storage medium is coupled to a processor so that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integrated into the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC).
[0088] It should be noted that although several steps of the data processing method for neural networks have been mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to embodiments of this application, the features and functions of two or more modules described above can be embodied in one module. Conversely, the features and functions of one module described above can be further divided and embodied by multiple modules.
[0089] Those skilled in the art can understand and implement other modifications to the disclosed embodiments by studying the specification, the disclosure, the drawings, and the appended claims. In the claims, the word "comprising" does not exclude other elements and steps, and the words "a" or "an" do not exclude a plurality. In practical applications of this application, a single part may perform the function of multiple technical features referenced in the claims. Any reference numerals in the claims should not be construed as limiting the scope.
Claims
1. A data processing method for neural networks, characterized in that, The neural network is implemented by a computing device and includes multiple layers. The computing device includes a first memory and a second memory. The weight data of each layer is stored in the second memory. The input data and output data of each layer have separate preset storage locations, which are either the first memory or the second memory. The method includes: Each of the plurality of layers batches its input data and weight data with a predetermined batch size; wherein the predetermined batch size is determined in the following manner: Based on the preset storage locations corresponding to the input and output data of each of the multiple layers, and the relationship between the required storage space and the storage space of the preset storage locations, the actual storage locations of the input and output data of each layer are determined when processing data in batches using different batch size candidate values. Based on the determined actual storage location, determine the memory access status of each of the multiple layers to the second memory when processing data in batches using different batch size candidate values; Based on the memory access patterns corresponding to the different batch size candidate values, determine the total memory access volume of the neural network corresponding to the different batch size candidate values; Based on the total memory access of the neural network corresponding to the different batch size candidate values and the predetermined selection rule, the predetermined batch size is selected from the different batch size candidate values.
2. The data processing method according to claim 1, characterized in that, In the step of selecting the predetermined batch size from the different batch size candidate values, the predetermined selection rule includes selecting a predetermined batch size that maximizes the computational density of the neural network for data processing.
3. The data processing method according to claim 1, characterized in that, The preset storage locations for the input and output data of one layer of the neural network are both the second memory; The input data of the layer is read from the second memory; the output data of the layer is stored in the second memory.
4. The data processing method according to claim 1, characterized in that, The preset storage locations for the input and output data of a layer of the neural network are the second memory and the first memory, respectively. The input data for the layer is read from the second memory; If the storage space of the first memory is sufficient to accommodate the output data of the layer, then the output data of the layer is stored in the first memory; otherwise, the output data of the layer is stored in the second memory.
5. The data processing method according to claim 1, characterized in that, The preset storage locations for the input data and output data of a layer of the neural network are the first memory and the second memory, respectively. If the storage space of the first memory is sufficient to accommodate the input data of the layer, the input data of the layer is read from the first memory; otherwise, the input data of the layer is read from the second memory; the output data of the layer is stored in the second memory.
6. The data processing method according to claim 1, characterized in that, The preset storage locations for the input and output data of a layer of the neural network are both the first memory; If the storage space of the first memory is greater than the larger of the storage space occupied by the input data of the layer and the storage space occupied by the output data of the layer, then the input data of the layer is read from the first memory and the output data of the layer is stored in the first memory. Otherwise, the input data of the layer is read from the second memory, and the output data of the layer is stored in the second memory.
7. The data processing method according to claim 1, characterized in that, Selecting the predetermined batch size from the different batch size candidate values, based on the total memory access of the neural network corresponding to the different batch size candidate values and a predetermined selection rule, includes: For different batch size candidate values N, calculate the total memory access $D_{batch\_N}$ when N data samples are processed as a batch and the total memory access $D_{non\_batch\_N}$ when N data samples are processed one by one, to determine the computational density improvement index Δ, where N is a positive integer greater than 1; and, Among the different batch size candidate values, the candidate value that maximizes the computational density improvement index Δ is selected as the predetermined batch size.
8. The data processing method according to claim 1, characterized in that, The first memory is an on-chip memory, and the second memory is an off-chip memory.
9. A method for determining the batch size of data processed by a neural network, characterized in that, The neural network is implemented by a computing device and includes multiple layers. The computing device includes a first memory and a second memory. The weight data of each layer is stored in the second memory. The input data and output data of each layer have separate preset storage locations, which are either the first memory or the second memory. The method includes: Based on the preset storage locations corresponding to the input and output data of each of the multiple layers, and the relationship between the required storage space and the storage space of the preset storage locations, the actual storage locations of the input and output data of each layer are determined when processing data in batches using different batch size candidate values. Based on the determined actual storage location, determine the memory access status of each of the multiple layers to the second memory when processing data in batches using different batch size candidate values; Based on the memory access patterns corresponding to the different batch size candidate values, determine the total memory access volume of the neural network corresponding to the different batch size candidate values; Based on the total memory access of the neural network corresponding to the different batch size candidate values and the predetermined selection rules, a predetermined batch size is selected from the different batch size candidate values.
10. The method according to claim 9, characterized in that, In the step of selecting a predetermined batch size from the different batch size candidate values, the predetermined selection rule includes selecting a predetermined batch size that maximizes the computational density of the neural network for data processing.
11. The method according to claim 9, characterized in that, The preset storage locations for the input and output data of one layer of the neural network are both the second memory; The input data of the layer is read from the second memory; the output data of the layer is stored in the second memory.
12. The method according to claim 9, characterized in that, The preset storage locations for the input and output data of a layer of the neural network are the second memory and the first memory, respectively. The input data for the layer is read from the second memory; If the storage space of the first memory is sufficient to accommodate the output data of the layer, then the output data of the layer is stored in the first memory; otherwise, the output data of the layer is stored in the second memory.
13. The method according to claim 9, characterized in that, The preset storage locations for the input data and output data of a layer of the neural network are the first memory and the second memory, respectively. If the storage space of the first memory is sufficient to accommodate the input data of the layer, the input data of the layer is read from the first memory; otherwise, the input data of the layer is read from the second memory; the output data of the layer is stored in the second memory.
14. The method according to claim 9, characterized in that, The preset storage locations for the input and output data of a layer of the neural network are both the first memory; If the storage space of the first memory is greater than the larger of the storage space occupied by the input data of the layer and the storage space occupied by the output data of the layer, then the input data of the layer is read from the first memory and the output data of the layer is stored in the first memory. Otherwise, the input data of the layer is read from the second memory, and the output data of the layer is stored in the second memory.
15. The method according to claim 9, characterized in that, Selecting a predetermined batch size from the different batch size candidate values, based on the total memory access of the neural network corresponding to the different batch size candidate values and a predetermined selection rule, includes: For different batch size candidate values N, calculate the total memory access $D_{batch\_N}$ when N data samples are processed as a batch and the total memory access $D_{non\_batch\_N}$ when N data samples are processed one by one, to determine the computational density improvement index Δ, where N is a positive integer greater than 1; and, Among the different batch size candidate values, the candidate value with the largest computational density improvement index Δ is selected as the predetermined batch size.
16. The method according to claim 9, characterized in that, The first memory is an on-chip memory, and the second memory is an off-chip memory.