Adaptive memory allocation method and apparatus for CNN inference on vector accelerators
By identifying memory hotspots and dynamically searching for the minimum fission factor, the optimal graph transformation and storage allocation scheme is generated, solving the problems of resource waste and poor cross-platform adaptability of CNN models on vector accelerators, and achieving efficient memory management and performance improvement.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2026-06-04
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies for deploying CNN models on vector accelerators suffer from resource waste, ineffective optimization overhead, and poor cross-platform adaptability. In particular, the fission transformation method is difficult to optimize for key operators, which limits the overall performance improvement.
By identifying memory hotspots, an adaptive memory allocation method is constructed, the minimum fission factor is dynamically searched, and the optimal graph transformation and storage allocation scheme is generated, avoiding ineffective optimization and improving cross-platform applicability.
It effectively avoids ineffective optimization operations, improves resource utilization and computational efficiency, enhances cross-platform applicability, and ensures that fission optimization is only performed when it brings net performance benefits.
Figure CN122309181A_ABST
Description
Technical Field
[0001] This invention relates to the field of compilation optimization for vector accelerators, and in particular to an adaptive memory allocation method and apparatus for CNN inference on vector accelerators.
Background Technology
[0002] With the widespread deployment of Convolutional Neural Networks (CNNs) in edge computing, vector accelerators, due to their high performance and low power consumption, have gradually become the target hardware platform for deploying CNN models, such as Intel's Habana Gaudi2, Greco AI-specific chips, and Texas Instruments' (TI) C7x DSP. Vector accelerators typically have unique architectural features different from mainstream processors (such as x86 CPUs and GPUs), mainly composed of multiple general-purpose digital signal processor clusters interconnected via on-chip networks. Their storage system has a three-level hierarchical structure: large-capacity, high-latency DDR memory; medium-capacity, medium-speed Global Shared Memory (GSM); and small-capacity, high-speed Array Memory (AM) and Scalar Memory (SM). During CNN model inference, a large number of feature maps (intermediate tensors) are generated. These feature maps are typically allocated in the Global Shared Memory (GSM) or off-chip DDR memory in vector accelerators. Ideally, intermediate tensors are preferentially allocated to the GSM to take advantage of its higher bandwidth and lower access latency. When the tensor size exceeds the GSM capacity, it needs to be split into several sub-tensors through a fission operation to adapt to the storage constraints of GSM. However, fission introduces additional computational and data migration overhead. If the total overhead exceeds the time saved by reducing DDR accesses, it will actually reduce the overall inference performance. Therefore, how to efficiently manage and allocate memory resources becomes a key challenge for efficient inference of CNN models on vector accelerators.
[0003] Fission transformation methods based on graph equivalence transformation have attracted significant attention for deploying neural network models on resource-constrained devices. These methods reduce peak memory usage during model inference by breaking large operators down into multiple finer-grained operators. However, applying existing fission transformation methods to vector accelerators still presents the following challenges: (1) Wasted optimization effort and limited overall performance improvement. Existing fission methods usually cannot focus on key operators, and fissioning tensors that are not memory bottlenecks does not reduce the memory peak during CNN inference, so optimization effort is wasted and the overall performance gain is limited. (2) Substantial ineffective optimization overhead. Existing fission methods usually decide whether to fission solely on "whether the memory capacity constraint is met", ignoring the additional computation and data migration overhead introduced by the fission operation itself. This can lead to "ineffective allocation": even if a tensor is successfully allocated to GSM, the fission overhead may exceed the memory access benefit of using GSM, which increases the overall execution time.
[0004] (3) Poor cross-platform adaptability. Different target vector processors have different storage capacities. Fixed fission factors or empirical rules cannot produce fission and allocation schemes customized to the storage capacity constraints of specific hardware, so optimization strategies transfer poorly across platforms.
Summary of the Invention
[0005] The technical problem this invention aims to solve is as follows: Addressing the aforementioned issues in existing technologies, this invention provides an adaptive memory allocation method and apparatus for vector accelerator inference CNNs that is simple to implement, low in cost, high in resource utilization, low in computational overhead, and highly adaptable to cross-platform use. It can effectively avoid ineffective optimization overhead on non-critical operators by identifying memory hotspots that cause memory bottlenecks, and can quantitatively evaluate the net performance gains from fission operations, avoiding ineffective fission. Furthermore, it can automatically generate optimal graph transformation and storage allocation schemes for vector accelerators with different memory configurations, improving cross-platform applicability.
[0006] To solve the above technical problems, the technical solution proposed by this invention is an adaptive memory allocation method for CNN inference on vector accelerators, comprising the following steps:
Step S1: Obtain and parse the original CNN model, and construct the intermediate representation structures of the computation graph to obtain the initial computation graph. The intermediate representation structures include a ValueT structure representing tensor data, an Op structure representing computation operations, and a Graph structure representing the computation process and data dependencies.
Step S2: Optimize the initial computation graph according to the predefined operator fusion rules to obtain the initial optimized computation graph.
Step S3: Perform memory analysis on the current computation graph to identify memory hotspots during CNN model inference and obtain a memory hotspot subgraph.
Step S4: Starting from the minimum fission factor, fission the memory hotspot subgraph and iteratively search for the minimum feasible fission factor that satisfies the GSM capacity constraint of the target vector accelerator.
Step S5: Construct a cost model for the CNN model operators to quantify the execution time overhead of the memory hotspot subgraph. Using this cost model, calculate the time overhead of the memory hotspot subgraph before fission and its execution time overhead after fission with the minimum feasible fission factor. Determine the fission decision for the memory hotspot subgraph from these two overheads, and apply the decision to the computation graph to obtain the current optimized computation graph.
Step S6: Determine the storage allocation strategy for each intermediate tensor in the memory hotspot subgraph based on the fission decision determined in step S5.
Step S7: Determine whether the current global memory peak of the computation graph exceeds the GSM capacity constraint. If so, return to step S3; otherwise, output the final optimized computation graph and the intermediate tensor storage allocation mapping table.
[0007] Further, step S3 includes:
Step S3.1: Construct a tensor active table $A$ based on the scheduling sequence of the computation graph. The tensor active table records the effective lifetime of each tensor during execution of the CNN model, and the scheduling sequence is a topological ordering of the computation graph that satisfies the data dependencies of the CNN computation graph.
Step S3.2: Calculate the memory usage at each scheduling time based on the tensor active table constructed in step S3.1.
Step S3.3: Obtain the memory hotspot of the computation graph: find the peak value $M_{peak}$ among the memory usage values at all scheduling times, which represents the peak memory usage during execution of the computation graph, and obtain the corresponding timestamp $t_{peak}$. Query the tensor active table $A$ for the set of target tensors that are active at time $t_{peak}$, and obtain the Op operations corresponding to each tensor in that set to form the memory hotspot subgraph.
[0008] Further, step S3.1 includes: Based on a scheduling sequence sched of the computation graph, calculate the lifecycle active interval $[s_i, e_i]$ of each operator $v_i$, where $s_i$ is the start time, that is, the point in time when the operator begins execution, and $e_i$ is the end time, that is, the scheduling time at which the output tensor generated by the operator is last used by a downstream consumer operator. The active interval $[s_i, e_i]$ is the time range during which the operator remains active in memory. Based on the active interval $[s_i, e_i]$ of each operator, construct the tensor active table $A$, where $V$ is the set of operators in the computation graph, $n$ is the number of operators, and $t$ is the scheduling time. The rows of the tensor active table $A$ represent operator nodes, the columns correspond to timestamps, and each element $A[i][t]$ takes the value:
$$A[i][t] = \begin{cases} 1, & s_i \le t \le e_i \\ 0, & \text{otherwise} \end{cases}$$
[0009] where $A[i][t] = 1$ indicates that operator $v_i$ is active at timestamp $t$, $A[i][t] = 0$ indicates that operator $v_i$ is inactive at timestamp $t$, and $s_i$ and $e_i$ are the start and end times of the active interval of operator $v_i$.
[0010] Further, step S4 includes:
Step S4.1: Classify the computation graph nodes: based on the relative position of the nodes in the data flow, divide the nodes in the memory hotspot subgraph into three categories: root nodes, middle nodes, and leaf nodes; based on the computational characteristics of the operators, divide the operators into sliding window operators and element-wise operators.
Step S4.2: Reconstruct the height dimension of the memory hotspot tensors: using the node categories from step S4.1, perform a reverse traversal from the leaf nodes to the root nodes and, based on the current fission factor $f$, recursively recalculate the output tensor height and the input tensor height of each operator. The initial value of the fission factor $f$ is the minimum fission factor. The calculation proceeds iteratively and recursively until the input and output tensor heights of all nodes have been recalculated, which yields the height dimension reconstruction result of the memory hotspot tensors.
Step S4.3: Fission the memory hotspot subgraph and reconstruct the computation graph: based on the height dimension reconstruction result obtained in step S4.2, split and reconstruct the memory hotspot subgraph in the original computation graph old_graph to generate an optimized new computation graph new_graph.
Step S4.4: Calculate the peak memory usage $M_{peak}$ of the optimized new computation graph new_graph and compare it with the GSM capacity constraint $C_{GSM}$. If $M_{peak} \le C_{GSM}$, record the current fission factor $f$ as the minimum feasible fission factor that satisfies the GSM capacity constraint and proceed to step S5; otherwise, update the fission factor $f$ to its next value and return to step S4.2 to continue the search.
[0011] Furthermore, in step S4.2, for leaf nodes, the output tensor height of a sliding window operator is computed from the fission factor using the floor function, and its input tensor height is computed from that output height, the stride $S$, and the convolution kernel size $K$. If the current fission copy is at the beginning or end of the sequence, a boundary padding correction is applied to the calculation of the input height, where $P$ is the padding size. If (original input height + 2P - K) is not divisible by the stride $S$, a stride divisibility correction is applied: the input height is increased by 1. For an element-wise operator at a leaf node, the output tensor height is likewise derived from the fission factor, and the input tensor height equals the output tensor height. For a sliding window operator at an intermediate or root node position, the output tensor height equals the input tensor height of its downstream operator, and the boundary padding correction and the stride divisibility correction are applied when computing its input tensor height. For an element-wise operator at an intermediate or root node position, the output tensor height equals the input tensor height of its downstream operator, and the input tensor height equals the output tensor height.
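The reverse traversal of step S4.2 can be illustrated on a simple operator chain. The sketch below is only an assumption-laden illustration: it assumes the leaf output height is the original height floor-divided by the fission factor, and it uses the standard sliding-window relation `h_in = (h_out - 1) * stride + kernel`. The boundary padding and stride divisibility corrections are left out, and all function names are hypothetical.

```python
def leaf_output_height(h_out_full, f):
    """Assumed split of the leaf output height across f replicas (floor division)."""
    return h_out_full // f


def sliding_window_input_height(h_out, kernel, stride):
    """Rows of input needed to produce h_out rows of output, ignoring padding."""
    return (h_out - 1) * stride + kernel


def reconstruct_heights(chain, h_out_full, f):
    """Walk a linear chain from leaf to root and return {name: (h_in, h_out)}.

    chain: list of dicts ordered root -> leaf, each with 'name' and 'kind'
           ('sliding' or 'elementwise'); sliding operators also carry
           'kernel' and 'stride'.
    """
    heights = {}
    h_out = leaf_output_height(h_out_full, f)
    for op in reversed(chain):
        if op["kind"] == "sliding":
            h_in = sliding_window_input_height(h_out, op["kernel"], op["stride"])
        else:
            h_in = h_out  # element-wise operators preserve the height
        heights[op["name"]] = (h_in, h_out)
        h_out = h_in  # the upstream output height equals this operator's input height
    return heights


chain = [
    {"name": "conv1", "kind": "sliding", "kernel": 3, "stride": 1},
    {"name": "relu1", "kind": "elementwise"},
    {"name": "pool1", "kind": "sliding", "kernel": 2, "stride": 2},
]
heights = reconstruct_heights(chain, h_out_full=56, f=2)
print(heights)
```

With a fission factor of 2, each replica of the leaf `pool1` produces 28 rows, which requires 56 rows from `relu1` and therefore 58 input rows for `conv1` because of its 3-row kernel.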
[0012] Further, step S4.3 includes:
Step S4.3.1: Initialize the new computation graph: copy the original computation graph old_graph to generate the initial new computation graph new_graph. According to the new tensor dimensions reconstructed in step S4.2, traverse and update the input and output tensor dimensions of the operators of the memory hotspot subgraph in the current new computation graph new_graph. Use the corresponding memory hotspot structure in the current new computation graph new_graph as the first copy of the memory hotspot subgraph after fission.
Step S4.3.2: Insert a Concat operator: insert a Concat operator after the leaf node of the first copy of the memory hotspot subgraph, to serve as the concatenation operator that merges the outputs of the multiple fission copies.
Step S4.3.3: Construct the mapping tables of the computation graph: construct the vertex mapping table nameToVertex to store the mapping from vertex names to vertex pointers in the graph, the operator mapping table nameToOp to store the mapping from operator names to operator pointers, and the tensor mapping table nameToValue to store the mapping from all tensor names to tensor pointers.
Step S4.3.4: Create and connect new subgraph copies: if the fission factor $f$ of the current memory hotspot subgraph is greater than 1, generate the new memory hotspot subgraph copies and connect them in sequence.
Step S4.3.5: Update the predecessor and successor relationships of each operator in the new computation graph new_graph to generate the optimized new computation graph new_graph.
[0013] Furthermore, in step S4.3.4, the steps for generating and connecting a new memory hotspot subgraph copy are as follows:
Create new operator nodes: for each memory hotspot subgraph copy, traverse the hotspot subgraph in data flow hierarchy order, create an Op-type replica operator new_op for each original operator original_op, and set the new operator name.
Create the input tensors of the new operator node: create a new tensor new_input for each non-parameter input of the replica operator new_op, set the tensor name, establish the definition relationship with the corresponding upstream operator of the copy, and assign the new input tensor to the input of the new operator.
Create the output tensors of the new operator node: create a new tensor new_output for each output of the replica operator new_op, set the name of the new output tensor, copy all attributes of the original output, establish the output tensor definition relationship, and add the new output tensor to the output tensor set of the new operator new_op.
Establish connections between operators: connect the input of the replica operator new_op to the output of the upstream operator of the corresponding copy, and connect the output of the replica operator new_op to the input of the downstream operator of the corresponding copy. If the operator is a leaf node replica, connect it to the input of the concatenation operator concat_op.
Update the computation graph: insert the newly created operators into the current new computation graph new_graph, and synchronously update the vertex mapping table, operator mapping table, and tensor mapping table with the newly created operators and their input and output tensor information.
[0014] Furthermore, in step S5, the time overhead of the memory hotspot subgraph before fission includes the floating-point computation overhead and the data transfer overhead of all operators.
[0015] Its calculation expression uses the following quantities: the peak floating-point computing performance of the vector accelerator; the DDR memory access bandwidth; the set of operators corresponding to the memory hotspot subgraph and the i-th operator in that subgraph; the number of bytes occupied by the data type; and, for each operator, its floating-point computation amount and the memory usage of its weight tensor, input tensor, and output tensor. The execution time overhead of the memory hotspot subgraph after fission additionally includes the extra computation and data transfer overhead introduced by fission.
[0016] Its calculation expression uses the peak floating-point computing performance of the vector accelerator, the DDR memory access bandwidth, the GSM storage bandwidth, and, for each part produced by fission, the floating-point computation amount of its operators and the memory usage of their weight, input, and output tensors, where N is the number of subgraphs generated by the fission. The step of determining the fission decision for the memory hotspot subgraph based on the time overhead before fission and the execution time overhead after fission includes: if the execution time overhead after fission is less than the time overhead before fission, fission is performed on the current memory hotspot subgraph; otherwise, fission is not performed on the current memory hotspot subgraph.
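The comparison in step S5 can be sketched with a simple additive cost model. The form used here is an assumption, not the expression of this invention: each operator costs its FLOPs divided by peak performance plus its tensor traffic divided by bandwidth; before fission all traffic is charged to DDR, and after fission weights are charged to DDR while input and output tensors are charged to GSM. The names `OpCost`, `time_before`, `time_after`, and `should_fission` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OpCost:
    flops: float        # floating-point operations of the operator
    weight_elems: int   # number of weight elements
    input_elems: int    # number of input tensor elements
    output_elems: int   # number of output tensor elements


def time_before(ops, peak_flops, bw_ddr, elem_bytes):
    """Assumed cost before fission: compute time plus DDR traffic time."""
    total = 0.0
    for op in ops:
        traffic = (op.weight_elems + op.input_elems + op.output_elems) * elem_bytes
        total += op.flops / peak_flops + traffic / bw_ddr
    return total


def time_after(parts, peak_flops, bw_ddr, bw_gsm, elem_bytes):
    """Assumed cost after fission, summed over the N replicas in `parts`."""
    total = 0.0
    for ops in parts:
        for op in ops:
            total += op.flops / peak_flops
            total += op.weight_elems * elem_bytes / bw_ddr
            total += (op.input_elems + op.output_elems) * elem_bytes / bw_gsm
    return total


def should_fission(t_before, t_after):
    """Fission only when it strictly reduces the estimated execution time."""
    return t_after < t_before


# Illustrative numbers only: one operator split into two slightly overlapping halves.
whole = [OpCost(2e9, 1000, 1_000_000, 1_000_000)]
halves = [[OpCost(1.05e9, 1000, 520_000, 500_000)] for _ in range(2)]
tb = time_before(whole, peak_flops=1e12, bw_ddr=1e10, elem_bytes=4)
ta = time_after(halves, peak_flops=1e12, bw_ddr=1e10, bw_gsm=1e11, elem_bytes=4)
print(tb, ta, should_fission(tb, ta))
```

In this example the replicas recompute overlapping rows (more total FLOPs and input elements), yet the faster GSM traffic still makes the fissioned estimate lower, so the decision is to fission.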
[0017] Further, step S6 includes: A tensor storage allocation map is defined to record the final storage allocation decision for each tensor. Each record in the tensor storage allocation map corresponds to a tensor in the computation graph. Each record includes the following fields: sequence number, tensor identifier, decision result, and storage location. The decision result represents the fission decision of the entire memory hotspot subgraph associated with the current tensor. The storage location is used to indicate the final physical storage location of the current tensor on the target hardware. If the fission decision of the current memory hotspot subgraph is fission, then update the tensor storage allocation mapping table, mark the decision result of the intermediate tensor generated by the current memory hotspot subgraph as fission, and mark the storage location as GSM; if the decision result of the memory hotspot subgraph is no fission, then update the tensor storage allocation mapping table, mark the fission decision result of the intermediate tensor generated by the current memory hotspot subgraph as no fission, and mark the storage location as DDR.
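The tensor storage allocation map of step S6 can be sketched as a dictionary of records. The record fields follow the text (sequence number, tensor identifier, decision result, storage location); the class name `AllocRecord`, the helper `update_alloc_map`, and the string values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AllocRecord:
    seq: int        # sequence number
    tensor: str     # tensor identifier
    decision: str   # "fission" or "no fission" for the owning hotspot subgraph
    location: str   # "GSM" or "DDR"


def update_alloc_map(alloc_map, tensor_names, do_fission):
    """Mark every intermediate tensor of a hotspot subgraph with its decision."""
    decision = "fission" if do_fission else "no fission"
    location = "GSM" if do_fission else "DDR"
    for name in tensor_names:
        if name in alloc_map:
            alloc_map[name].decision = decision
            alloc_map[name].location = location
        else:
            alloc_map[name] = AllocRecord(len(alloc_map) + 1, name, decision, location)
    return alloc_map


alloc_map = {}
update_alloc_map(alloc_map, ["conv1_out", "pool1_out"], do_fission=True)
update_alloc_map(alloc_map, ["conv5_out"], do_fission=False)
for rec in alloc_map.values():
    print(rec)
```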
[0018] A computer device includes a processor and a memory, the memory being used to store a computer program, and the processor being used to execute the computer program to perform the method described above.
[0019] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. This invention identifies memory hotspots that cause memory bottlenecks and determines fission decisions based on memory hotspot subgraphs. This ensures that fission optimization can directly address performance bottlenecks and effectively avoids ineffective optimization operations on non-critical operators.
[0020] 2. In determining the fission decision of a memory hotspot subgraph, this invention constructs a hardware-aware cost model by utilizing the time overhead of the memory hotspot subgraph before fission and the execution time overhead after fission. Based on the time overhead before and after fission, the fission decision of the memory hotspot subgraph is determined. This can improve the fission decision criterion from "whether the intermediate tensor can be put into memory" to "whether it can improve inference speed". This allows fission to be performed only when it brings net performance benefits, solving the performance degradation problem caused by blind fission in traditional methods.
[0021] 3. This invention dynamically searches for the minimum fission factor that satisfies the GSM capacity constraint based on the specific memory capacity constraint of the target vector accelerator. It supports fission for sliding window and element-wise operators, and can automatically generate the optimal graph transformation and storage allocation scheme for vector accelerators with different memory configurations, thereby improving cross-platform applicability.
Description of the Drawings
[0022] Figure 1 is a schematic diagram of the overall process of the adaptive memory allocation method for CNN inference on vector accelerators in this embodiment.
[0023] Figure 2 is a schematic diagram of the computation graph of the CNN model (testModel) in a specific application embodiment of the present invention.
[0024] Figure 3 is a schematic diagram of the Graph structure constructed in a specific application embodiment of the present invention.
[0025] Figure 4 is a schematic diagram of the computation graph after operator fusion obtained in a specific application embodiment of the present invention.
[0026] Figure 5 is a schematic diagram of the Op structure constructed in a specific application embodiment of the present invention (taking the conv1+relu operator as an example).
[0027] Figure 6 is a schematic diagram of the memory hotspot subgraph identified in a specific application embodiment of the present invention.
[0028] Figure 7 is a schematic diagram of the new dimension information obtained after performing dimension reconstruction on subgraph1 in a specific application embodiment of the present invention.
[0029] Figure 8 is a schematic diagram of the new dimension information obtained after performing dimension reconstruction on subgraph2 in a specific application embodiment of the present invention.
[0030] Figure 9 is a schematic diagram of the optimized complete computation graph obtained in a specific application embodiment of the present invention.
[0031] Figure 10 is a schematic diagram of the storage allocation scheme generated in a specific application embodiment of the present invention.
Detailed Implementation
[0032] The present invention will be further described below with reference to the accompanying drawings and specific preferred embodiments, but this does not limit the scope of protection of the present invention.
[0033] As shown in Figure 1, the adaptive memory allocation method for CNN inference on vector accelerators in this embodiment includes the following steps: Step S1: Obtain the original CNN model, parse it, and construct the intermediate representation structures of the computation graph to obtain the initial computation graph.
[0034] In this embodiment, three core intermediate representation structures are defined: the ValueT structure (tensor value structure) for representing tensor data, the Op structure for representing computational operations, and the Graph structure for representing computational processes and data dependencies.
[0035] Specifically, the ValueT structure is a unified representation of data flow in a computation graph, and its key attributes include: (1) Name: A unique identifier for a tensor; (2) Value type (kind): used to distinguish the semantic role of tensors in the computation graph, including three types: input type (INPUT): representing the input data of the model, parameter type (PARAM): representing the persistent parameters of the model such as weights and biases, and output type (RESULT): representing the computation result of the operator.
[0036] (3) Type: Represents the data type (DataType) and shape dimension of the tensor.
[0037] (4) Definition relationship (def): For tensor values of kind RESULT, references the source operator that created the tensor value, establishing the producer association.
[0038] (5) Uses: Stores the set of operators that take the current output tensor as input.
[0039] Specifically, the operator operation structure Op is the basic unit of the computation graph, and its key properties include: (1) Name: The unique identifier of the operator.
[0040] (2) Type: Operator type, such as Convolution (Conv), Pooling (Pool), Activation (ReLU), etc.
[0041] (3) Input tensor set (inputs): A set of all input tensor data (ValueT) of this operator, which are derived from INPUT, PARAM or RESULT of other operators.
[0042] (4) Output Tensor Set (outputs): A set of all output tensor data (ValueT) of this operator, where the kind of these values is RESULT.
[0043] (5) Attribute set: Stores the specific rules and parameters of this operator in key-value pairs, such as kernel size, stride, padding, etc.
[0044] (6) Predecessor set (preds): The set of predecessor operators obtained based on the def relation of the input tensor, used to construct computational dependencies.
[0045] (7) Successor set (succs): The set of successor operators pointed to by the uses of the output tensors, used to construct the data flow between operators.
[0046] Specifically, the computation graph structure is used to represent the node information and data flow information of the CNN computation graph. Its key properties include: (1) Name: A string that identifies the computation graph.
[0047] (2) Input tensor set (inputs): Stores the input data nodes of the model, and the type is ValueT.
[0048] (3) Output Tensor Set (outputs): Stores the output data nodes of the model, with a type of ValueT.
[0049] (4) Parameter set (params): Stores persistent parameters such as weights and biases of the model, and is of type ValueT.
[0050] (5) Operator set (ops): Stores the set of computation nodes that constitute the computation graph. The type is Vertex, where Vertex is a unified abstract base class for operator operation Op and input node Input.
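The three structures described above can be sketched as Python dataclasses. This is a minimal illustration rather than the implementation of this embodiment: the field names follow the text (name, kind, def, uses, inputs, outputs, preds, succs), while the class layout, the spelling `def_op` (since `def` is a Python keyword), and the helper `add_op` are assumptions. The Vertex base class and Input nodes are omitted for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Kind(Enum):
    INPUT = "INPUT"    # model input data
    PARAM = "PARAM"    # persistent parameters (weights, biases)
    RESULT = "RESULT"  # result computed by an operator


@dataclass(eq=False)
class ValueT:
    name: str
    kind: Kind
    dtype: str = "float32"
    shape: tuple = ()
    def_op: Optional["Op"] = None             # producer, set only for RESULT values
    uses: list = field(default_factory=list)  # operators that consume this value


@dataclass(eq=False)
class Op:
    name: str
    type: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)  # e.g. kernel size, stride, padding
    preds: list = field(default_factory=list)
    succs: list = field(default_factory=list)


@dataclass(eq=False)
class Graph:
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    params: list = field(default_factory=list)
    ops: list = field(default_factory=list)


def add_op(graph, op, inputs, outputs):
    """Hypothetical helper: wire an operator into the graph and derive preds/succs."""
    for v in inputs:
        op.inputs.append(v)
        v.uses.append(op)
        if v.kind is Kind.RESULT and v.def_op is not None:
            if v.def_op not in op.preds:
                op.preds.append(v.def_op)
            if op not in v.def_op.succs:
                v.def_op.succs.append(op)
    for v in outputs:
        v.def_op = op
        op.outputs.append(v)
    graph.ops.append(op)


g = Graph("demo")
x = ValueT("x", Kind.INPUT, shape=(1, 3, 224, 224))
w = ValueT("w", Kind.PARAM)
y = ValueT("y", Kind.RESULT)
z = ValueT("z", Kind.RESULT)
conv = Op("conv1", "Conv", attrs={"kernel": 3, "stride": 1, "pad": 1})
relu = Op("relu1", "ReLU")
add_op(g, conv, [x, w], [y])
add_op(g, relu, [y], [z])
print([p.name for p in relu.preds], [s.name for s in conv.succs])
```

The predecessor and successor sets are derived from `def_op` and `uses` rather than stored independently, which mirrors how the text defines preds and succs.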
[0051] In specific application examples, the standardized ONNX model file (.onnx) can be read and deserialized using the ONNX (Open Neural Network Exchange) parser, and its contents can be loaded into the onnx::ModelProto structure. onnx::ModelProto is the core data structure of the ONNX standard, fully describing the model's topology, parameters, and metadata. Subsequently, based on the data in ModelProto, the aforementioned Graph structure computation graph is constructed.
[0052] Step S2: Optimize the initial computation graph according to the predefined operator fusion rules to obtain the initial optimized computation graph.
[0053] This embodiment uses predefined operator fusion rules to fuse adjacent, mergeable operators in the initial computation graph to obtain the initial optimized computation graph.
[0054] Specifically, the operator fusion rules can include the following: fusion of convolution and activation function (Conv+ReLU), with the fused operator type being ConvReLU; fusion of convolution, activation function, and normalization (Conv+ReLU+LRN), with the fused operator type being ConvReluLRN; fusion of pooling and normalization (MaxPool+LRN), with the fused operator type being MaxPoolLRN; fusion of tensor addition operator and activation function (Add+ReLU), with the fused operator type being AddReLU; and fusion of matrix multiplication and activation function (GEMM+ReLU), with the fused operator type being GemmReLU.
[0055] It is understandable that, in addition to the rules mentioned above, other fusion rules between operator types can be configured according to actual needs.
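As an illustration of how such rules might be applied, the sketch below greedily fuses a linear chain of operator types, trying the longest pattern first. The rule table is taken from the text; the function `fuse_chain` and the restriction to a linear chain are simplifying assumptions. A real pass would work on the Graph structure and would also need to check that the intermediate tensor has no other consumer.

```python
# Fusion rules from the text: pattern of operator types -> fused operator type.
FUSION_RULES = {
    ("Conv", "ReLU", "LRN"): "ConvReluLRN",
    ("Conv", "ReLU"): "ConvReLU",
    ("MaxPool", "LRN"): "MaxPoolLRN",
    ("Add", "ReLU"): "AddReLU",
    ("GEMM", "ReLU"): "GemmReLU",
}


def fuse_chain(op_types):
    """Greedily fuse a linear chain of operator types, longest pattern first."""
    patterns = sorted(FUSION_RULES, key=len, reverse=True)
    fused, i = [], 0
    while i < len(op_types):
        for pat in patterns:
            if tuple(op_types[i:i + len(pat)]) == pat:
                fused.append(FUSION_RULES[pat])
                i += len(pat)
                break
        else:
            # No rule starts at this position: keep the operator unchanged.
            fused.append(op_types[i])
            i += 1
    return fused


chain = ["Conv", "ReLU", "LRN", "MaxPool", "Conv", "ReLU", "GEMM", "ReLU", "GEMM"]
result = fuse_chain(chain)
print(result)
```

Trying the three-operator pattern before the two-operator ones ensures that Conv+ReLU+LRN is not prematurely fused into ConvReLU followed by a standalone LRN.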
[0056] Step S3: Perform memory analysis on the current computation graph to identify memory hotspots during the CNN model inference process and obtain a memory hotspot subgraph.
[0057] In this embodiment, a memory hotspot refers to the set of nodes corresponding to all simultaneously active tensors at a certain moment in the execution of the CNN computation graph scheduling sequence, and the total memory usage of this set reaches the peak of the entire scheduling sequence.
[0058] As an optional implementation, the specific steps for identifying memory hotspots and obtaining a memory hotspot subgraph include: Step S3.1: Construct the tensor active table $A$ based on the scheduling sequence of the computation graph. The tensor active table $A$ records the effective lifetime of each tensor during execution of the CNN model, and the scheduling sequence is a topological ordering of the computation graph that satisfies the data dependencies of the CNN computation graph.
[0059] For example, the tensor active table $A$ can be constructed using the following steps:
Step S3.1.1: Based on a scheduling sequence sched of the computation graph, calculate the lifecycle active interval $[s_i, e_i]$ of each operator $v_i$, where $s_i$ is the start time, that is, the time point (scheduling sequence number) at which the operator begins execution, and $e_i$ is the end time, that is, the scheduling time at which the output tensor generated by the operator is last used by a downstream consumer operator. The active interval $[s_i, e_i]$ is the time range during which the operator must remain active in memory.
Step S3.1.2: Based on the active interval $[s_i, e_i]$ of each operator, construct the tensor active table $A$, where $V$ is the set of operators in the computation graph, $n$ is the number of operators, and $t$ is the scheduling time. The rows of $A$ represent operator nodes, the columns correspond to timestamps, and each element $A[i][t]$ takes the value:
$$A[i][t] = \begin{cases} 1, & s_i \le t \le e_i \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
where $A[i][t] = 1$ indicates that operator $v_i$ is active at timestamp $t$ (its output tensor needs to be kept in memory), $A[i][t] = 0$ indicates that operator $v_i$ is inactive at timestamp $t$, and $s_i$ and $e_i$ are the start and end times of the active interval of operator $v_i$.
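A minimal sketch of this construction follows. It assumes zero-based scheduling times, a `consumers` dictionary mapping each operator to the operators that read its output, and that an operator with no consumer stays live only at its own step. These names and conventions are illustrative assumptions.

```python
def build_active_table(sched, consumers):
    """Return (A, intervals) for a schedule.

    sched:     operator names in topological (execution) order
    consumers: maps an operator name to the operators that read its output
    A[i][t] is 1 when the output of sched[i] must be kept in memory at step t.
    """
    step = {name: t for t, name in enumerate(sched)}
    intervals = {}
    for name in sched:
        start = step[name]
        # The output lives until its last consumer runs; with no consumer
        # it is assumed to live only at its own step.
        end = max([step[c] for c in consumers.get(name, [])], default=start)
        intervals[name] = (start, end)
    A = [[1 if intervals[name][0] <= t <= intervals[name][1] else 0
          for t in range(len(sched))]
         for name in sched]
    return A, intervals


sched = ["conv1", "conv2", "pool", "add"]
consumers = {"conv1": ["conv2", "add"], "conv2": ["pool"], "pool": ["add"]}
A, intervals = build_active_table(sched, consumers)
for name, row in zip(sched, A):
    print(name, intervals[name], row)
```

Here `conv1` feeds both `conv2` and the later `add`, so its row stays at 1 across all four steps, which is exactly the kind of long-lived tensor that inflates the memory peak.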
[0060] Step S3.2: Calculate the memory usage at each scheduling time based on the tensor activity table constructed in step S3.1.
[0061] Specifically, based on the tensor active table $A$ constructed in step S3.1, summing the sizes of the active tensors at each time step yields the memory usage $M(t)$ at any time $t$:
$$M(t) = \sum_{i=1}^{n} A[i][t] \cdot \mathrm{size}(v_i) \qquad (2)$$
where $V$ is the set of operators in the computation graph, $n$ is the number of operators, and $\mathrm{size}(v_i)$ is the memory footprint of the output tensor of operator $v_i$.
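The summation in step S3.2 can be written directly from the active table. In the sketch below, `memory_usage` and `out_bytes` are hypothetical names, and the table and sizes are illustrative values only.

```python
def memory_usage(A, out_bytes):
    """M[t] = sum over operators i of A[i][t] * out_bytes[i]."""
    n_steps = len(A[0]) if A else 0
    return [sum(A[i][t] * out_bytes[i] for i in range(len(A)))
            for t in range(n_steps)]


# Active table for four operators over four steps (rows are operators).
A = [
    [1, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
out_bytes = [400, 300, 100, 400]  # illustrative output tensor sizes in bytes
M = memory_usage(A, out_bytes)
print(M)
```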
[0062] Step S3.3: Obtain the memory hotspot of the computation graph: find the peak value $M_{peak}$ among the memory usage values at all scheduling times, which represents the peak memory usage during execution of the computation graph, and obtain the corresponding timestamp $t_{peak}$. Query the tensor active table $A$ for the set of target tensors that are active at time $t_{peak}$, and obtain the Op operations corresponding to each tensor in that set to form the memory hotspot subgraph.
[0063] Specifically, the maximum of the memory usage over all scheduling times, $M_{peak} = \max_t M(t)$, is the peak memory usage during execution of the computation graph, and $t_{peak}$ is the timestamp at which it occurs. Querying the active table $A$ for the set of tensors active at time $t_{peak}$ and taking their corresponding Op operations yields the memory hotspot subgraph.
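The hotspot query can be sketched as a maximum followed by a column lookup in the active table. The function name `find_hotspot` is hypothetical, and picking the first step that reaches the peak when several steps tie is an assumption the text does not specify.

```python
def find_hotspot(A, M, op_names):
    """Return (peak, t_peak, hotspot_ops) from an active table and usage list."""
    peak = max(M)
    t_peak = M.index(peak)  # first step reaching the peak (tie-break assumption)
    hotspot = [op_names[i] for i in range(len(A)) if A[i][t_peak] == 1]
    return peak, t_peak, hotspot


op_names = ["conv1", "conv2", "pool", "add"]
A = [
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
M = [400, 700, 800, 500]  # bytes in use at each step, consistent with A
peak, t_peak, hotspot = find_hotspot(A, M, op_names)
print(peak, t_peak, hotspot)
```

The operators returned are those whose output tensors are simultaneously live at the peak step, so they are the only candidates whose fission can lower the peak.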
[0064] Step S4: Starting from the minimum fission factor, perform fission on the memory hotspot subgraph and search iteratively for the minimum feasible fission factor $n'$ that satisfies the Global Shared Memory (GSM) capacity constraint $M_{GSM}$ of the target vector accelerator.
[0065] The fission factor $n$ refers to the number of parts into which the original memory hotspot subgraph is split along a specific dimension (usually the spatial height dimension) in order to reduce the memory footprint of executing the hotspot subgraph. When $n = 1$, the subgraph structure remains unchanged; when $n > 1$, the original subgraph is reconstructed into a composite structure consisting of $n$ subgraph copies with identical topology. In this embodiment, after the memory hotspot subgraph is obtained, an iterative search starting from the minimum fission factor $n = 1$ dynamically finds the minimum feasible fission factor $n'$ that satisfies the GSM capacity constraint $M_{GSM}$ of the current hardware. The fission factor of each memory hotspot subgraph is thus solved adaptively according to the specific memory capacity of the target vector accelerator, which yields customized storage allocation schemes and improves cross-platform applicability.
[0066] As an optional implementation, the specific steps of the iterative search for the minimum feasible fission factor that satisfies the GSM capacity constraint are as follows: Step S4.1: Classify computation graph nodes: Based on the relative position of the nodes in the data flow of the memory hotspot subgraph, the nodes in the memory hotspot subgraph are divided into three categories: root node, middle node, and leaf node. Based on the computational characteristics of the operators, the operators are divided into sliding window operators and element-wise operators.
[0067] Specifically, the root node is located at the top of the subgraph data flow, with no predecessor node or all predecessor nodes located outside the current hot subgraph, and is responsible for receiving input data from external or other subgraphs; the middle node is located in the middle of the subgraph, and has both an upstream predecessor node and a downstream successor node, and undertakes the main computation and data transformation tasks within the subgraph; the leaf node is located at the bottom of the subgraph data flow, with no successor node or all successors located outside the current hot subgraph, and its output will be used as the final result or passed to other subgraphs.
[0068] Specifically, sliding window operators involve dense sampling and computation within a local neighborhood (sliding window) of the input feature map, such as convolution operators (Conv) and pooling operators (MaxPool / AvgPool). Element-wise operators operate independently only on elements of the input tensor at the same spatial location, without involving cross-location data sampling, and the output tensor has the same dimension as the input tensor; these include tensor addition operators (Add) and activation functions (ReLU).
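The classification in step S4.1 can be sketched as follows. This is an illustration only: the dictionary-based graph representation, the function names, and the choice to label a single-node subgraph as "root" are assumptions, and the operator type sets simply list the examples named in the text.

```python
SLIDING_WINDOW_TYPES = {"Conv", "ConvReLU", "MaxPool", "AvgPool"}
ELEMENT_WISE_TYPES = {"Add", "ReLU"}


def classify_positions(subgraph_ops, preds, succs):
    """Label each operator of the hotspot subgraph as root, middle, or leaf."""
    inside = set(subgraph_ops)
    roles = {}
    for op in subgraph_ops:
        has_inner_pred = any(p in inside for p in preds.get(op, []))
        has_inner_succ = any(s in inside for s in succs.get(op, []))
        if not has_inner_pred:
            roles[op] = "root"      # all predecessors lie outside the subgraph
        elif not has_inner_succ:
            roles[op] = "leaf"      # all successors lie outside the subgraph
        else:
            roles[op] = "middle"
    return roles


def classify_kind(op_type):
    """Return the computational category of an operator type."""
    if op_type in SLIDING_WINDOW_TYPES:
        return "sliding_window"
    if op_type in ELEMENT_WISE_TYPES:
        return "element_wise"
    raise ValueError("unclassified operator type: " + op_type)


ops = ["conv1+relu1", "conv2+relu2", "maxpool1"]
preds = {"conv1+relu1": ["input"], "conv2+relu2": ["conv1+relu1"], "maxpool1": ["conv2+relu2"]}
succs = {"conv1+relu1": ["conv2+relu2"], "conv2+relu2": ["maxpool1"], "maxpool1": ["conv3+relu3"]}
roles = classify_positions(ops, preds, succs)
```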
[0069] Step S4.2: Reconstruct the height dimension of the memory hotspot tensor: Based on the classification of nodes in step S4.1, obtain the category of each operator, and perform a reverse traversal from the leaf node to the root node. Based on the current fission factor $n$, recursively recalculate the output tensor height $H_o$ and input tensor height $H_i$ of each operator. The initial value of the fission factor $n$ is the minimum fission factor. The height dimensions of the input tensor and output tensor of each operator are calculated recursively until the input and output heights of all nodes have been recalculated, which yields the height dimension reconstruction result of the memory hotspot tensor.
[0070] This embodiment performs fission along the tensor height dimension. Therefore, to reconstruct the height dimension of the memory hotspot tensor, the nodes are first classified according to step S4.1 to obtain the category of each operator, and the output tensor height $H_o$ and input tensor height $H_i$ of each operator are then recalculated. Specifically: For sliding window operators at leaf node positions, the output tensor height is $H_o = \lceil H_o^{orig} / n \rceil$, where ceil denotes rounding up, $H_o^{orig}$ is the original output height, and $n$ is the current fission factor; the input tensor height is $H_i = (H_o - 1) \times S + K$, where $S$ is the stride and $K$ is the kernel size. Furthermore, if the current fission copy is at the beginning or end of the sequence (i.e., the first or last copy), the impact of padding on the effective input region must be considered, and a boundary padding correction is applied to the input height calculation, where $P$ is the padding size. Furthermore, if (original input height + 2P - K) is not divisible by $S$, a stride divisibility correction is performed: the input height $H_i$ is increased by 1.
[0071] For element-wise operators at leaf node positions, the input tensor height equals the output tensor height: $H_i = H_o$.
[0072] For sliding window operators at intermediate and root node positions, the output tensor height $H_o$ equals the input height of the downstream operator, and the input tensor height is $H_i = (H_o - 1) \times S + K$. Similarly, boundary padding correction and stride divisibility correction are also required.
[0073] For element-wise operators at intermediate and root node positions, the output tensor height $H_o$ equals the input height of the downstream operator, and the input tensor height is $H_i = H_o$.
[0074] Step S4.3: Fission of memory hotspot subgraph and reconstruction of computation graph: Based on the height dimension reconstruction result obtained in step S4.2, the memory hotspot subgraph in the original computation graph old_graph is split and reconstructed to generate an optimized new computation graph new_graph.
[0075] As an optional implementation, the specific steps for splitting the memory hotspot subgraph and reconstructing the computation graph include: Step S4.3.1: Initialize the new computation graph: Copy the original computation graph old_graph to generate the initial new computation graph new_graph. According to the new tensor dimensions reconstructed in step S4.2, traverse and update the input and output tensor dimensions of the operators in the memory hotspot subgraph of the current new computation graph new_graph. Use the corresponding memory hotspot in the current new computation graph new_graph structure as the first copy of the memory hotspot subgraph after fission.
[0076] Specifically, the original computation graph `old_graph` is copied to generate an initial new computation graph: `new_graph = old_graph.Clone()`, where `Clone()` is the graph copy function. Further, based on the new tensor dimensions calculated in step S4.2, the input and output tensor dimensions of the operators in the memory hotspot subgraphs of the new computation graph `new_graph` are traversed and updated, and the corresponding memory hotspots in the current structure of `new_graph` are used as the first copy of the memory hotspot subgraph after fission.
[0077] Step S4.3.2: Insert Concat operator: Insert a Concat operator after the leaf node of the first replica of the memory hotspot subgraph to be used as a concatenation operator to merge the outputs of multiple fission replicas.
[0078] Specifically, the implementation steps for inserting the Concat operator are as follows: Get the leaf node `leaf_op` of the memory hotspot subgraph; Create a new concatenation operator concat_op based on the leaf node leaf_op, and set its name to "concat_" + leaf_op->name, with type "Concat".
[0079] Establish a data stream connection: Use the output tensor of the leaf node leaf_op as the input of the concatenation operator concat_op.
[0080] Set the leaf node leaf_op as the predecessor node of the concatenation operator concat_op.
[0081] Set the concat_op operator as the successor node of the leaf node leaf_op.
[0082] Connect the output of the concat_op operator to the downstream user of the original leaf node leaf_op.
[0083] Insert the concat_op operator into the operation set ops of the new graph.
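The Concat insertion of step S4.3.2 can be sketched in Python as follows. The `Op` and `Value` classes are hypothetical stand-ins for the graph structures named in the text (the field `definer` replaces `def`, which is a Python keyword), and the name given to the Concat output tensor is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Value:
    name: str
    definer: Optional["Op"] = None                 # operator that produces this tensor
    users: List["Op"] = field(default_factory=list)


@dataclass(eq=False)
class Op:
    name: str
    type: str
    inputs: List[Value] = field(default_factory=list)
    outputs: List[Value] = field(default_factory=list)
    preds: List["Op"] = field(default_factory=list)
    succs: List["Op"] = field(default_factory=list)


def insert_concat(ops: List[Op], leaf_op: Op) -> Op:
    """Insert a Concat operator between leaf_op and its downstream users."""
    leaf_out = leaf_op.outputs[0]
    concat_op = Op(name="concat_" + leaf_op.name, type="Concat")
    concat_out = Value(name=concat_op.name + "_out", definer=concat_op)  # assumed name
    concat_op.outputs.append(concat_out)

    # Connect the Concat output to the former downstream users of the leaf.
    for user in leaf_out.users:
        user.inputs = [concat_out if v is leaf_out else v for v in user.inputs]
        user.preds = [concat_op if p is leaf_op else p for p in user.preds]
        concat_out.users.append(user)
        concat_op.succs.append(user)

    # Data flow and predecessor/successor links: leaf_op -> concat_op.
    leaf_out.users = [concat_op]
    concat_op.inputs.append(leaf_out)
    concat_op.preds.append(leaf_op)
    leaf_op.succs = [concat_op]

    ops.append(concat_op)  # add to the operation set of the new graph
    return concat_op


# Tiny example: maxpool1 (leaf) feeds conv3+relu3 through tensor D.
d = Value("D")
maxpool1 = Op("maxpool1", "MaxPool", outputs=[d])
d.definer = maxpool1
conv3 = Op("conv3+relu3", "ConvReLU", inputs=[d], preds=[maxpool1])
d.users.append(conv3)
maxpool1.succs.append(conv3)
ops = [maxpool1, conv3]
concat_op = insert_concat(ops, maxpool1)
```

Outputs of later replicas would be appended to `concat_op.inputs` as they are created.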
[0084] Step S4.3.3: Construct the mapping tables of the computation graph: Construct a vertex mapping table nameToVertex to store the mapping from vertex names to vertex pointers in the graph, construct an operator mapping table nameToOp to store the mapping from operator names to operator pointers, and construct a tensor mapping table nameToValue to store the mapping from all tensor names to tensor pointers.
[0085] Specifically, a vertex mapping table `nameToVertex` is constructed in the computation graph to store the mapping from vertex (input, output, operator) names to vertex pointers. All input nodes of the computation graph are traversed, and the name of each input tensor is mapped to its corresponding input vertex `Vertex`, where `Vertex` represents a pointer to the node. Further, all output nodes of the computation graph are traversed, and the name of each output tensor is mapped to its corresponding output vertex. Finally, all operator nodes of the computation graph are traversed, and the name of each operator is mapped to its operator vertex.
[0086] Furthermore, an operator mapping table `nameToOp` is constructed to store the mapping from operator names to operator pointers. All operator nodes are traversed to establish the mapping from operator names to operator pointers (`Op`).
[0087] Furthermore, a tensor mapping table `nameToValue` is constructed to store the mappings from tensor names to tensor pointers. For each operator, all input tensors (except for parameter types) are traversed, and a mapping is established between tensor names and tensor pointers. Similarly, for each operator, all output tensors are traversed, and a mapping is established between tensor names and tensor pointers.
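A minimal sketch of the three mapping tables of step S4.3.3, using plain dictionaries. The record layout (operators and tensors as dicts, an `is_param` flag on each tensor, tuples as stand-ins for vertex pointers) is an assumption made for illustration.

```python
def build_mapping_tables(graph_inputs, graph_outputs, ops):
    """Build nameToVertex, nameToOp, and nameToValue as Python dicts."""
    name_to_vertex = {}
    for name in graph_inputs:
        name_to_vertex[name] = ("input", name)
    for name in graph_outputs:
        name_to_vertex[name] = ("output", name)

    name_to_op = {}
    name_to_value = {}
    for op in ops:
        name_to_vertex[op["name"]] = ("op", op["name"])
        name_to_op[op["name"]] = op
        for tensor in op["inputs"]:
            if not tensor["is_param"]:          # skip weights and biases
                name_to_value[tensor["name"]] = tensor
        for tensor in op["outputs"]:
            name_to_value[tensor["name"]] = tensor
    return name_to_vertex, name_to_op, name_to_value


conv1 = {
    "name": "conv1+relu1",
    "inputs": [
        {"name": "A", "is_param": False},
        {"name": "w1", "is_param": True},
        {"name": "b1", "is_param": True},
    ],
    "outputs": [{"name": "B", "is_param": False}],
}
name_to_vertex, name_to_op, name_to_value = build_mapping_tables(["input"], ["output"], [conv1])
```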
[0088] Step S4.3.4: Create and connect new subgraph copies: if the fission factor $n$ of the current memory hotspot subgraph is greater than 1, new memory hotspot subgraph copies are generated and connected sequentially.
[0089] Specifically, new memory hotspot subgraph copies are generated in turn from the second copy (split_idx=2) to the $n$-th copy. That is, when the fission factor $n$ of the memory hotspot subgraph is greater than 1, the remaining $n - 1$ new memory hotspot subgraph copies are created and connected. Optionally, the specific steps for generating and connecting a new memory hotspot subgraph copy are as follows: Create a new operator node: For each memory hotspot subgraph replica, traverse the hotspot subgraph in data flow hierarchy order, create an Op-type replica operator new_op for the original operator original_op, and set the new operator name. For example, set the new operator name to new_op->name = original_op->name + "_split" + replica number.
[0090] Create new input tensors for new operator nodes: Create a new tensor `new_input` for each non-parametric input of the replica operator `new_op`, set the tensor name, establish the definition relationship with the corresponding upstream operator of the replica, and assign the new input tensor to the input of the new operator. For example, set the current tensor name to the original input tensor name + "_split" + replica index, and establish the definition relationship with the corresponding upstream operator of the replica as: `new_input->def = nameToOp[upstream operator name + "_split" + replica index]`. Further, assign the new input tensor to the input of the new operator: `new_op->inputs.push_back(new_input)`.
[0091] Create new operator node output tensors: Create a new tensor new_output for each output of the copy operator new_op and set the name of the new output tensor, copy all attributes of the original output, establish the output tensor definition relationship, and add the new output tensor to the output tensor set of the new operator new_op. For example, when creating a new tensor new_output for each output of new_op, set the name of the new output tensor to the original output name original_op->outputs[0]->name + "_split" + copy number; further, copy all attributes of the original output (data type, shape, etc.) and establish the output tensor definition relationship: new_output->def = new_op; further, add the new output tensor to the output tensor set of the new operator new_op: new_op->outputs.push_back(new_output).
[0092] Establish connections between operators: Upstream connection: Connects the input of the replica operator new_op to the output of the upstream operator of the corresponding replica.
[0093] Downstream connections: The output of the replica operator `new_op` is connected to the input of the downstream operator of the replica. If it is a leaf node replica, it is connected to the input of the concatenation operator `concat_op`.
[0094] Update computation graph: Insert the newly created operator into the current new computation graph new_graph.
[0095] Synchronously update the vertex mapping table, operator mapping table, and tensor mapping table: Synchronously update the newly created operator and input / output tensor information to the vertex mapping table, operator mapping table, and tensor mapping table.
[0096] Step S4.3.5: Update the predecessor and successor relationships of each operator in the new computation graph new_graph to generate the optimized new computation graph new_graph.
[0097] Specifically, after completing the creation and connection of replicas, the predecessor and successor relationships of relevant operators in the new graph are updated uniformly, including: Update successor relationships: Update the list of successor vertices of each operator based on the list of actual users of the output tensor of each operator.
[0098] Update predecessor relationships: Update the list of predecessor vertices for each operator based on the actual producer (definer) of the input tensor.
[0099] After the above reconstruction of the computation graph, the portion of the computation graph new_graph corresponding to the original memory hotspot subgraph has been transformed into: a) Contains n' copies of memory hotspot subgraphs with the same topology.
[0100] b) The outputs of each subgraph copy are concatenated along the height dimension at the end of the subgraph using the Concat operator. The output tensor size of the Concat operator is the same as the output tensor size of the original leaf operator.
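The replica naming and wiring of step S4.3.4 can be illustrated with a name-based sketch for a linear hotspot chain. The dictionary records and the tensor names A to D are assumptions; only the "_split" + replica number suffix follows the text.

```python
def create_replicas(hot_ops, n, concat_inputs):
    """Create replicas 2..n of a linear hotspot chain and feed their leaf outputs to Concat.

    hot_ops lists the first replica in data flow order; each record holds the
    operator name, type, and the names of its non-parameter input and output.
    """
    name_to_op = {op["name"]: op for op in hot_ops}
    for split_idx in range(2, n + 1):
        suffix = "_split" + str(split_idx)
        for original_op in hot_ops:
            new_op = {
                "name": original_op["name"] + suffix,
                "type": original_op["type"],
                # Suffixing both tensor names keeps each replica's operators
                # connected to the upstream operator of the same replica.
                "input": original_op["input"] + suffix,
                "output": original_op["output"] + suffix,
            }
            name_to_op[new_op["name"]] = new_op
        # The leaf output of this replica becomes another Concat input.
        concat_inputs.append(hot_ops[-1]["output"] + suffix)
    return name_to_op


hot_ops = [
    {"name": "conv1+relu1", "type": "ConvReLU", "input": "A", "output": "B"},
    {"name": "conv2+relu2", "type": "ConvReLU", "input": "B", "output": "C"},
    {"name": "maxpool1", "type": "MaxPool", "input": "C", "output": "D"},
]
concat_inputs = ["D"]  # the first replica's leaf output is already connected
name_to_op = create_replicas(hot_ops, 4, concat_inputs)
```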
[0101] Step S4.4: Iterative search and decision: Calculate the peak memory usage $M_{peak}$ of the optimized new computation graph new_graph, and compare the current peak memory usage $M_{peak}$ with the GSM capacity constraint $M_{GSM}$ of the current hardware. If $M_{peak} \le M_{GSM}$, record the current fission factor $n$ as the minimum feasible fission factor $n'$ that satisfies the GSM capacity constraint of the current hardware and proceed to step S5; otherwise, let $n = n + 1$ and return to step S4.2 to continue the search.
[0102] Specifically, the memory usage at each scheduling moment can be calculated according to steps S3.1 and S3.2, from which the peak memory usage $M_{peak}$ of the new computation graph new_graph is obtained. The current peak memory usage $M_{peak}$ is compared with the target memory constraint $M_{GSM}$. If $M_{peak} \le M_{GSM}$, the current fission factor $n$ and the corresponding new computation graph are recorded, and the method proceeds to step S5; otherwise, let $n = n + 1$ and return to step S4.2 to continue the search.
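The search loop of step S4.4 can be sketched as below. The callback `peak_memory_kb` stands in for steps S4.2 and S4.3 followed by the peak computation, and the upper bound `max_factor` is an assumption added so the loop always terminates. In the toy table, the peaks for n = 1 to 3 and n = 5 are placeholders; only the 192 KB value at n = 4 and the 200 KB constraint echo the worked example later in this description.

```python
def find_min_fission_factor(peak_memory_kb, gsm_capacity_kb, max_factor=64):
    """Return the smallest n whose peak memory fits in GSM, or None if none fits."""
    for n in range(1, max_factor + 1):
        if peak_memory_kb(n) <= gsm_capacity_kb:
            return n
    return None


# Placeholder peak memory (KB) of the reconstructed graph for each candidate n.
toy_peaks_kb = {1: 512, 2: 300, 3: 230, 4: 192, 5: 160}
n_min = find_min_fission_factor(lambda n: toy_peaks_kb[n], 200, max_factor=5)
print(n_min)  # 4
```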
[0103] Step S5: Construct a hardware-aware fission cost model: Construct a cost model of the CNN model operators to quantify the execution time overhead of the memory hotspot subgraph. Based on this operator cost model, calculate the time overhead of the memory hotspot subgraph before fission and its execution time overhead after fission with the minimum feasible fission factor. Determine the fission decision for the memory hotspot subgraph by comparing the time overhead before fission with the execution time overhead after fission. Apply the determined fission decision to the computation graph to obtain the current optimized computation graph.
[0104] In this embodiment, a cost model of CNN model operators is constructed to quantify the execution time overhead of memory hotspot subgraphs. Then, the time overhead cost of memory hotspot subgraphs before fission and the execution time overhead after fission are calculated. A hardware-aware fission cost model can be constructed. Based on this model, fission decisions can be determined for memory hotspot subgraphs according to the cost before fission and the execution time overhead after fission.
[0105] As an optional implementation method, the specific steps include: Step S5.1: Define hardware parameters: Use the key hardware parameters of the target vector accelerator as the basic parameters for cost modeling, where the hardware parameters include the peak floating-point performance $P_{peak}$, the total capacity $M_{GSM}$ and access bandwidth $BW_{GSM}$ of the Global Shared Memory (GSM), the off-chip DDR memory access bandwidth $BW_{DDR}$, and so on.
[0106] Step S5.2: Establish the cost model of CNN model operators: To quantify the execution time overhead of memory hotspot subgraphs, an execution overhead model is established for the various operators within the memory hotspot subgraph. This operator cost model includes two parts: computational overhead and memory access overhead. The computational overhead is estimated from the floating-point operation count $FP$ of the operator and the peak performance $P_{peak}$ of the hardware computing units. The memory access overhead is determined by the amount of data transferred between different levels of storage, including input tensors, weight parameter tensors, and output tensors.
[0107] Specifically, for common operator types in memory hotspot subgraphs, the following calculation models for floating-point operation count and data transfer volume can be adopted: Convolution operator Conv: floating-point computation $FP = H_o \times W_o \times F \times K^2 \times C \times 2$ (one multiplication and one addition each); input data volume $D_{in} = H_o \times W_o \times K^2 \times C$ (considering convolution memory access characteristics); weight data volume $D_w = F \times C \times K^2$; output data volume $D_{out} = H_o \times W_o \times F$. Here $H_o$ and $W_o$ are the height and width of the output tensor, $C$ is the number of input channels, $F$ is the number of output channels, and $K$ is the size of the convolution kernel or pooling kernel.
[0108] Pooling operator Pool: floating-point computation $FP = H_o \times W_o \times C \times K^2$; input data volume $D_{in} = H_o \times W_o \times K^2 \times C$; output data volume $D_{out} = H_o \times W_o \times C$.
[0109] Tensor addition operator Add: floating-point computation $FP = H_o \times W_o \times C$; input data volume $D_{in} = H_o \times W_o \times C$; output data volume $D_{out} = H_o \times W_o \times C$.
[0110] Tensor concatenation operator Concat: floating-point computation $FP = 0$; input data volume $D_{in} = H_o \times W_o \times C$; output data volume $D_{out} = H_o \times W_o \times C$.
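The four operator models above translate directly into a small helper. The function name, the dictionary keys, and the sample shapes (in particular the output width of 32 and the pooling shape) are assumptions for illustration.

```python
def op_cost(op_type, Ho, Wo, C, K=1, F=0):
    """Return floating-point operation count and element counts for one operator."""
    if op_type == "Conv":
        return {
            "fp": Ho * Wo * F * K * K * C * 2,   # one multiply and one add per term
            "in": Ho * Wo * K * K * C,
            "w": F * C * K * K,
            "out": Ho * Wo * F,
        }
    if op_type == "Pool":
        return {"fp": Ho * Wo * C * K * K, "in": Ho * Wo * K * K * C, "w": 0, "out": Ho * Wo * C}
    if op_type == "Add":
        return {"fp": Ho * Wo * C, "in": Ho * Wo * C, "w": 0, "out": Ho * Wo * C}
    if op_type == "Concat":
        return {"fp": 0, "in": Ho * Wo * C, "w": 0, "out": Ho * Wo * C}
    raise ValueError("no cost model for operator type: " + op_type)


# Hypothetical shapes: a 3x3 convolution from 3 to 64 channels with a 32x32 output.
conv_cost = op_cost("Conv", Ho=32, Wo=32, C=3, K=3, F=64)
pool_cost = op_cost("Pool", Ho=16, Wo=16, C=64, K=2)
print(conv_cost["fp"])  # 3538944
```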
[0111] Step S5.3: Calculate the cost of the memory hotspot subgraph before fission. Based on the operator cost model in step S5.2, sum the floating-point computation cost and data transfer cost of each operator in the memory hotspot subgraph to obtain the cost of the memory hotspot subgraph before fission. Note that in this case the input, weight, and output data being transferred are all stored in DDR.
[0112] Specifically, the time cost $T_{before}$ of the memory hotspot subgraph before fission is calculated as:
$$T_{before} = \sum_{op_i \in G_{hot}} \left( \frac{FP_i}{P_{peak}} + \frac{(D_{w,i} + D_{in,i} + D_{out,i}) \times B}{BW_{DDR}} \right) \quad (3)$$
where $P_{peak}$ is the peak floating-point computing performance of the vector accelerator, $BW_{DDR}$ is the DDR memory access bandwidth, $G_{hot}$ is the set of operators corresponding to the memory hotspot subgraph, $op_i$ is the $i$-th operator in the memory hotspot subgraph, $B$ is the number of bytes occupied by the data type, $FP_i$ is the floating-point computation of operator $op_i$, $D_{w,i}$ is the memory usage of the weight tensor of operator $op_i$, $D_{in,i}$ is the memory usage of its input tensor, and $D_{out,i}$ is the memory usage of its output tensor.
[0113] Step S5.4: Calculate the execution time cost $T_{after}$ of the memory hotspot subgraph after fission.
[0114] Specifically, $T_{after}$ includes the additional computational and data transfer overhead introduced by fission, with intermediate tensors stored in GSM. The execution time overhead $T_{after}$ is calculated as:
$$T_{after} = \sum_{j=1}^{N} \sum_{op_i \in G_{hot}} \left( \frac{FP_i^{(j)}}{P_{peak}} + \frac{D_{w,i}^{(j)} \times B}{BW_{DDR}} + \frac{(D_{in,i}^{(j)} + D_{out,i}^{(j)}) \times B}{BW_{GSM}} \right) \quad (4)$$
where $P_{peak}$ is the peak floating-point computing performance of the vector accelerator, $BW_{DDR}$ is the DDR memory access bandwidth, $BW_{GSM}$ is the GSM storage bandwidth, $B$ is the number of bytes occupied by the data type, $FP_i^{(j)}$ is the floating-point computation of operator $op_i$ in the $j$-th part, $D_{w,i}^{(j)}$ is the memory usage of the weight tensor of operator $op_i$ in the $j$-th part, $D_{in,i}^{(j)}$ is the memory usage of its input tensor in the $j$-th part, $D_{out,i}^{(j)}$ is the memory usage of its output tensor in the $j$-th part, and $N$ is the number of subgraphs generated after fission.
[0115] Step S5.5: Define the hardware-aware fission decision model: compare the execution time overhead before and after fission, and provide a clear fission decision for each memory hotspot subgraph.
[0116] Specifically, if the execution time cost after fission is less than the time cost before fission, that is, $T_{after} < T_{before}$, fission ("Fission") is performed on the subgraph; otherwise, if the execution time cost after fission is greater than or equal to the time cost before fission, that is, $T_{after} \ge T_{before}$, no fission ("No-Fission") is performed on that subgraph. That is, the hardware-aware fission decision model is as follows:
$$Decision(G_{hot}) = \begin{cases} \text{Fission}, & T_{after}(G_{hot}) < T_{before}(G_{hot}) \\ \text{No-Fission}, & T_{after}(G_{hot}) \ge T_{before}(G_{hot}) \end{cases} \quad (5)$$
where $T_{before}(G_{hot})$ and $T_{after}(G_{hot})$ are the time cost before fission and the execution time cost after fission of the memory hotspot subgraph $G_{hot}$, respectively.
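The cost comparison of steps S5.3 to S5.5 can be sketched as follows. This is one reading of the text, under explicit assumptions: before fission all weight, input, and output traffic is charged to DDR bandwidth; after fission weights are charged to DDR and feature maps to GSM; compute and memory times are simply added. The per-operator numbers are made up, and only the three hardware parameters come from the example configuration given later in this description.

```python
def time_before(ops, peak_flops, bw_ddr, bytes_per_elem):
    """Cost before fission: every tensor is read from or written to DDR."""
    return sum(op["fp"] / peak_flops
               + (op["w"] + op["in"] + op["out"]) * bytes_per_elem / bw_ddr
               for op in ops)


def time_after(parts, peak_flops, bw_ddr, bw_gsm, bytes_per_elem):
    """Cost after fission: weights from DDR, intermediate tensors in GSM (assumed)."""
    total = 0.0
    for part in parts:                      # one entry per fission replica
        for op in part:
            total += op["fp"] / peak_flops
            total += op["w"] * bytes_per_elem / bw_ddr
            total += (op["in"] + op["out"]) * bytes_per_elem / bw_gsm
    return total


def decide(t_before, t_after):
    """Fission only when it gives a strict net time benefit."""
    return "Fission" if t_after < t_before else "No-Fission"


PEAK_FLOPS, BW_DDR, BW_GSM, BYTES = 128e9, 32e9, 80e9, 4

# Made-up single-operator subgraph, split into two parts with some halo overhead.
original = [{"fp": 1_000_000, "w": 1_000, "in": 100_000, "out": 100_000}]
split_part = [{"fp": 550_000, "w": 1_000, "in": 55_000, "out": 50_000}]

t_before = time_before(original, PEAK_FLOPS, BW_DDR, BYTES)
t_after = time_after([split_part, split_part], PEAK_FLOPS, BW_DDR, BW_GSM, BYTES)
decision = decide(t_before, t_after)
```

With these placeholder numbers the after-fission cost is lower, so `decision` is "Fission"; equal costs yield "No-Fission".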
[0117] Step S6: Determine the storage allocation strategy for each intermediate tensor in the memory hotspot subgraph based on the fission decision determined in step S5.
[0118] As an optional implementation method, the specific steps include: Step S6.1: Define a tensor storage allocation map to record the final storage allocation decision for each tensor. Each record in the tensor storage allocation map corresponds to a tensor in the computation graph.
[0119] Specifically, each record in the mapping table corresponds to a tensor in the computation graph and can contain the following core fields: Sequence Number: uniquely identifies each record in the table; Tensor Identifier: the name of the tensor; Decision Result: the fission decision of the entire memory hotspot subgraph associated with this tensor, with two possible values: "Fission" indicates that the subgraph containing this tensor adopted fission optimization, and "No-Fission" indicates that the subgraph containing this tensor did not adopt fission optimization and keeps its original structure; Storage Location: the final physical storage location of this tensor on the target hardware, also with two possible values: "GSM" indicates that the tensor will be allocated to the high-speed Global Shared Memory, and "DDR" indicates that the tensor will be allocated to the low-speed off-chip DDR memory. Other descriptive fields, i.e., optional text fields, can also be included to record additional information.
[0120] Step S6.2: Based on the fission decision determined in step S5, if the decision result of the memory hotspot subgraph is fission ("Fission"), update the tensor storage allocation mapping table, mark the decision result of the intermediate tensors generated by the subgraph as "Fission", and mark the storage location as "GSM"; if the decision result of the memory hotspot subgraph is no fission ("No-Fission"), update the tensor storage allocation mapping table, mark the decision result of the intermediate tensors generated by the subgraph as "No-Fission", and mark the storage location as "DDR".
Step S7: Global convergence determination and final solution output.
[0121] Specifically, after the decision and the mapping table update are completed for the current memory hotspot subgraph, the decision result (i.e., whether to perform fission) is applied to the global computation graph to generate a partially optimized intermediate computation graph. It is then determined whether the global memory peak of the current computation graph still exceeds the memory constraint. If the constraint is still exceeded, the method returns to step S3 and starts the next round of memory hotspot identification and optimization with the intermediate computation graph as input; if the constraint is satisfied, the iteration terminates, and the optimized computation graph and the intermediate tensor storage allocation mapping table are output.
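The mapping table update of step S6.2 and the convergence test of step S7 can be sketched as below. The record keys, the helper names, and the tensor names and decisions in the usage lines are hypothetical.

```python
def update_allocation_table(table, tensor_names, decision, note=""):
    """Append one record per intermediate tensor of a decided hotspot subgraph."""
    location = "GSM" if decision == "Fission" else "DDR"
    for name in tensor_names:
        table.append({
            "seq": len(table) + 1,     # sequence number, unique per record
            "tensor": name,
            "decision": decision,      # "Fission" or "No-Fission"
            "location": location,      # "GSM" or "DDR"
            "note": note,              # optional descriptive field
        })
    return table


def converged(global_peak_kb, gsm_capacity_kb):
    """Step S7 check: stop iterating once the global peak fits the constraint."""
    return global_peak_kb <= gsm_capacity_kb


allocation = []
update_allocation_table(allocation, ["B", "C"], "Fission")
update_allocation_table(allocation, ["E", "F"], "No-Fission", note="kept original structure")
```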
[0122] This invention identifies memory hotspots that cause memory bottlenecks and determines fission decisions based on memory hotspot subgraphs. This ensures that fission optimization directly addresses performance bottlenecks and effectively avoids ineffective optimization operations on non-critical operators. In determining fission decisions for memory hotspot subgraphs, a hardware-aware cost model is constructed using the time overhead of the memory hotspot subgraph before and after fission. The fission decision is determined based on the time overhead before and after fission, elevating the fission decision criterion from "whether the intermediate tensor can be placed in memory" to "whether it can improve inference speed." This allows fission to be performed only when it brings net performance benefits, solving the performance degradation problem caused by blind fission in traditional methods. Furthermore, by dynamically searching for the minimum fission factor that satisfies a given GSM capacity constraint, it supports fission for sliding windows and element-wise operators, and can automatically generate optimal graph transformation and storage allocation schemes for vector accelerators with different memory configurations.
[0123] The following example, which deploys a CNN model named testModel.onnx to a target vector accelerator, further illustrates this invention. The key parameters of the target hardware are set as follows: 64 GB of off-chip DDR storage, 200 KB of Global Shared Memory (GSM), 32 GB/s DDR memory access bandwidth, 80 GB/s GSM memory access bandwidth, and a peak computing-unit performance of 128 GFLOPS.
[0124] Step S1: Construct the intermediate representation structure of the computation graph to obtain the initial computation graph.
[0125] Figure 2 shows the original computation graph of the testModel.onnx model, which contains 12 operators: convolution operator (conv1), activation operator (relu1), convolution operator (conv2), activation operator (relu2), max pooling operator (maxpool1), convolution operator (conv3), activation operator (relu3), convolution operator (conv4), activation operator (relu4), max pooling operator (maxpool2), flatten operator (flatten1), and fully connected operator (gemm). In the convolution operators, w and b represent the weight tensor and bias tensor, respectively. Figure 3 shows the Graph structure, whose attributes include: the graph name "testModel", the graph input "input", the graph output "output", params representing the weight parameters in the graph, and ops representing the operations in the graph.
[0126] Step S2: Optimize the initial computation graph according to the predefined operator fusion rules.
[0127] By applying the predefined operator fusion rules, consecutive "Conv" and "ReLU" operators are fused into a single "ConvReLU" operator. The resulting computation graph is reduced to eight operators, as shown in Figure 4. Specifically, it includes: a first convolution-activation fusion operator (conv1+relu1) formed by fusing convolution operator conv1 and activation operator relu1; a second convolution-activation fusion operator (conv2+relu2) formed by fusing convolution operator conv2 and activation operator relu2; a first max pooling operator maxpool1; a third convolution-activation fusion operator (conv3+relu3) formed by fusing convolution operator conv3 and activation operator relu3; a fourth convolution-activation fusion operator (conv4+relu4) formed by fusing convolution operator conv4 and activation operator relu4; a second max pooling operator maxpool2; a flatten operator flatten1; and a fully connected operator gemm. The input tensor of the first convolution-activation fusion operator (conv1+relu1) is denoted by A, and the output tensors of the first convolution-activation fusion operator (conv1+relu1), the second convolution-activation fusion operator (conv2+relu2), the first max pooling operator (maxpool1), the third convolution-activation fusion operator (conv3+relu3), the fourth convolution-activation fusion operator (conv4+relu4), and the second max pooling operator (maxpool2) are denoted by B, C, D, E, F, and G, respectively. Figure 5 shows the Op structure of the operator "conv1+relu1", with the following attributes: the operation name is "conv1+relu1" and the operation type is "ConvReLU"; the input tensors of the operation are {A, w1, b1} and the output tensor is {B}; the attributes include the window size kernels, pads, strides, group, and dilations; the predecessor operations are preds = {input}; and the successor operations are succs = {conv2+relu2}.
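The Conv + ReLU fusion of step S2 can be illustrated on the 12-operator sequence of this example. The sketch assumes a purely linear chain in which each ReLU is the only consumer of the preceding Conv; the tuple representation and type strings such as "Flatten" and "Gemm" are illustrative.

```python
def fuse_conv_relu(op_sequence):
    """Fuse each Conv that is directly followed by a ReLU into one ConvReLU operator."""
    fused = []
    i = 0
    while i < len(op_sequence):
        name, op_type = op_sequence[i]
        next_is_relu = i + 1 < len(op_sequence) and op_sequence[i + 1][1] == "ReLU"
        if op_type == "Conv" and next_is_relu:
            relu_name = op_sequence[i + 1][0]
            fused.append((name + "+" + relu_name, "ConvReLU"))
            i += 2                      # consume both operators
        else:
            fused.append((name, op_type))
            i += 1
    return fused


original_ops = [
    ("conv1", "Conv"), ("relu1", "ReLU"), ("conv2", "Conv"), ("relu2", "ReLU"),
    ("maxpool1", "MaxPool"), ("conv3", "Conv"), ("relu3", "ReLU"),
    ("conv4", "Conv"), ("relu4", "ReLU"), ("maxpool2", "MaxPool"),
    ("flatten1", "Flatten"), ("gemm", "Gemm"),
]
fused_ops = fuse_conv_relu(original_ops)
print(len(fused_ops))  # 8
```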
[0128] Step S3: Identify memory hotspots during the CNN model inference process and obtain a memory hotspot subgraph.
[0129] A scheduling sequence is generated from the fused computation graph, and the memory usage at each scheduling moment is computed. The instantaneous peak memory usage during model execution is calculated to be 512 KB, which exceeds the 200 KB GSM constraint. By extracting the active tensors at the peak moment over two iterations, two key memory hotspot subgraphs are identified, as shown in Figure 6. They are denoted as subgraph1 (including the conv1+relu1, conv2+relu2, and maxpool1 operators) and subgraph2 (including the conv3+relu3, conv4+relu4, and maxpool2 operators).
[0130] Step S4: Adaptively solve the fission factor of the memory hotspot subgraph: Starting from the minimum fission factor, the memory hotspot subgraph is fissioned, and an iterative search is conducted to find the minimum feasible fission factor that satisfies the current GSM capacity constraint.
[0131] Using two different examples, we will explain in detail how to perform adaptive fission solution and cost evaluation on hotspot subgraphs.
[0132] Example 1: Fission decision for subgraph subgraph1 (positive net benefit, fission chosen). The system searches for a feasible fission factor for subgraph1 starting from n=1. During the iterative search, when attempting fission factor n=4, the system performs the following core operations. Step S4.2 is applied to reconstruct the height dimension of the memory hotspot tensors: the system performs dimension reconstruction on subgraph1 based on n=4. The data flow of subgraph1 is conv1+relu1 → conv2+relu2 → maxpool1. Based on the node classification in step S4.1, maxpool1 is the leaf node, conv2+relu2 is an intermediate node, and conv1+relu1 is the root node. All three are sliding window operators. Next, starting from the leaf node maxpool1, the dimensions are reconstructed in reverse (upstream): 1. For the leaf node maxpool1: its original output height is 16. According to the rule, the output height of the leaf node is Ho = ceil(original output height / n) = ceil(16 / 4) = 4.
[0133] 2. For the intermediate node conv2+relu2: its downstream operator is maxpool1. Therefore, the output height Ho of conv2+relu2 should equal the input height of maxpool1. Based on the pooling parameters (assuming kernel size K=2, stride S=2, padding P=0), we can deduce: Ho(conv2) = (Ho(maxpool1) - 1) × S + K = (4 - 1) × 2 + 2 = 8.
[0134] 3. Obtain the attribute parameters of conv2+relu2: K=3, S=1, P=1. According to the input height formula for sliding window operators, Hi = (Ho - 1) × S + K, we can calculate Hi(conv2) = (8 - 1) × 1 + 3 = 10. The stride divisibility correction must then be checked: determine whether (original input height + 2P - K) is divisible by S. Assuming the original total input height is 32, (32 + 2 × 1 - 3) / 1 = 31, which is divisible, so Hi(conv2) remains 10.
[0135] 4. For the root node conv1+relu1, its downstream operator is conv2+relu2. Therefore, the output height Ho(conv1) of conv1+relu1 should be equal to the input height of conv2+relu2, that is, Ho(conv1) = Hi(conv2) = 10.
[0136] 5. Obtain the attribute parameters of conv1+relu1: K=3, S=2, P=1. Calculate its input height: Hi(conv1) = (10 - 1) × 2 + 3 = 21. Perform the step-size divisibility correction (the original input height is 64): (64 + 2 × 1 - 3) / 2 = 31.5, which is not divisible, so Hi(conv1) = Hi(conv1) + 1 = 21 + 1 = 22.
[0137] At this point, for n = 4, the input/output heights of each operator of subgraph1 after fission are as shown in Figure 7.
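The reverse height reconstruction walked through above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `reconstruct_heights` and the dictionary layout are hypothetical, only sliding window operators are handled, and the boundary padding correction is omitted.

```python
import math

def reconstruct_heights(ops, n):
    """Walk from the leaf operator back to the root and return
    (name, Hi, Ho) for one fission copy of each operator.

    ops is ordered root -> leaf; each entry holds kernel size K, stride S,
    padding P, and the original (pre-fission) input/output heights.
    """
    result = []
    downstream_hi = None
    for op in reversed(ops):
        if downstream_hi is None:
            # Leaf node: split the original output height across n copies.
            ho = math.ceil(op["orig_ho"] / n)
        else:
            # Intermediate/root node: output height equals downstream input height.
            ho = downstream_hi
        hi = (ho - 1) * op["S"] + op["K"]
        # Step-size divisibility correction, checked on the original input height.
        if (op["orig_hi"] + 2 * op["P"] - op["K"]) % op["S"] != 0:
            hi += 1
        result.append((op["name"], hi, ho))
        downstream_hi = hi
    return list(reversed(result))

# Operator parameters taken from Example 1.
subgraph1 = [
    {"name": "conv1+relu1", "K": 3, "S": 2, "P": 1, "orig_hi": 64, "orig_ho": 32},
    {"name": "conv2+relu2", "K": 3, "S": 1, "P": 1, "orig_hi": 32, "orig_ho": 32},
    {"name": "maxpool1", "K": 2, "S": 2, "P": 0, "orig_hi": 32, "orig_ho": 16},
]

for name, hi, ho in reconstruct_heights(subgraph1, n=4):
    print(name, "Hi =", hi, "Ho =", ho)
```

With the parameters of Example 1, the sketch yields the same heights as steps 1 to 5: Ho = 4 for maxpool1, Hi = 10 for conv2+relu2, and Hi = 22 for conv1+relu1.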
[0138] Furthermore, based on the new dimensions above, four structurally identical subgraph copies are created, and a Concat operator is inserted at the end to complete the graph reconstruction and generate the candidate fission subgraph subgraph1'.
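As a rough illustration of this graph reconstruction, the sketch below builds n chained copies of an operator sequence and routes every leaf copy into a single Concat node. The function name, the `_part{k}` naming scheme, and the edge-list representation are assumptions made for illustration; the method itself operates on the Graph, Op, and ValueT structures.

```python
def build_fission_copies(op_names, n, concat_name="concat"):
    """Return (nodes, edges) for n chained copies of op_names whose
    leaf operators all feed one Concat operator."""
    nodes, edges = [], []
    for part in range(n):
        # Hypothetical naming scheme for the operators of each copy.
        copy_names = [f"{name}_part{part}" for name in op_names]
        nodes.extend(copy_names)
        # Chain the operators inside the copy in data-flow order.
        edges.extend(zip(copy_names, copy_names[1:]))
        # The leaf of every copy feeds the shared Concat operator.
        edges.append((copy_names[-1], concat_name))
    nodes.append(concat_name)
    return nodes, edges

nodes, edges = build_fission_copies(
    ["conv1+relu1", "conv2+relu2", "maxpool1"], n=4
)
print(len(nodes), "operators,", len(edges), "edges")
```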
[0139] Furthermore, the peak memory usage of the current computation graph is calculated. For the fission factor n = 4, the system calculates the peak memory usage of the current memory hotspot subgraph as 192 KB, which is less than the hardware constraint of 200 KB. Therefore, n = 4 is determined to be a feasible solution. The system records this fission factor and the corresponding candidate subgraph subgraph1'. The earlier candidates n = 1, 2, 3 were eliminated during the iteration because their evaluated peak memory usage exceeded 200 KB.
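The iterative search can be sketched as a simple loop over candidate factors. The function name, the `max_factor` cutoff, and the use of `<=` for the capacity test are assumptions; `toy_peak` is a made-up stand-in for the real peak-memory analysis, chosen only so that it gives 192 KB at n = 4.

```python
def find_min_fission_factor(peak_memory_kb, gsm_capacity_kb, max_factor=64):
    """Return the smallest n in [1, max_factor] whose peak memory fits
    within the GSM capacity, or None if no candidate fits."""
    for n in range(1, max_factor + 1):
        if peak_memory_kb(n) <= gsm_capacity_kb:
            return n
    return None

def toy_peak(n):
    # Toy model (assumption): 768 KB split evenly across n copies.
    return 768 / n

print(find_min_fission_factor(toy_peak, gsm_capacity_kb=200))
```

Because candidates are tried in increasing order, the first factor that fits is also the minimum feasible one.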
[0140] Steps S5-S6: Based on the hardware-aware fission cost model, determine the fission decision for the memory hotspot subgraph according to the execution time overhead before and after fission, and determine the storage allocation of the intermediate tensors.
[0141] Based on the feasible fission factor solved above, which satisfies the memory constraint, the system calls the hardware-aware cost model to quantitatively evaluate the performance before and after fission.
[0142] The system first calculates the computational cost and memory access cost of each operator based on the operator cost model defined in step S5.1. The specific calculation process is as follows: (1) conv1+relu1 operator: parameters K=3, Hi=64, Ho=32, C=3, F=64.
[0143] The number of floating-point operations (FP) is: Ho × Wo × F × K² × C × 2 = 32 × 32 × 64 × 3² × 3 × 2 = 3,538,944 FLOPs.
[0144] Computational overhead = FP / peak floating-point performance = 3,538,944 / (128 × 10^9) × 10^3 ≈ 0.027648 ms.
[0145] Input data volume = Ho × Wo × K² × C = 32 × 32 × 3² × 3 = 27,648 elements.
[0146] Weight data volume = F × C × K² = 64 × 3 × 3² = 1,728 elements.
[0147] Output data volume = Ho × Wo × F = 32 × 32 × 64 = 65,536 elements.
[0148] Total DDR accesses (bytes) = (27,648 + 1,728 + 65,536) × 4 Bytes = 379,648 Bytes.
[0149] Memory access overhead = total DDR accesses / DDR bandwidth = 379,648 / (32 × 10^9) × 10^3 ≈ 0.011864 ms.
[0150] Estimated execution time ≈ max(computational overhead, memory access overhead) = max(0.027648, 0.011864) = 0.027648 ms.
[0151] (2) conv2+relu2 operator: parameters K=3, Hi=32, Ho=32, C=64, F=64.
[0152] The number of floating-point operations (FP) is 32 × 32 × 64 × 3² × 64 × 2 = 75,497,472 FLOPs.
[0153] Computational overhead = 75,497,472 / (128 × 10^9) × 10^3 ≈ 0.589824 ms.
[0154] Input data volume = 32 × 32 × 3² × 64 = 589,824 elements.
[0155] Weight data volume = 64 × 64 × 3² = 36,864 elements.
[0156] Output data volume = 32 × 32 × 64 = 65,536 elements.
[0157] Total DDR accesses (bytes) = (589,824 + 36,864 + 65,536) × 4 = 2,768,896 Bytes.
[0158] Memory access overhead = 2,768,896 / (32 × 10^9) × 10^3 ≈ 0.086528 ms.
[0159] Estimated execution time ≈ max(computational overhead, memory access overhead) = max(0.589824, 0.086528) = 0.589824 ms.
[0160] (3) maxpool1 operator: parameters K=2, Hi=32, Ho=16, C=64.
[0161] The number of floating-point operations (FP) is Ho × Wo × C × K² = 16 × 16 × 64 × 2² = 65,536 FLOPs.
[0162] Computational overhead = 65,536 / (128 × 10^9) × 10^3 ≈ 0.000512 ms.
[0163] Input data volume = Hi × Wi × C = 32 × 32 × 64 = 65,536 elements.
[0164] Output data volume = Ho × Wo × C = 16 × 16 × 64 = 16,384 elements.
[0165] Total DDR accesses (bytes) = (65,536 + 16,384) × 4 = 327,680 Bytes.
[0166] Memory access overhead = 327,680 / (32 × 10^9) × 10^3 ≈ 0.010240 ms.
[0167] Estimated execution time ≈ max(computational overhead, memory access overhead) = max(0.000512, 0.010240) = 0.010240 ms.
[0168] The total pre-fission cost is the sum of the times of each operator, i.e. 0.027648 + 0.589824 + 0.010240 ≈ 0.6277 ms.
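The per-operator arithmetic above can be reproduced with a short Python sketch. It assumes what the worked example uses: a peak of 128 × 10^9 FLOP/s, a DDR bandwidth of 32 × 10^9 bytes/s, 4-byte elements, square feature maps (Wo = Ho), and an operator time equal to the larger of compute time and memory time. The function names are hypothetical.

```python
PEAK_FLOPS = 128e9      # peak floating-point performance used in the example
DDR_BANDWIDTH = 32e9    # DDR bandwidth in bytes/s used in the example
BYTES_PER_ELEMENT = 4   # 4-byte elements, as in the example

def op_time_ms(flops, elements, bandwidth=DDR_BANDWIDTH):
    """Estimated operator time in ms: max of compute time and memory time."""
    compute_ms = flops / PEAK_FLOPS * 1e3
    memory_ms = elements * BYTES_PER_ELEMENT / bandwidth * 1e3
    return max(compute_ms, memory_ms)

def conv_cost(ho, wo, k, c, f):
    """FLOPs and accessed elements (input + weights + output) of a convolution."""
    flops = ho * wo * f * k * k * c * 2
    elements = ho * wo * k * k * c + f * c * k * k + ho * wo * f
    return flops, elements

def pool_cost(hi, wi, ho, wo, k, c):
    """FLOPs and accessed elements (input + output) of a pooling operator."""
    flops = ho * wo * c * k * k
    elements = hi * wi * c + ho * wo * c
    return flops, elements

times = {
    "conv1+relu1": op_time_ms(*conv_cost(32, 32, 3, 3, 64)),
    "conv2+relu2": op_time_ms(*conv_cost(32, 32, 3, 64, 64)),
    "maxpool1": op_time_ms(*pool_cost(32, 32, 16, 16, 2, 64)),
}
total_ms = sum(times.values())

for name, t in times.items():
    print(f"{name}: {t:.6f} ms")
print(f"total before fission: {total_ms:.6f} ms")
```

Passing a different `bandwidth` to `op_time_ms` is one way to model the post-fission case, where intermediate tensors are served from GSM; the GSM bandwidth value itself is not given in this section.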
[0169] For subgraph1 with fission factor n = 4, the cost evaluation applies the same calculation rules, with two differences: (1) the input/output heights of each operator follow the dimension reconstruction results shown in Figure 7; (2) the access bandwidth for the intermediate tensors becomes the higher GSM bandwidth. The total execution overhead after fission, as calculated by the model, is approximately 0.6268 ms.
[0170] Decision: Since the cost after fission is less than the cost before fission, fission brings a performance benefit. Therefore, the system makes the decision "Fission". The subgraph1 part of the computation graph is replaced with subgraph1', and its intermediate tensors are marked as "assigned to GSM" in the storage allocation table.
[0171] Example 2: Decision for subgraph subgraph2 (negative return, no fission chosen). For subgraph2, the system uses the same hardware-aware cost model and evaluation rules as in Example 1 to quantitatively evaluate its performance before and after fission.
[0172] The system first searches for feasible fission factors for subgraph2 and ultimately identifies n = 2 as the fission factor that satisfies the GSM capacity constraint; its dimension reconstruction result is shown in Figure 8.
[0173] The cost model evaluation results are as follows: the pre-fission overhead is 0.889856 ms, and the post-fission overhead is approximately 0.923648 ms. Here the overhead after fission is greater than the overhead before fission, so fission would actually increase the execution time. Therefore, the system makes the "No-Fission" decision, retaining the original structure of subgraph2, whose intermediate tensors are marked as "allocated to DDR" in the storage allocation table.
[0174] Step S7: Output the optimized CNN computation graph and tensor storage allocation mapping table.
[0175] After all hotspot subgraphs have been processed, the final optimized computation graph and the final storage allocation scheme are generated. Figure 9 shows the optimized final computation graph, in which subgraph1 undergoes fission to become subgraph1'. subgraph1' contains four copies of subgraph1 (each including the conv1+relu1, conv2+relu2, and maxpool1 operators). For subgraph2 (which includes the conv3+relu3, conv4+relu4, and maxpool2 operators), the hardware-aware model calculates that the cost after fission is greater than the cost before fission, so no fission is performed on subgraph2. Figure 10 shows the generated storage allocation scheme. The operators of subgraph1 undergo fission, and the resulting intermediate tensors are stored in GSM, while the operators of subgraph2 do not undergo fission, and the resulting intermediate tensors are stored in DDR. The other, non-hotspot operators have memory usage lower than the GSM capacity, so their tensors are stored in GSM.
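A storage allocation mapping table of this kind might be represented as below. This is a sketch under stated assumptions: the record fields follow the four fields named for the table (sequence number, tensor identifier, decision result, storage location), while the class name, function name, and tensor names such as `conv1_out` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AllocationRecord:
    seq: int          # sequence number
    tensor_id: str    # tensor identifier
    decision: str     # "Fission" or "No-Fission"
    location: str     # "GSM" or "DDR"

def build_allocation_table(hotspots):
    """hotspots: list of (decision, [tensor names]), one per hotspot subgraph.

    Tensors of a fissioned subgraph go to GSM; otherwise they stay in DDR.
    """
    table = []
    for decision, tensors in hotspots:
        location = "GSM" if decision == "Fission" else "DDR"
        for tensor_id in tensors:
            table.append(
                AllocationRecord(len(table) + 1, tensor_id, decision, location)
            )
    return table

table = build_allocation_table([
    ("Fission", ["conv1_out", "conv2_out"]),     # subgraph1 (names hypothetical)
    ("No-Fission", ["conv3_out", "conv4_out"]),  # subgraph2 (names hypothetical)
])
for record in table:
    print(record)
```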
[0176] This embodiment further provides a computer device, including a processor and a memory, the memory being used to store a computer program, and the processor being used to execute the computer program to perform the method as described above.
[0177] Those skilled in the art will understand that the above embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram. The computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0178] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the invention. Therefore, any simple modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention should fall within the protection scope of the present invention.
Claims
1. An adaptive memory allocation method for vector-accelerator-oriented inference CNNs, characterized by comprising:
Step S1: Obtain the original CNN model for analysis, and construct the intermediate representation structure of the computation graph to obtain the initial computation graph, the intermediate representation structure including a ValueT structure for representing tensor data, an Op structure for representing computation operations, and a Graph structure for representing the computation process and data dependencies;
Step S2: Optimize the initial computation graph according to predefined operator fusion rules to obtain the initial optimized computation graph;
Step S3: Perform memory analysis on the current computation graph to identify memory hotspots during the CNN model inference process and obtain a memory hotspot subgraph;
Step S4: Starting from the minimum fission factor, apply fission to the memory hotspot subgraph, and find through iterative search the minimum feasible fission factor that satisfies the GSM capacity constraint of the target vector accelerator;
Step S5: Construct a cost model for the CNN model operators to quantify the execution time overhead of the memory hotspot subgraph; based on the cost model, calculate the time overhead of the memory hotspot subgraph before fission and after fission according to the minimum feasible fission factor; determine a fission decision for the memory hotspot subgraph based on the time overhead before fission and the execution time overhead after fission; and apply the determined fission decision to the computation graph to obtain the current optimized computation graph;
Step S6: Determine the storage allocation strategy for each intermediate tensor in the memory hotspot subgraph based on the fission decision determined in step S5;
Step S7: Determine whether the current global memory peak of the computation graph exceeds the GSM capacity constraint; if yes, return to step S3; otherwise, output the final optimized computation graph and the intermediate tensor storage allocation mapping table.
2. The adaptive memory allocation method for vector-accelerator-oriented inference CNNs according to claim 1, characterized in that step S3 includes:
Step S3.1: Construct a tensor active table based on the scheduling sequence of the computation graph, where the tensor active table records the effective lifetime of each tensor during execution of the CNN model, and the scheduling sequence is a topological ordering of the computation graph that satisfies the data dependencies of the CNN computation graph;
Step S3.2: Calculate the memory usage at each scheduling time based on the tensor active table constructed in step S3.1;
Step S3.3: Obtain the memory hotspot of the computation graph: query the peak value from the memory usage at each scheduling time, the peak value representing the peak memory usage during execution of the computation graph, and obtain the timestamp at which the peak occurs; query the tensor active table for the set of target tensors that are active at that timestamp; and obtain the Op operations corresponding to each tensor in the target tensor set to obtain the memory hotspot subgraph.
3. The adaptive memory allocation method for vector-accelerator-oriented inference CNNs according to claim 2, characterized in that step S3.1 includes: based on the scheduling sequence sched of the computation graph, calculating the lifecycle active interval [start time, end time] of each operator, where the start time is the point in time at which the operator begins execution, and the end time is the scheduling time at which the output tensor generated by the operator is last used by a downstream consumer operator, the active interval indicating the time range during which the operator remains active in memory; and, based on the active interval of each operator, constructing the tensor active table A over the set of operators in the computation graph, the number of operators, and the scheduling times, where the rows of the tensor active table A represent operator nodes and the columns correspond to timestamps; each element of the active table indicates whether the corresponding operator is active at the corresponding timestamp, that is, whether the timestamp lies between the start and end times of the operator's active interval, with 0 indicating that the operator is inactive at that timestamp.
4. The adaptive memory allocation method for vector-accelerator-oriented inference CNNs according to claim 1, characterized in that step S4 includes:
Step S4.1: Classify computation graph nodes: based on the relative position of the nodes in the data flow, the nodes in the memory hotspot subgraph are divided into three categories: root node, intermediate node, and leaf node; based on the computational characteristics of the operators, the operators are divided into sliding window operators and element-wise operators;
Step S4.2: Reconstruct the height dimension of the memory hotspot tensors: based on the node classification in step S4.1, obtain the category of each operator, and perform a reverse traversal from the leaf node to the root node, recursively recalculating the output tensor height and input tensor height of each operator based on the current fission factor, whose initial value is the minimum fission factor; the height dimensions of the input tensor and output tensor of each operator are calculated iteratively and recursively until those of all nodes have been recalculated, yielding the height dimension reconstruction result of the memory hotspot tensors;
Step S4.3: Fission of the memory hotspot subgraph and reconstruction of the computation graph: based on the height dimension reconstruction result obtained in step S4.2, the memory hotspot subgraph in the original computation graph old_graph is split and reconstructed to generate an optimized new computation graph new_graph;
Step S4.4: Calculate the peak memory usage of the optimized new computation graph new_graph and compare it with the GSM capacity constraint; if the peak memory usage satisfies the GSM capacity constraint, record the current fission factor as the minimum feasible fission factor satisfying the GSM capacity constraint and proceed to step S5; otherwise, update the fission factor and return to step S4.2 to continue the search.
5. The adaptive memory allocation method for vector accelerator-oriented inference CNNs according to claim 4, characterized in that, in step S4.2: for leaf nodes, the output tensor height of a sliding window operator is given by an expression involving the floor function, and its input tensor height is Hi = (Ho - 1) × S + K, where S is the step size and K is the convolution kernel size; if the current fission copy is at the beginning or end of the sequence, a boundary padding correction is performed, in which the calculation of the input height is corrected using the padding size P; if (original input height + 2P - K) is not divisible by S, a step-size divisibility correction is performed, in which the input height is increased by 1; the output tensor height and the input tensor height of an element-wise operator are each given by a corresponding expression; for intermediate and root nodes, the output tensor height of a sliding window operator equals the input tensor height of its downstream operator, and the boundary padding correction and the step-size divisibility correction are likewise performed; the output tensor height and input tensor height of an element-wise operator at intermediate and root node positions are each given by a corresponding expression.
6. The adaptive memory allocation method for vector-accelerator-oriented inference CNNs according to claim 4, characterized in that step S4.3 includes:
Step S4.3.1: Initialize the new computation graph: copy the original computation graph old_graph to generate the initial new computation graph new_graph; according to the new tensor dimensions reconstructed in step S4.2, traverse and update the input and output tensor dimensions of the operators of the memory hotspot subgraph in the current new computation graph new_graph; and use the corresponding memory hotspot structure in the current new computation graph new_graph as the first copy of the memory hotspot subgraph after fission;
Step S4.3.2: Insert a Concat operator: insert a Concat operator after the leaf node of the first copy of the memory hotspot subgraph, to serve as the concatenation operator that merges the outputs of the multiple fission copies;
Step S4.3.3: Construct the mapping tables of the computation graph: construct the vertex mapping table nameToVertex to store the mapping from vertex names to vertex pointers in the graph, construct the operator mapping table nameToOp to store the mapping from operator names to operator pointers, and construct the tensor mapping table nameToValue to store the mapping from all tensor names to tensor pointers;
Step S4.3.4: Create and connect new subgraph copies: if the fission factor of the current memory hotspot subgraph is greater than 1, new memory hotspot subgraph copies are generated and connected sequentially;
Step S4.3.5: Update the predecessor and successor relationships of each operator in the new computation graph new_graph to generate the optimized new computation graph new_graph.
7. The adaptive memory allocation method for vector-accelerator-oriented inference CNNs according to claim 6, characterized in that, In step S4.3.4, the steps for generating and connecting the new memory hotspot subgraph copy are as follows: Create new operator node: For each memory hotspot subgraph replica, traverse the hotspot subgraph in data flow hierarchy order, create an OP-type replica operator new_op for the original operator original_op, and set the new operator name; Create a new operator node's input tensor: Create a new tensor new_input for each non-parametric input of the replica operator new_op, set the tensor name, establish the definition relationship with the corresponding upstream operator of the replica, and assign the new input tensor to the input of the new operator; Create output tensors for new operator nodes: Create a new tensor new_output for each output of the copy operator new_op and set the name of the new output tensor, copy all the attributes of the original output, establish the output tensor definition relationship, and add the new output tensor to the output tensor set of the new operator new_op; Establish connections between operators: connect the input of the replica operator new_op to the output of the upstream operator of the corresponding replica, and connect the output of the replica operator new_op to the input of the downstream operator of the corresponding replica. If it is a leaf node replica, connect it to the input of the concatenation operator concat_op. Update computation graph: Insert the newly created operator into the current new computation graph new_graph; The newly created operators and input / output tensor information are synchronously updated in the vertex mapping table, operator mapping table, and tensor mapping table.
8. The adaptive memory allocation method for vector-accelerator-oriented inference CNNs according to any one of claims 1 to 7, characterized in that, in step S5: the time overhead before fission of the memory hotspot subgraph includes the floating-point computation overhead and data transfer overhead of all operators, and its calculation expression is defined in terms of the following quantities: the peak floating-point computing performance of the vector accelerator; the DDR memory access bandwidth; the set of operators corresponding to the memory hotspot subgraph; each individual operator in the memory hotspot subgraph; the number of bytes occupied by the data type; and, for each operator, its floating-point computation amount, the memory usage of its weight tensor, the memory usage of its input tensor, and the memory usage of its output tensor; the execution time overhead after fission of the memory hotspot subgraph includes the additional computation and data transfer overhead introduced by fission, and its calculation expression is defined in terms of the following quantities: the GSM storage bandwidth; and, for each part generated by fission, the floating-point computation amount of the operator, the memory usage of the operator weight tensor, the memory usage of the operator input tensor, and the memory usage of the operator output tensor, where N represents the number of subgraphs generated by fission; the step of determining the fission decision for the memory hotspot subgraph based on the time overhead before fission and the execution time overhead after fission includes: if the execution time overhead after fission is less than the time overhead before fission, fission is performed on the current memory hotspot subgraph; otherwise, fission is not performed on the current memory hotspot subgraph.
9. The adaptive memory allocation method for vector-accelerator-oriented inference CNNs according to any one of claims 1 to 7, characterized in that step S6 includes: defining a tensor storage allocation mapping table to record the final storage allocation decision for each tensor, where each record in the table corresponds to a tensor in the computation graph and includes the following fields: sequence number, tensor identifier, decision result, and storage location; the decision result represents the fission decision of the entire memory hotspot subgraph associated with the current tensor, and the storage location indicates the final physical storage location of the current tensor on the target hardware; if the fission decision of the current memory hotspot subgraph is fission, updating the tensor storage allocation mapping table by marking the decision result of the intermediate tensors generated by the current memory hotspot subgraph as fission and marking their storage location as GSM; if the fission decision of the current memory hotspot subgraph is no fission, updating the tensor storage allocation mapping table by marking the decision result of the intermediate tensors generated by the current memory hotspot subgraph as no fission and marking their storage location as DDR.
10. A computer device comprising a processor and a memory, the memory being used to store a computer program, characterized in that, The processor is used to execute the computer program to perform the method as described in any one of claims 1 to 9.