An image classification method and device based on dynamic parameterization
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- TSINGHUA UNIVERSITY
- Filing Date
- 2026-04-28
- Publication Date
- 2026-06-19
AI Technical Summary
Transformers have high computational complexity when processing high-resolution images or long sequences, which limits their scalability in the field of computer vision. Furthermore, convolutional neural networks have a limited receptive field, making it difficult to model long-distance dependencies.
The static linear and depthwise convolution modules in the convolutional neural network are replaced with dynamic linear and dynamic depthwise convolution modules, and global modeling is achieved through a dynamically parameterized multilayer perceptron. The weights of the dynamic depthwise convolution and dynamic linear modules are generated dynamically from the input features and used in forward propagation to achieve global interaction, thus avoiding explicit attention weight calculation.
While maintaining linear computational complexity, it achieves global modeling capabilities comparable to the Transformer, significantly reduces computational overhead, and is suitable for high-resolution images.
Figure CN122244635A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence, and in particular to an image classification method and device based on dynamic parameterization.
Background Technology
[0002] In recent years, Transformers have been widely used in computer vision owing to their powerful global modeling capability. In particular, the self-attention mechanism achieves global interaction by explicitly computing the similarity between all token pairs. However, its computational complexity grows quadratically with the length of the input sequence, which imposes a heavy computational and memory burden when processing high-resolution images or long sequences and thus limits its scalability in these settings. To alleviate this problem, researchers have proposed various efficient attention variants, such as sparse attention, low-rank approximation, and kernelized attention, but these methods still rely on explicitly computed weights and remain within the "weighted aggregation" paradigm.
[0003] Meanwhile, convolutional neural networks remain competitive in visual tasks owing to their local inductive bias and linear complexity, but their limited receptive field makes it difficult for them to model long-range dependencies. Furthermore, related methods still treat attention weights as quantities that are explicitly generated and then applied to recombine features, and thus fail to fundamentally change the nature of attention computation.
Summary of the Invention
[0004] In view of the above-mentioned technical problems, the present invention provides an image classification method and device based on dynamic parameterization, which aims to overcome the above problems or at least partially solve the above problems.
[0005] The first aspect of this invention provides an image classification method based on dynamic parameterization, the method comprising: replacing the first static linear module and the static depthwise convolution module in a convolutional neural network with a dynamic linear module and a dynamic depthwise convolution module, respectively, to obtain a target image classification model; inputting an image captured by an unmanned aerial vehicle (UAV) into the feature extraction module of the target image classification model to obtain a first input feature matrix; inputting the first input feature matrix into the dynamic depthwise convolution module to obtain a first dynamic weight increment, obtaining a first dynamic weight based on the first dynamic weight increment and the static weight of the dynamic depthwise convolution module, and processing the first input feature matrix based on the first dynamic weight and the first bias of the dynamic depthwise convolution module to obtain a first output feature matrix; obtaining a second input feature matrix based on the first input feature matrix and the first output feature matrix; inputting the second input feature matrix into the dynamic linear module to obtain a second dynamic weight increment, obtaining a second dynamic weight based on the second dynamic weight increment and the static weight of the dynamic linear module, and processing the second input feature matrix based on the second dynamic weight and the second bias of the dynamic linear module to obtain a second output feature matrix; and obtaining, based on the second output feature matrix, the classification result of the image captured by the UAV.
[0006] A second aspect of the present invention provides an electronic device comprising a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the image classification method based on dynamic parameterization as described in the first aspect of the present invention.
[0007] In the image classification method based on dynamic parameterization proposed in this invention, the target image classification model is obtained by replacing the first static linear module and the static depthwise convolution module in the convolutional neural network with a dynamic linear module and a dynamic depthwise convolution module, respectively. After the target image classification model is obtained, the image captured by the UAV is input into the feature extraction module of the model to obtain the first input feature matrix. The first input feature matrix is then input into the dynamic depthwise convolution module to obtain a first dynamic weight increment that varies with the first input feature matrix. Based on the first dynamic weight increment and the static weight of the dynamic depthwise convolution module, a first dynamic weight that likewise varies with the first input feature matrix is obtained. Based on the first dynamic weight and the first bias, the first input feature matrix is processed to obtain the first output feature matrix of the dynamic depthwise convolution module. Next, a second input feature matrix is obtained from the first input feature matrix and the first output feature matrix and is input into the dynamic linear module to obtain a second dynamic weight increment that varies with the second input feature matrix. Based on the second dynamic weight increment and the static weight of the dynamic linear module, a second dynamic weight that varies with the second input feature matrix is obtained, and based on the second dynamic weight and the second bias, the second input feature matrix is processed to obtain the second output feature matrix of the dynamic linear module.
Finally, the target image classification model obtains the classification result of the UAV image based on the second output feature matrix. In this way, the invention interprets the self-attention mechanism as a dynamically parameterized multilayer perceptron. By introducing the dynamic depthwise convolution module and the dynamic linear module, it departs from the paradigm of explicit attention weight calculation and compresses global context information into dynamic weights conditioned on the input features: the weights of both modules are generated dynamically from the input features, and global modeling is then achieved implicitly through forward propagation, so that global interaction is realized through dynamic parameters. This extends the effective receptive field to the entire image and achieves global modeling capability comparable to that of the Transformer while maintaining linear computational complexity, significantly reducing computational overhead.
Attached Figure Description
[0008] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0009] Figure 1 is a flowchart illustrating the steps of an image classification method based on dynamic parameterization according to an embodiment of the present invention; Figure 2 is a schematic structural diagram of a dynamically parameterized feature processing block according to an embodiment of the present invention; Figure 3 is a schematic diagram comparing the computational efficiency of a target image classification model and the DeiT-T model according to an embodiment of the present invention; Figure 4 is a schematic diagram comparing a dynamic linear module and a dynamic depthwise convolution module according to an embodiment of the present invention; Figure 5 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Implementation
[0010] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0011] In recent years, the Transformer architecture has become mainstream in computer vision, achieving state-of-the-art results on a variety of visual tasks. Its success is largely attributed to the self-attention mechanism, which has been widely adopted as the core mechanism for modeling long-range, global dependencies in Transformer models. Self-attention generates attention weights by explicitly computing pairwise token similarities and uses them to reweight and aggregate the value representations. This process inherently incurs quadratic complexity: the computational cost grows quadratically with the length of the input sequence.
[0012] Based on this, this invention has found that the self-attention mechanism can be interpreted as a dynamically parameterized multilayer perceptron (MLP); that is, self-attention is mathematically equivalent to a multilayer perceptron whose parameters are predicted dynamically from the input. In this view, global information is compressed into input-conditioned parameters, and global modeling is achieved implicitly by forward-propagating the input. Explicitly calculating attention weights is therefore not necessary; instead, dynamic parameter prediction itself acts as the mechanism for integrating global context. Specifically, given an input feature matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$ (where $N$ is the sequence length and $d$ is the feature dimension), the self-attention output for the $i$-th token can be written as $\mathbf{y}_i = \mathrm{Softmax}(\mathbf{q}_i \mathbf{K}^{\top})\,\mathbf{V}$, where $\mathbf{q}_i$ is the $i$-th query vector, $\mathbf{K} = \mathbf{X}\mathbf{W}_K$, $\mathbf{V} = \mathbf{X}\mathbf{W}_V$, $\mathbf{W}_K$ is the key projection matrix, and $\mathbf{W}_V$ is the value projection matrix. This expression can be read as a two-layer dynamic MLP: the weights of the first layer are the key matrix $\mathbf{K}$, the activation function is Softmax, and the weights of the second layer are the value matrix $\mathbf{V}$. Neither weight matrix is static; both are generated dynamically from the input, and the input itself is then processed by the resulting dynamic MLP to produce the global modeling result. From this perspective, self-attention is not an explicit pairwise weight calculation but an instance of input-conditioned parameter prediction, in which the relationships between tokens are encoded implicitly in the dynamic model parameters and the forward pass. The dynamically predicted parameters act as a compact representation that compresses the global information in the input, achieving implicit global modeling through the forward pass without relying on explicit token-level weighting. Self-attention is therefore essentially a dynamic multilayer perceptron.
Introducing dynamic parameterization into convolutional neural networks, which already possess an inductive bias well suited to visual data and have linear complexity, may therefore achieve better performance and efficiency at the same time.
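The equivalence described in [0012] can be checked numerically. The following is a minimal NumPy sketch written for illustration, not code from the patent: it computes self-attention once in the usual way and once as a two-layer MLP whose first-layer weight ($\mathbf{K}^{\top}$) and second-layer weight ($\mathbf{V}$) are built from the input, and confirms that the two outputs coincide. The function names are hypothetical, and the $1/\sqrt{d}$ scaling factor is omitted for brevity. Note that the reformulation by itself does not reduce cost; it only changes how the computation is read.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Standard form: explicitly build the N x N attention matrix.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T)              # (N, N) pairwise weights
    return A @ V

def dynamic_mlp(X, W_q, W_k, W_v):
    # The same computation read as a two-layer MLP whose weights are
    # predicted from the input: layer 1 weight = K^T, layer 2 weight = V.
    W1 = (X @ W_k).T                  # (d, N) input-conditioned first-layer weight
    W2 = X @ W_v                      # (N, d) input-conditioned second-layer weight
    H = softmax((X @ W_q) @ W1)       # Softmax plays the role of the activation
    return H @ W2

rng = np.random.default_rng(0)
N, d = 6, 4
X = rng.standard_normal((N, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
print(np.allclose(self_attention(X, W_q, W_k, W_v), dynamic_mlp(X, W_q, W_k, W_v)))  # True
```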
[0013] Accordingly, to at least partially solve one or more of the above problems and other potential problems, this invention proposes a global visual modeling method based on a dynamically parameterized network of linear complexity. The method treats the self-attention mechanism as a dynamically parameterized MLP in which global information is compressed into dynamic weights through input-conditioned parameter prediction. This departs from the paradigm of explicit attention weight calculation and turns global modeling into a process of input-conditioned parameter prediction followed by forward propagation. Specifically, dynamic depthwise convolution modules and dynamic linear modules are introduced whose weights are generated dynamically from the input features, so that global modeling is achieved implicitly through forward propagation and global interaction is achieved implicitly through dynamic parameters. This extends the effective receptive field to the entire image and achieves global modeling capability comparable to that of the Transformer while maintaining linear computational complexity, significantly reducing computational overhead.
[0014] Referring to Figure 1, which is a flowchart illustrating the steps of an image classification method based on dynamic parameterization according to an embodiment of the present invention, the image classification method provided in this embodiment includes at least the following steps:
Step S11: Replace the first static linear module and the static depthwise convolution module in the convolutional neural network with a dynamic linear module and a dynamic depthwise convolution module, respectively, to obtain the target image classification model.
[0015] In this embodiment, the convolutional neural network includes at least a feature extraction module, a first static linear module, and a static depthwise convolution module. As mentioned above, analysis in this invention reveals that self-attention can be interpreted as a dynamically parameterized two-layer multilayer perceptron. Building on this, this embodiment applies dynamic parameterization to the convolutional neural network: since linear layers and depthwise convolution layers are commonly used modules in convolutional neural networks, this embodiment converts them into dynamic linear layers and dynamic depthwise convolution layers. Specifically, the first static linear module in the convolutional neural network is replaced with a dynamic linear module, and the static depthwise convolution module is replaced with a dynamic depthwise convolution module, to obtain a target image classification model to be trained. This model is then trained as a dynamically parameterized convolutional neural network for image classification. The first static linear module can be any static linear module in the convolutional neural network. In this embodiment, a static linear module is a linear module with fixed weights (i.e., a conventional linear layer), and a static depthwise convolution module is a depthwise convolution module with fixed weights (i.e., a conventional depthwise convolution layer).
[0016] In one optional example, the first static linear module in the convolutional neural network is replaced with a dynamic linear module to be trained, and the static depthwise convolution module is replaced with a dynamic depthwise convolution module to be trained, yielding a target image classification model to be trained. A sample image carrying a class label is then input into this model, and the predicted class of the sample image is obtained from the model output. Based on the predicted class and the class label, the model is trained (i.e., the parameters of each of its modules are updated) until training is complete, yielding trained modules (such as a trained feature extraction module, a trained dynamic linear module, and a trained dynamic depthwise convolution module) and thus the trained target image classification model, i.e., the target image classification model referred to above. The trained feature extraction module, dynamic linear module, and dynamic depthwise convolution module are, respectively, the feature extraction module, dynamic linear module, and dynamic depthwise convolution module of the target image classification model.
[0017] Step S12: Input the image collected by the UAV into the feature extraction module of the target image classification model to obtain the first input feature matrix.
[0018] In this embodiment, the target image classification model is used to classify images captured by a UAV. The target image classification model includes a feature extraction module. When the image captured by the UAV is input into the model, the feature extraction module extracts features from it (for example, by embedding encoding) and outputs the first input feature matrix.
[0019] Step S13: Input the first input feature matrix into the dynamic depthwise convolution module to obtain the first dynamic weight increment; obtain the first dynamic weight based on the first dynamic weight increment and the static weight of the dynamic depthwise convolution module; process the first input feature matrix based on the first dynamic weight and the first bias of the dynamic depthwise convolution module to obtain the first output feature matrix.
[0020] In this embodiment, the first input feature matrix output by the feature extraction module is input to the dynamic depthwise convolution module of the target image classification model. Within this module, the first input feature matrix is first processed to obtain a first dynamic weight increment that depends on it; then, based on the first dynamic weight increment and the static weight of the module, a first dynamic weight that depends on the first input feature matrix is obtained; finally, the first input feature matrix is processed based on the first dynamic weight and the first bias of the module to obtain the first output feature matrix, which is the output of the dynamic depthwise convolution module. It will be understood that every dynamic depthwise convolution module in this embodiment processes its input in the same or a similar way, following step S13.
[0021] In this embodiment, the static weight and the first bias of the dynamic depthwise convolution module are, respectively, a fixed weight and a fixed bias obtained through model training. At the start of training, the initial weight and initial bias of the dynamic depthwise convolution module to be trained are randomly initialized; during training they are continuously updated until training ends, yielding a trained dynamic depthwise convolution module whose weight and bias are the static weight and the first bias of the module. During inference with the target image classification model, the static weight and first bias do not change with the input. The first dynamic weight, by contrast, is the dynamic convolution kernel weight; it changes with the first input feature matrix, which achieves spatially adaptive feature aggregation.
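[0022] In one embodiment, the static weight of the dynamic depthwise convolution module is $\mathbf{W}_s \in \mathbb{R}^{C \times K \times K}$, the first bias is $\mathbf{b} \in \mathbb{R}^{C}$, and the first output feature matrix is $\mathbf{Y} \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the number of channels, $H$ and $W$ are the spatial dimensions, and $K$ is the convolution kernel size.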
[0023] In one optional embodiment, the first dynamic weight is the sum of the first dynamic weight increment and the static weight of the dynamic depthwise convolution module.
[0024] In one optional embodiment, after obtaining the first dynamic weight, the first input feature matrix can be subjected to depthwise convolution (each channel is convolved independently) based on the first dynamic weight, and a first bias can be added to obtain the first output feature matrix.
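A minimal NumPy sketch of [0023] and [0024] follows: the first dynamic weight is formed as static weight plus increment, and each channel is then convolved with its own per-sample kernel before the first bias is added. It assumes an odd square kernel size K, zero "same" padding, a static kernel of shape (C, K, K), and a per-sample increment of shape (B, C, K, K); these shapes and the function name are illustrative assumptions, not details given in the patent. Like deep-learning frameworks, it computes cross-correlation.

```python
import numpy as np

def dynamic_depthwise_conv(X, W_static, delta_W, bias):
    """Per-sample depthwise convolution with 'same' zero padding (hypothetical helper).

    X:        (B, C, H, W) first input feature matrix
    W_static: (C, K, K)    static (trained) depthwise kernel
    delta_W:  (B, C, K, K) input-dependent first dynamic weight increment
    bias:     (C,)         first bias
    """
    B, C, H, W = X.shape
    K = W_static.shape[-1]
    p = K // 2
    W_dyn = W_static[None] + delta_W                 # first dynamic weight, (B, C, K, K)
    Xp = np.pad(X, ((0, 0), (0, 0), (p, p), (p, p)))
    Y = np.zeros_like(X, dtype=float)
    # Loop over the K*K kernel offsets only; each channel sees only its own kernel,
    # so the work grows linearly with H * W.
    for i in range(K):
        for j in range(K):
            patch = Xp[:, :, i:i + H, j:j + W]       # shifted view, (B, C, H, W)
            Y += W_dyn[:, :, i, j][:, :, None, None] * patch
    return Y + bias[None, :, None, None]

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 3, 5, 5))
W_s = rng.standard_normal((3, 3, 3))
dW = rng.standard_normal((2, 3, 3, 3))
b = rng.standard_normal(3)
Y = dynamic_depthwise_conv(X, W_s, dW, b)
print(Y.shape)  # (2, 3, 5, 5)
```

Because the increment differs per sample, two images in the same batch are filtered with different kernels even though they share W_static.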
[0025] Although depthwise convolution provides a spatial inductive bias, it is inherently local. This embodiment extends the principle of dynamic parameterization to the depthwise convolution module by predicting input-dependent depthwise convolution kernel weights (i.e., the first dynamic weights). Specifically, the module generates depthwise convolution kernel weights dynamically from the input features and applies these first dynamic weights through depthwise convolution, which naturally keeps the complexity linear in the spatial resolution. Thus, by predicting the depthwise convolution kernel weights dynamically from the global context, the depthwise convolution module can achieve a Transformer-level global receptive field through dynamic parameterization.
[0026] Step S14: Obtain the second input feature matrix based on the first input feature matrix and the first output feature matrix.
[0027] In this embodiment, a second input feature matrix can be obtained based on a first input feature matrix and a first output feature matrix. For example, the second input feature matrix can be obtained by concatenating or fusing the first input feature matrix and the first output feature matrix.
[0028] Step S15: Input the second input feature matrix into the dynamic linear module to obtain the second dynamic weight increment; obtain the second dynamic weight based on the second dynamic weight increment and the static weight of the dynamic linear module; process the second input feature matrix based on the second dynamic weight and the second bias of the dynamic linear module to obtain the second output feature matrix.
[0029] In this embodiment, the second input feature matrix is input to the dynamic linear module in the target image classification model. In the dynamic linear module, the second input feature matrix is first processed to obtain a second dynamic weight increment related to the second input feature matrix. Then, based on the second dynamic weight increment and the static weights of the dynamic linear module, a second dynamic weight related to the second input feature matrix is obtained. Finally, based on the second dynamic weights and the second bias of the dynamic linear module, the second input feature matrix is processed to obtain a second output feature matrix, which is the output of the dynamic linear module. It is understood that the processing procedures for each dynamic linear module in this embodiment are the same or similar, and the processing method of step S15 can be referred to.
[0030] In this embodiment, the static weights and second bias of the dynamic linear module are the fixed weights and fixed biases of the trained dynamic linear module, obtained through model training. At the start of model training, the initial weights and initial biases of the dynamic linear module to be trained are randomly initialized. During model training, the initial weights and initial biases of the dynamic linear module to be trained are continuously updated until training ends, resulting in a trained dynamic linear module. The weights and biases in this trained dynamic linear module are the static weights and second bias of the dynamic linear module. During the inference application of the target image classification model, the static weights and second bias of the dynamic linear module do not change with the input. The second dynamic weight changes dynamically with the change of the second input feature matrix.
[0031] It is understood that in this embodiment, the "static weights" will not change with the input, while the "dynamic weights (such as the first dynamic weight or the second dynamic weight)" will change with the input. "Static" means that the parameters will not change during the model inference process, while "dynamic" means that the parameters will change during the model inference process, such as the dynamic weights changing with different inputs.
[0032] In an optional embodiment, the second dynamic weight is the sum of the second dynamic weight increment and the static weight of the dynamic linear module.
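[0033] In one optional embodiment, after the second dynamic weight is obtained, the second input feature matrix is transformed by the second dynamic weight and the second bias is then added to obtain the second output feature matrix. For example, the second output feature matrix is $\mathbf{Y} = \mathbf{X}\hat{\mathbf{W}} + \mathbf{b}$, where $\mathbf{X} \in \mathbb{R}^{B \times N \times C_{\mathrm{in}}}$ is the second input feature matrix, $B$ is the batch size, $N$ is the sequence length or the number of spatial locations, $C_{\mathrm{in}}$ is the number of input channels, $\hat{\mathbf{W}} \in \mathbb{R}^{C_{\mathrm{in}} \times C_{\mathrm{out}}}$ is the second dynamic weight, $\mathbf{b} \in \mathbb{R}^{C_{\mathrm{out}}}$ is the second bias, and $C_{\mathrm{out}}$ is the number of output channels. The static weight of the dynamic linear module can likewise be expressed as $\mathbf{W}_s \in \mathbb{R}^{C_{\mathrm{in}} \times C_{\mathrm{out}}}$.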
[0034] This embodiment takes into account that the first static linear module operates along the channel dimension and performs no token-level mixing. However, when the parameters of the linear module are conditioned on global input statistics, they can implicitly integrate global context into the channel transformation. On this basis, this embodiment designs a dynamic linear module whose weights are adjusted dynamically by global feature statistics.
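The text does not spell out how the second dynamic weight increment is computed; it says only that the weights are adjusted by global feature statistics that do not depend on sequence length. The NumPy sketch below therefore assumes, hypothetically, that the increment comes from mean-pooling over tokens followed by a small two-layer generator (weights G1 and G2, tanh hidden activation). Only the "static weight plus increment" form follows the description; the generator, its parameter names, and the function name are illustrative assumptions.

```python
import numpy as np

def dynamic_linear(X, W_static, bias, G1, G2):
    """Dynamic linear layer: Y = X @ (W_static + dW(X)) + bias (illustrative sketch).

    X:        (B, N, C_in)   second input feature matrix
    W_static: (C_in, C_out)  static (trained) weight
    bias:     (C_out,)       second bias
    G1, G2:   hypothetical generator weights, (C_in, hidden) and (hidden, C_in * C_out)
    """
    B, N, C_in = X.shape
    C_out = W_static.shape[1]
    # Global statistics: one C_in-vector per sample, whatever the value of N.
    stats = X.mean(axis=1)                                          # (B, C_in)
    delta_W = (np.tanh(stats @ G1) @ G2).reshape(B, C_in, C_out)    # second dynamic weight increment
    W_dyn = W_static[None] + delta_W                                # second dynamic weight
    # Every token of sample b is transformed by that sample's W_dyn[b]: O(B * N * C_in * C_out).
    return np.einsum("bnc,bco->bno", X, W_dyn) + bias

rng = np.random.default_rng(2)
B, N, C_in, C_out, hid = 2, 7, 4, 5, 3
X = rng.standard_normal((B, N, C_in))
W_s = rng.standard_normal((C_in, C_out))
b = rng.standard_normal(C_out)
G1 = rng.standard_normal((C_in, hid))
G2 = rng.standard_normal((hid, C_in * C_out))
Y = dynamic_linear(X, W_s, b, G1, G2)
print(Y.shape)  # (2, 7, 5)
```

With G2 set to zero the layer reduces to an ordinary static linear layer, which is one way to see the dynamic part as a learned, input-conditioned correction.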
[0035] Step S16: Based on the second output feature matrix, obtain the classification result of the image collected by the UAV.
[0036] In this embodiment, after obtaining the second output feature matrix output by the dynamic linear module, the target image classification model can perform subsequent processing based on the second output feature matrix to obtain the classification result of the image collected by the UAV, thereby realizing image classification of the image collected by the UAV.
[0037] In this embodiment, the self-attention mechanism is interpreted as a dynamically parameterized multilayer perceptron. By introducing the dynamic depthwise convolution module and the dynamic linear module, the embodiment departs from the paradigm of explicit attention weight calculation and compresses global context information into dynamic weights conditioned on the input features: the weights of both modules are generated dynamically from the input features, global modeling is then achieved implicitly through forward propagation, and global interaction is thereby realized through dynamic parameters. This extends the effective receptive field to the entire image and achieves global modeling capability comparable to that of the Transformer while maintaining linear computational complexity, significantly reducing computational overhead. Furthermore, the dynamic parameters (dynamic weights) in this embodiment are generated from global feature statistics whose size is independent of the input sequence length, which keeps the complexity linear and makes the method applicable to high-resolution images.
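To make the linear versus quadratic scaling in [0037] concrete, the rough multiply-accumulate (MAC) counts below compare the two token-mixing products of self-attention with a dynamic depthwise convolution plus a dynamic linear layer. These formulas are back-of-the-envelope estimates written for this illustration, not figures from the patent. They ignore projections, normalization, and the small weight generators, which act on pooled statistics.

```python
def attention_macs(N, d):
    # Q @ K^T and A @ V each cost N * N * d multiply-accumulates.
    return 2 * N * N * d

def dynamic_block_macs(N, C, K, C_out):
    # Depthwise K x K convolution: N * C * K * K; dynamic linear: N * C * C_out.
    # Weight generators work on pooled statistics and are ignored here.
    return N * C * K * K + N * C * C_out

for side in (14, 28, 56):
    N = side * side  # number of tokens / spatial positions
    print(side, attention_macs(N, 64), dynamic_block_macs(N, 64, 3, 64))
```

Doubling the number of tokens quadruples the attention estimate but only doubles the estimate for the dynamic block, which is the practical meaning of "linear complexity" for high-resolution inputs.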
[0038] Furthermore, it is understood that in this embodiment, the process of inputting sample images carrying category labels into the target image classification model to be trained during model training, and obtaining the predicted category of the sample image output by the target image classification model to be trained, is the same as or similar to the process of inputting images collected by the UAV into the target image classification model to obtain the classification result of the UAV-collected image output by the target image classification model during model inference application. The training process can refer to the inference application process.
[0039] In conjunction with the above embodiments, in one implementation the present invention further provides an image classification method based on dynamic parameterization in which step S13 above, "inputting the first input feature matrix into the dynamic depthwise convolution module to obtain the first dynamic weight increment," specifically includes steps S21 to S24:
Step S21: Perform adaptive average pooling on the first input feature matrix through the pooling layer of the dynamic depthwise convolution module to obtain a downsampled feature matrix.
[0040] In this embodiment, the dynamic depthwise convolution module includes a pooling layer, which can be an adaptive average pooling layer. After the first input feature matrix is input into the module, the pooling layer applies adaptive average pooling to it to obtain the downsampled feature matrix.
[0041] This embodiment takes into account that if the pooling layer were a global average pooling layer, global average pooling of the first input feature matrix would compress all spatial dimensions but ignore the inherent spatial structure of the input, which may limit the adaptability of the generated convolution kernel. Therefore, in this embodiment the pooling layer is an adaptive average pooling layer, which applies adaptive average pooling to the first input feature matrix to obtain the downsampled feature matrix, thereby preserving structural information and achieving spatial adaptability.
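The difference between global and adaptive average pooling in [0041] can be sketched in NumPy as follows. The function below is a hypothetical helper, not the patent's implementation. It uses the floor/ceil bin rule applied by frameworks such as PyTorch's AdaptiveAvgPool2d: output cell i along an axis of length L averages the input range [floor(i*L/K), ceil((i+1)*L/K)). With K = 1 it reduces to global average pooling; with K = 3 a coarse 3×3 spatial layout survives.

```python
import numpy as np

def adaptive_avg_pool2d(X, K):
    """Adaptive average pooling of (B, C, H, W) onto a K x K grid (illustrative sketch)."""
    B, C, H, W = X.shape
    out = np.empty((B, C, K, K), dtype=float)
    for i in range(K):
        # floor(i*H/K) and ceil((i+1)*H/K) with integer arithmetic
        h0, h1 = (i * H) // K, -(-((i + 1) * H) // K)
        for j in range(K):
            w0, w1 = (j * W) // K, -(-((j + 1) * W) // K)
            out[:, :, i, j] = X[:, :, h0:h1, w0:w1].mean(axis=(2, 3))
    return out

rng = np.random.default_rng(3)
X = rng.standard_normal((1, 2, 6, 6))
print(adaptive_avg_pool2d(X, 3).shape)  # (1, 2, 3, 3): spatial layout kept
print(adaptive_avg_pool2d(X, 1).shape)  # (1, 2, 1, 1): global average pooling
```

The bins may overlap when H or W is not divisible by K, so any input resolution maps onto the same fixed grid.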
[0042] Step S22: Perform a convolution operation on the downsampled feature matrix through the first convolutional layer of the first multilayer perceptron in the dynamic depthwise convolution module to obtain the first convolution result.
[0043] In this embodiment, the dynamic depthwise convolution module further includes a first multilayer perceptron, which predicts the kernel weight increment from the downsampled feature matrix. Specifically, the first multilayer perceptron processes the downsampled feature matrix and maps it into the kernel space to obtain the first dynamic weight increment. In one embodiment, the first dynamic weight increment $\Delta\mathbf{W}$ can be obtained by $\Delta\mathbf{W} = \mathrm{MLP}(\mathbf{X}_p)$, where $\mathrm{MLP}$ is the first multilayer perceptron and $\mathbf{X}_p$ is the downsampled feature matrix.
[0044] Specifically, the first multilayer perceptron is a lightweight multilayer perceptron comprising a first convolutional layer, a GELU activation function, and a second convolutional layer, where both convolutional layers use 1×1 convolution kernels. After the downsampled feature matrix is obtained, the first convolutional layer of the first multilayer perceptron performs a convolution operation on it to obtain the first convolution result.
[0045] Step S23: Process the first convolution result using the GELU activation function in the first multilayer perceptron to obtain the first intermediate result.
[0046] In this embodiment, the first convolution result is input into the GELU activation function of the first multilayer perceptron, which processes it to obtain the first intermediate result.
[0047] Step S24: Perform a convolution operation on the first intermediate result through the second convolutional layer in the first multilayer perceptron to obtain the first dynamic weight increment.
[0048] In this embodiment, the first intermediate result is input into the second convolutional layer of the first multilayer perceptron, which convolves it to obtain a first dynamic weight increment that varies with the first input feature matrix.
[0049] In a specific optional example, the first dynamic weight increment can be obtained by the following formula: $\Delta W_{\mathrm{dw}} = \mathrm{Reshape}(\mathrm{MLP}(\tilde{X}))$ with $\tilde{X} = \mathrm{AAP}(X)$. Here MLP is the first multilayer perceptron, $\tilde{X}$ is the downsampled feature matrix, X is the first input feature matrix, and Reshape is the dimension transformation operation.
[0050] In a specific optional example, a spatially adaptive strategy is proposed to preserve structural information. Instead of compressing the input into a global scalar, the input is reduced to a fixed K×K grid aligned with the target convolution kernel. This is achieved by downsampling the first input feature matrix (feature map) with adaptive average pooling (AAP) to obtain a spatial correlation matrix, which is then mapped to the convolution kernel parameters. For example, with K=3, let the first input feature matrix be $X \in \mathbb{R}^{B \times C \times H \times W}$, where B is the batch size, C is the number of channels, and H×W is the spatial size. The adaptive average pooling layer pools X to spatial size K×K, giving the downsampled feature matrix $\tilde{X} \in \mathbb{R}^{B \times C \times K \times K}$. The first multilayer perceptron then processes $\tilde{X}$ to yield the first dynamic weight increment. The adaptive average pooling operation is what provides spatial adaptation.
[0051] In conjunction with any of the above embodiments, the present invention also provides an image classification method based on dynamic parameterization. In this method, step S13 above, "inputting the first input feature matrix into the dynamic depthwise convolution module to obtain the first dynamic weight increment," specifically includes the following steps S31 to S34: Step S31: Perform adaptive average pooling on the first input feature matrix through the pooling layer in the dynamic depthwise convolution module to obtain the downsampled feature matrix.
[0052] Step S31 in this embodiment is the same as or similar to step S21 in the previous embodiment. Refer to the description of step S21.
[0053] Step S32: Process the first input feature matrix using the Sigmoid activation function, a global average pooling layer, and preset parameters to obtain the activation intermediate processing result.
[0054] In this embodiment, the first input feature matrix is processed using the Sigmoid activation function, a global average pooling layer, and preset parameters to obtain the activation intermediate processing result. The preset parameters are trained parameters of the dynamic depthwise convolution module.
[0055] In a specific optional example, the activation intermediate processing result can be determined by the following formula: $\alpha = \mathrm{Sigmoid}(W \cdot \mathrm{GAP}(X))$. Here Sigmoid is the Sigmoid activation function, GAP is the global average pooling layer, W is the preset parameter, and X is the first input feature matrix. That is, the global average pooling layer first downsamples the first input feature matrix. The result is then multiplied by the preset parameter W to obtain a first result. Finally, the Sigmoid activation function processes the first result to obtain the activation intermediate processing result.
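As a rough illustration, the gate $\alpha = \mathrm{Sigmoid}(W \cdot \mathrm{GAP}(X))$ can be written in a few lines of NumPy. The text does not give the shape of the preset parameter W, so this sketch assumes a (C, C) matrix that yields one gate per channel. The function name is hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def activation_gate(x, w):
    """alpha = Sigmoid(W * GAP(X)).

    x: first input feature matrix, shape (B, C, H, W).
    w: trained "preset parameter", assumed here to be a (C, C) matrix.
    Returns one gate value in (0, 1) per sample and channel, shape (B, C).
    """
    pooled = x.mean(axis=(2, 3))   # GAP over the spatial axes: (B, C)
    return sigmoid(pooled @ w.T)   # multiply by W, then squash into (0, 1)


rng = np.random.default_rng(1)
x = rng.normal(size=(2, 4, 7, 7))
alpha = activation_gate(x, rng.normal(size=(4, 4)))
print(alpha.shape)  # (2, 4)
```

Because the Sigmoid output lies in (0, 1), α is well suited to act as a bounded magnitude for the weight increment.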
[0056] Step S33: Process the downsampled feature matrix with the second multilayer perceptron in the dynamic depthwise convolution module to obtain an intermediate processing result.
[0057] In this embodiment, the dynamic depthwise convolution module includes a second multilayer perceptron, which processes the downsampled feature matrix to obtain the intermediate processing result. The second multilayer perceptron may be the same as or different from the first multilayer perceptron; this is not limited.
[0058] Step S34: Obtain the first dynamic weight increment based on the intermediate processing result and the activation intermediate processing result.
[0059] In this embodiment, once the intermediate processing result and the activation intermediate processing result are obtained, the first dynamic weight increment is computed from them.
[0060] In a specific optional example, the first dynamic weight increment can be obtained with a magnitude-direction decoupling strategy using the following formula: $\Delta W_{\mathrm{dw}} = \alpha \cdot \frac{M}{\|M\|_F + \epsilon}$ with $M = \mathrm{MLP}(\tilde{X})$. Here α is the activation intermediate processing result, $\tilde{X}$ is the downsampled feature matrix, MLP is the second multilayer perceptron, M is the intermediate processing result, and $\|\cdot\|_F$ is the Frobenius norm. A small ε is added to prevent division by zero. The normalized term $M / (\|M\|_F + \epsilon)$ determines the direction of the increment, while α determines its magnitude.
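A minimal NumPy sketch of the decoupling formula follows. The text does not say which axes the Frobenius norm spans. This sketch assumes one norm per channel kernel, so that each generated K×K kernel has norm close to its gate value α. Random arrays stand in for the outputs of the Sigmoid gate and the second multilayer perceptron.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def magnitude_direction_delta(alpha, m, eps=1e-6):
    """Delta W = alpha * M / (||M||_F + eps).

    alpha: activation intermediate result, shape (B, C); sets each kernel's scale.
    m: intermediate result MLP_2(X_down), shape (B, C, K, K); sets its direction.
    Assumption: the Frobenius norm is taken over each channel's K x K kernel.
    """
    norm = np.sqrt((m ** 2).sum(axis=(2, 3), keepdims=True))  # (B, C, 1, 1)
    return alpha[:, :, None, None] * m / (norm + eps)


rng = np.random.default_rng(2)
alpha = sigmoid(rng.normal(size=(2, 4)))  # stand-in for Sigmoid(W * GAP(X))
m = rng.normal(size=(2, 4, 3, 3))         # stand-in for MLP_2(AAP(X))
delta_w = magnitude_direction_delta(alpha, m)
kernel_norms = np.sqrt((delta_w ** 2).sum(axis=(2, 3)))
print(np.allclose(kernel_norms, alpha, atol=1e-5))  # True
```

Because the gate is bounded in (0, 1), the size of the increment stays controlled however large the raw MLP output becomes.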
[0061] In conjunction with any of the above embodiments, in one implementation, the present invention also provides an image classification method based on dynamic parameterization. In this method, step S13 above, "inputting the first input feature matrix into the dynamic depthwise convolution module to obtain the first dynamic weight increment," may specifically include the following steps S41 and S42: Step S41: Process the first input feature matrix with the convolutional network in the dynamic depthwise convolution module to obtain an intermediate convolution processing result.
[0062] In this embodiment, the dynamic depthwise convolution module includes a lightweight convolutional network consisting of two 3×3 convolutional layers (with a channel bottleneck structure) and a GELU activation function. This convolutional network processes the first input feature matrix to obtain the intermediate convolution processing result.
[0063] Step S42: Perform adaptive average pooling on the intermediate convolution processing result through the adaptive average pooling layer in the dynamic depthwise convolution module to obtain the first dynamic weight increment.
[0064] In this embodiment, the dynamic depthwise convolution module includes an adaptive average pooling layer, which performs adaptive average pooling on the intermediate convolution processing result to obtain the first dynamic weight increment.
[0065] In a specific optional example, the first dynamic weight increment can be obtained with a convolution strategy using the following formula: $\Delta W_{\mathrm{dw}} = \mathrm{AAP}_{(K,K)}(\mathrm{ConvNet}(X))$. Here X is the first input feature matrix, ConvNet is the convolutional network, AAP is the adaptive average pooling operation, (K,K) is the output spatial size, and K=3.
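The convolution strategy can be sketched in NumPy as shown below. The sketch assumes zero-padded, stride-1 3×3 convolutions and a bottleneck of C → C/4 → C. The ratio, the tanh-approximate GELU, and the parameter names are illustrative choices, not taken from the text.

```python
import numpy as np


def adaptive_avg_pool2d(x, out_h, out_w):
    """PyTorch-style adaptive average pooling over the last two axes."""
    h, w = x.shape[-2:]
    out = np.empty(x.shape[:-2] + (out_h, out_w))
    for i in range(out_h):
        h0, h1 = (i * h) // out_h, ((i + 1) * h + out_h - 1) // out_h
        for j in range(out_w):
            w0, w1 = (j * w) // out_w, ((j + 1) * w + out_w - 1) // out_w
            out[..., i, j] = x[..., h0:h1, w0:w1].mean(axis=(-2, -1))
    return out


def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def conv3x3(x, weight, bias):
    """Stride-1, zero-padded 3x3 convolution (cross-correlation, as in CNN layers).

    x: (B, C_in, H, W); weight: (C_out, C_in, 3, 3); bias: (C_out,).
    """
    _, _, h, w = x.shape
    xp = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((x.shape[0], weight.shape[0], h, w))
    for di in range(3):
        for dj in range(3):
            patch = xp[:, :, di:di + h, dj:dj + w]
            out += np.einsum("oc,bchw->bohw", weight[:, :, di, dj], patch)
    return out + bias[None, :, None, None]


def conv_strategy_delta(x, params, k=3):
    """Delta W = AAP_{(K,K)}(ConvNet(X)) with a channel-bottleneck ConvNet."""
    hidden = gelu(conv3x3(x, params["w1"], params["b1"]))  # C -> C/4, then GELU
    feat = conv3x3(hidden, params["w2"], params["b2"])     # C/4 -> C
    return adaptive_avg_pool2d(feat, k, k)                 # (B, C, K, K)


rng = np.random.default_rng(5)
C, MID = 8, 2  # hypothetical channel width and bottleneck width
params = {
    "w1": rng.normal(scale=0.1, size=(MID, C, 3, 3)), "b1": np.zeros(MID),
    "w2": rng.normal(scale=0.1, size=(C, MID, 3, 3)), "b2": np.zeros(C),
}
x = rng.normal(size=(2, C, 14, 14))
delta_w = conv_strategy_delta(x, params)
print(delta_w.shape)  # (2, 8, 3, 3)
```

Unlike the pooling-first strategies, this variant convolves at full resolution before pooling. It therefore costs more, but the cost still grows linearly with H×W.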
[0066] In conjunction with any of the above embodiments, in one implementation, the present invention also provides an image classification method based on dynamic parameterization. In this method, the step S15 above, "inputting the second input feature matrix into the dynamic linear module to obtain the second dynamic weight increment," specifically includes steps S51 to S53: Step S51: Perform global average pooling on the second input feature matrix through the global average pooling layer in the dynamic linear module to obtain the global average pooling feature matrix.
[0067] To maintain linear complexity, the generation of parameters (dynamic weights) must be independent of the size of the input features. Accordingly, this embodiment compresses the features globally using a pooling strategy. Specifically, the dynamic linear module includes a global average pooling layer. After the second input feature matrix is input into the dynamic linear module, the global average pooling layer averages it over the sequence dimension to compress the sequence, producing the global average pooled feature matrix.
[0068] Step S52: The global average pooling feature matrix is processed by the third multilayer perceptron in the dynamic linear module to obtain the second intermediate result.
[0069] In this embodiment, the dynamic linear module includes a third multilayer perceptron, which is a lightweight multilayer perceptron. The third multilayer perceptron can be used to process the global average pooling feature matrix to obtain a second intermediate result.
[0070] Step S53: Perform dimensional transformation on the second intermediate result to obtain the second dynamic weight increment.
[0071] In this embodiment, after the second intermediate result is obtained, a dimension transformation is applied to it to obtain the second dynamic weight increment. In an optional example, the second dynamic weight increment can be determined by the following formula: $\Delta W_{\mathrm{lin}} = \mathrm{Reshape}(\mathrm{MLP}(\mathrm{GAP}(X)))$. Here GAP is the global average pooling layer, X is the second input feature matrix, MLP is the third multilayer perceptron, and Reshape is the dimension transformation operation.
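A minimal NumPy sketch of the pooling strategy is given below. It treats the second input feature matrix as a token sequence of shape (B, N, C_in). Two choices are assumptions because the text does not specify them: GELU as the hidden activation of the third multilayer perceptron, and an increment shaped (C_out, C_in). The layer sizes are hypothetical.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def pooling_strategy_delta(x, params, c_out):
    """Delta W = Reshape(MLP_3(GAP(X))).

    x: second input feature matrix as a token sequence, shape (B, N, C_in).
    Returns a per-sample weight increment, shape (B, C_out, C_in).
    """
    b, _, c_in = x.shape
    pooled = x.mean(axis=1)                                # GAP over the N tokens
    hidden = gelu(pooled @ params["w1"].T + params["b1"])
    flat = hidden @ params["w2"].T + params["b2"]          # (B, C_out * C_in)
    return flat.reshape(b, c_out, c_in)                    # Reshape to a weight matrix


rng = np.random.default_rng(3)
C_IN, C_OUT, HIDDEN = 4, 6, 8  # hypothetical sizes
params = {
    "w1": rng.normal(scale=0.1, size=(HIDDEN, C_IN)), "b1": np.zeros(HIDDEN),
    "w2": rng.normal(scale=0.1, size=(C_OUT * C_IN, HIDDEN)), "b2": np.zeros(C_OUT * C_IN),
}
x = rng.normal(size=(2, 49, C_IN))
delta_w = pooling_strategy_delta(x, params, C_OUT)
# Duplicating every token doubles N but leaves the mean, and hence Delta W, unchanged.
delta_w_long = pooling_strategy_delta(np.concatenate([x, x], axis=1), params, C_OUT)
print(delta_w.shape, np.allclose(delta_w, delta_w_long))  # (2, 6, 4) True
```

The MLP only ever sees a C_in-dimensional vector, so its cost is independent of the sequence length N. The check with duplicated tokens also shows the weakness noted in the next paragraph: different sequences with the same mean yield the same increment.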
[0072] In conjunction with any of the above embodiments, in one implementation, the present invention observes a limitation of the pooling strategy described above, which compresses the sequence with global average pooling (GAP) and then maps it to a weight increment with a lightweight MLP. Although this is efficient, compressing the entire sequence into a single vector may lose subtle feature interactions. To overcome this limitation, the present invention also provides an image classification method based on dynamic parameterization. In this method, step S15 above, "inputting the second input feature matrix into the dynamic linear module to obtain the second dynamic weight increment," may specifically include steps S61 to S64, or steps S61 to S63 and step S65: Step S61: Obtain the correlation matrix from the second input feature matrix and its transpose.
[0073] To maintain linear complexity, the generation of parameters (dynamic weights) must be independent of the size of the input features. Accordingly, this embodiment uses a correlation strategy that compresses global second-order statistics in order to capture higher-order feature interactions. The correlation strategy generates the weight increment from a correlation matrix through a linear or nonlinear mapping.
[0074] In this embodiment, the correlation matrix $R = X^{\top} X$ is first obtained from the second input feature matrix X and its transpose $X^{\top}$.
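The following NumPy sketch shows why this keeps weight generation independent of the input size. For a token matrix X of shape (N, C), the product $X^{\top} X$ is always C×C, and forming it costs O(N·C²), which is linear in N. The batched layout (B, N, C) and the function name are assumptions for illustration.

```python
import numpy as np


def correlation_matrix(x):
    """R = X^T X for each sample.

    x: second input feature matrix, shape (B, N, C). The result has shape (B, C, C)
    for any token count N, and forming it costs O(N * C^2), i.e. linear in N.
    """
    return np.einsum("bnc,bnd->bcd", x, x)


rng = np.random.default_rng(4)
for n in (196, 3136):  # e.g. 14x14 and 56x56 token grids
    r = correlation_matrix(rng.normal(size=(1, n, 8)))
    print(n, r.shape)   # the shape stays (1, 8, 8)
```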
[0075] Step S62: Process the correlation matrix using the first parameter to obtain a third intermediate result, and process the third intermediate result using the second parameter to obtain a fourth intermediate result.
[0076] In this embodiment, the dynamic linear module includes a first parameter, a second parameter, a third parameter, and a fourth parameter, all of which are trained parameters within the dynamic linear module. The first parameter can be used to perform a linear mapping on the correlation matrix to obtain a third intermediate result, and then the second parameter can be used to perform a linear mapping on the third intermediate result to obtain a fourth intermediate result.
[0077] In one embodiment, during training of the target image classification model to be trained, its model parameters are updated according to the predicted category of each sample image and the category label carried by that sample image. For example, the following parameters are updated until training is complete: those of the feature extraction module to be trained; those of the dynamic linear module to be trained (such as the initial weights, the initial biases, and the first, second, third, and fourth parameters to be trained); and those of the dynamic depthwise convolution module to be trained (such as the initial weights, the initial biases, and the preset parameters to be trained). Training yields the trained feature extraction module, the trained dynamic linear module, and the trained dynamic depthwise convolution module, and hence the trained target image classification model. The trained parameters of the dynamic linear module include at least its static weights, second biases, and first to fourth parameters. The trained parameters of the dynamic depthwise convolution module include at least its static weights, first biases, and preset parameters.
[0078] Step S63: Process the fourth intermediate result using the SiLU activation function to obtain the fifth intermediate result.
[0079] In this embodiment, the dynamic linear module includes the SiLU activation function. After the fourth intermediate result is obtained, the SiLU activation function applies a nonlinear mapping to it to obtain the fifth intermediate result.
[0080] Step S64: Process the fifth intermediate result using the third and fourth parameters to obtain the second dynamic weight increment.
[0081] In this embodiment, after the fifth intermediate result is obtained, a depth strategy further decomposes the prediction (i.e., the fifth intermediate result) through a low-rank transformation to reduce the number of floating-point operations. Specifically, the third and fourth parameters linearly map the fifth intermediate result to obtain the second dynamic weight increment.
[0082] Step S65: Determine the fourth intermediate result or the fifth intermediate result as the second dynamic weight increment.
[0083] In this embodiment, under a linear mapping, the fourth intermediate result is taken as the second dynamic weight increment. Alternatively, under a nonlinear mapping, the fifth intermediate result is taken as the second dynamic weight increment.
[0084] In an optional specific example, with a linear mapping strategy, the second dynamic weight increment $\Delta W_{\mathrm{lin}}$ can be obtained by the following formula: $\Delta W_{\mathrm{lin}} = W_2 W_1 R$, where R is the correlation matrix and $W_1$ and $W_2$ are the first parameter and the second parameter, respectively.
[0085] In an optional specific example, with a nonlinear mapping strategy, the second dynamic weight increment $\Delta W_{\mathrm{lin}}$ can be obtained by the following formula: $\Delta W_{\mathrm{lin}} = \mathrm{SiLU}(W_2 W_1 R)$, where R is the correlation matrix, $W_1$ and $W_2$ are the first parameter and the second parameter, respectively, and SiLU is the SiLU activation function.
[0086] In an optional specific example, with a depth strategy, the second dynamic weight increment $\Delta W_{\mathrm{lin}}$ can be obtained by the following formula: $\Delta W_{\mathrm{lin}} = W_4 W_3\, \mathrm{SiLU}(W_2 W_1 R)$, where R is the correlation matrix, $W_1$, $W_2$, $W_3$, and $W_4$ are the first, second, third, and fourth parameters, respectively, and SiLU is the SiLU activation function.
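The three correlation-based mappings can be sketched in NumPy as follows. The processing order follows steps S62 to S64. Two details are assumptions: each parameter acts by left multiplication, and the shapes are W1, W3 of (r, C) and W2, W4 of (C, r), with a hypothetical rank r. With r < C, each increment has rank at most r, which is the low-rank saving mentioned for the depth strategy.

```python
import numpy as np


def silu(z):
    """SiLU(z) = z * sigmoid(z), using sigmoid(z) = 0.5 * (1 + tanh(z / 2))."""
    return z * 0.5 * (1.0 + np.tanh(0.5 * z))


def linear_delta(r, w1, w2):
    return w2 @ (w1 @ r)                    # Delta W = W_2 W_1 R


def nonlinear_delta(r, w1, w2):
    return silu(w2 @ (w1 @ r))              # Delta W = SiLU(W_2 W_1 R)


def depth_delta(r, w1, w2, w3, w4):
    return w4 @ (w3 @ silu(w2 @ (w1 @ r)))  # Delta W = W_4 W_3 SiLU(W_2 W_1 R)


rng = np.random.default_rng(7)
C, RANK = 8, 2                  # hypothetical channel count and low rank
x = rng.normal(size=(49, C))    # second input feature matrix, (N, C)
corr = x.T @ x                  # correlation matrix R, (C, C)
w1, w3 = rng.normal(scale=0.1, size=(2, RANK, C))
w2, w4 = rng.normal(scale=0.1, size=(2, C, RANK))
for name, dw in [("linear", linear_delta(corr, w1, w2)),
                 ("nonlinear", nonlinear_delta(corr, w1, w2)),
                 ("depth", depth_delta(corr, w1, w2, w3, w4))]:
    print(name, dw.shape)       # each increment is (C, C)
```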
[0087] In conjunction with any of the above embodiments, in one implementation, the present invention also provides an image classification method based on dynamic parameterization. In this method, step S15, "inputting the second input feature matrix into the dynamic linear module to obtain the second dynamic weight increment," specifically includes steps S71 to S74: Step S71: Perform a pooling operation on the second input feature matrix through the pooling layer in the dynamic linear module to obtain a pooled feature matrix.
[0088] In this embodiment, the dynamic linear module includes a pooling layer. After the second input feature matrix is input into the dynamic linear module, the pooling layer performs a pooling operation on it to obtain the pooled feature matrix. In an optional example, the pooling layer is an adaptive average pooling layer, which performs an adaptive average pooling operation on the second input feature matrix to obtain the pooled feature matrix.
[0089] Step S72: Process the transpose of the pooled feature matrix using the first parameter, the SiLU activation function, and the second parameter to obtain the sixth intermediate result.
[0090] In this embodiment, a bilateral activation strategy is proposed to achieve input-dependent weight adjustment. The strategy processes the transpose of the pooled feature matrix and the pooled feature matrix itself through two nonlinear branches. It then multiplies the outputs of the two branches to generate the second dynamic weight increment.
[0091] The dynamic linear module includes a first parameter, a second parameter, a third parameter, and a fourth parameter, all of which are trained parameters of the dynamic linear module. The first nonlinear branch processes the transpose of the pooled feature matrix using the first parameter, the SiLU activation function, and the second parameter to obtain the sixth intermediate result. In a specific example, the transpose of the pooled feature matrix is obtained first and processed with the second parameter to obtain a second result. The SiLU activation function then processes the second result to obtain a third result. Finally, the first parameter processes the third result to obtain the sixth intermediate result.
[0092] Step S73: Process the pooled feature matrix using the third parameter, the SiLU activation function, and the fourth parameter to obtain the seventh intermediate result.
[0093] In this embodiment, the second nonlinear branch processes the pooled feature matrix using the third parameter, the SiLU activation function, and the fourth parameter to obtain the seventh intermediate result. In a specific example, the pooled feature matrix is first processed with the third parameter to obtain a fourth result. The SiLU activation function then processes the fourth result to obtain a fifth result. Finally, the fourth parameter processes the fifth result to obtain the seventh intermediate result.
[0094] Step S74: Based on the sixth intermediate result and the seventh intermediate result, obtain the second dynamic weight increment.
[0095] In this embodiment, the second dynamic weight increment can be obtained based on the processing results of two complementary nonlinear branches (i.e., the sixth intermediate result and the seventh intermediate result). In a specific example, the second dynamic weight increment can be obtained by multiplying the sixth intermediate result and the seventh intermediate result.
[0096] In an optional specific example, with the bilateral activation strategy, the second dynamic weight increment $\Delta W_{\mathrm{lin}}$ can be obtained by the following formula: $\Delta W_{\mathrm{lin}} = \big(W_1\, \mathrm{SiLU}(W_2 X^{\top})\big)\big(W_4\, \mathrm{SiLU}(W_3 X)\big)$. Here X is the pooled feature matrix, $X^{\top}$ is its transpose, $W_1$, $W_2$, $W_3$, and $W_4$ are the first, second, third, and fourth parameters, respectively, and SiLU is the SiLU activation function.
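The bilateral activation strategy can be sketched in NumPy as shown below. The sketch assumes left multiplication and a pooled matrix X of shape (N', C), where adaptive pooling fixes N' (for example, a 4×4 grid gives N'=16). Under these assumptions the shapes work out as follows. The first branch is (C, N') and the second is (N', C), so their product is a C×C increment regardless of the original image size. The rank and all sizes are hypothetical.

```python
import numpy as np


def silu(z):
    """SiLU(z) = z * sigmoid(z), using sigmoid(z) = 0.5 * (1 + tanh(z / 2))."""
    return z * 0.5 * (1.0 + np.tanh(0.5 * z))


def bilateral_delta(x, w1, w2, w3, w4):
    """Delta W = [W_1 SiLU(W_2 X^T)] [W_4 SiLU(W_3 X)].

    x: pooled feature matrix, shape (N', C), with N' fixed by adaptive pooling.
    w1: (C, r), w2: (r, C)   -> first branch, shape (C, N')
    w3: (r, N'), w4: (N', r) -> second branch, shape (N', C)
    """
    left = w1 @ silu(w2 @ x.T)   # first nonlinear branch, on the transpose
    right = w4 @ silu(w3 @ x)    # second nonlinear branch, on X itself
    return left @ right          # (C, C)


rng = np.random.default_rng(8)
N_POOLED, C, R = 16, 8, 4  # hypothetical: 4x4 pooled grid, 8 channels, rank 4
x = rng.normal(size=(N_POOLED, C))
w1 = rng.normal(scale=0.1, size=(C, R))
w2 = rng.normal(scale=0.1, size=(R, C))
w3 = rng.normal(scale=0.1, size=(R, N_POOLED))
w4 = rng.normal(scale=0.1, size=(N_POOLED, R))
delta_w = bilateral_delta(x, w1, w2, w3, w4)
print(delta_w.shape)  # (8, 8)
```

Multiplying the two branches lets the increment depend on how channels co-vary across the pooled positions, rather than on a single averaged vector.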
[0097] In conjunction with any of the above embodiments, in one implementation, this embodiment further reduces the computational load as follows. Before computing the correlation matrix, it applies an adaptive average pooling (AAP) operation to the second input feature matrix X along the spatial dimensions, reducing the spatial size by a factor of 2. This suppresses high-frequency noise while preserving the key global context, which improves accuracy.
[0098] In conjunction with any of the above embodiments, in one implementation, this embodiment designs multiple strategies for determining the dynamic weight increments of both the dynamic linear module and the dynamic depthwise convolution module, as listed in Table 1. Table 1 is a table of strategies for determining dynamic weight increments, in which ΔW denotes the dynamic weight increment.
[0099] Table 1. Strategy for Determining Dynamic Weight Increments
[0100] In conjunction with any of the above embodiments, in one implementation, the present invention also provides an image classification method based on dynamic parameterization. In this method, the target image classification model includes N dynamically parameterized feature processing blocks, where N is an integer greater than 1; and step S16 above specifically includes steps S81 to S84: Step S81: Process the second output feature matrix using the GELU activation function in the dynamically parameterized feature processing block to obtain the third input feature matrix.
[0101] This embodiment proposes a dynamically parameterized feature processing block. The block is a convolutional network structure based on dynamic parameter prediction, that is, a dynamic convolutional architecture with linear complexity. In this embodiment, the first static linear module in the convolutional neural network is replaced with a dynamic linear module, while the second static linear module and the feature extraction module remain unchanged. During model training, the model parameters of the dynamic linear module, the dynamic depthwise convolution module, the second static linear module, and the feature extraction module are updated until training is complete, resulting in the target image classification model. Here, the first static linear module is the static linear module of the convolutional neural network that is replaced with a dynamic linear module. The second static linear module is the static linear module of the convolutional neural network that is not replaced. A dynamically parameterized feature processing block includes a GELU activation function, a dynamic linear module, a dynamic depthwise convolution module, and a second static linear module, where the second static linear module is pre-trained.
[0102] In this embodiment, the first input feature matrix output by the feature extraction module is input into the first dynamically parameterized feature processing block. As described in steps S13 to S15 above, within this block it is processed sequentially by the dynamic depthwise convolution module and the dynamic linear module. This yields the second output feature matrix output by the dynamic linear module. The GELU activation function of the first block then processes the second output feature matrix to obtain the third input feature matrix.
[0103] Step S82: Input the third input feature matrix into the second static linear module to obtain the third output feature matrix.
[0104] In this embodiment, the first dynamically parameterized feature processing block inputs the third input feature matrix into the second static linear module, which processes it to obtain the third output feature matrix.
[0105] Step S83: Based on the second input feature matrix and the third output feature matrix, obtain the output feature matrix of a dynamically parameterized feature processing block, which serves as the input feature matrix of the next dynamically parameterized feature processing block.
[0106] In this embodiment, the N dynamically parameterized feature processing blocks are connected in sequence. The first block concatenates or fuses two matrices to obtain its output feature matrix: the third output feature matrix output by the second static linear module, and the second input feature matrix input to the dynamic linear module. This output feature matrix serves as the input feature matrix of the next block (i.e., the second dynamically parameterized feature processing block), and so on until the output feature matrix of the last block is obtained. The processing in the second to Nth blocks is the same as or similar to that in the first block and is not described again.
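The data flow of one block in steps S13 to S15 and S81 to S83 can be sketched in NumPy for a single image. The sketch makes three assumptions. Each dynamic weight is its static weight plus the predicted increment. The "concatenate or fuse" step is an element-wise addition (a residual connection). The increments are passed in precomputed, in per-image shapes, from any strategy above. The parameter names and sizes are hypothetical.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def depthwise_conv3x3(x, kernel, bias):
    """Zero-padded, stride-1 depthwise 3x3 convolution.

    x: (C, H, W); kernel: (C, 3, 3), one filter per channel; bias: (C,).
    """
    _, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += kernel[:, di, dj, None, None] * xp[:, di:di + h, dj:dj + w]
    return out + bias[:, None, None]


def dynamic_block(x, p, delta_dw, delta_lin):
    """Forward pass of one dynamically parameterized block for a single image.

    x: (C, H, W). delta_dw: (C, 3, 3) and delta_lin: (C_hidden, C) are the first and
    second dynamic weight increments. Returns tokens of shape (H*W, C).
    """
    c = x.shape[0]
    y = depthwise_conv3x3(x, p["dw_w"] + delta_dw, p["dw_b"])  # dynamic depthwise conv
    x2 = y.reshape(c, -1).T                                    # second input matrix, (N, C)
    out2 = x2 @ (p["lin1_w"] + delta_lin).T + p["lin1_b"]      # dynamic linear
    x3 = gelu(out2)                                            # third input matrix
    out3 = x3 @ p["lin2_w"].T + p["lin2_b"]                    # second static linear
    return x2 + out3                                           # fusion, assumed additive


rng = np.random.default_rng(6)
C, HID, H, W = 4, 16, 7, 7  # hypothetical sizes
p = {
    "dw_w": rng.normal(scale=0.1, size=(C, 3, 3)), "dw_b": np.zeros(C),
    "lin1_w": rng.normal(scale=0.1, size=(HID, C)), "lin1_b": np.zeros(HID),
    "lin2_w": rng.normal(scale=0.1, size=(C, HID)), "lin2_b": np.zeros(C),
}
x = rng.normal(size=(C, H, W))
out = dynamic_block(x, p, np.zeros((C, 3, 3)), np.zeros((HID, C)))
print(out.shape)  # (49, 4)
```

To chain N blocks, the (H·W, C) token output would be reshaped back to (C, H, W) before entering the next block.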
[0107] Step S84: Based on the output feature matrix of the last dynamically parameterized feature processing block, obtain the classification result of the image acquired by the UAV.
[0108] In this embodiment, after obtaining the output feature matrix of the last dynamically parameterized feature processing block, the classification result of the image acquired by the UAV can be obtained based on the output feature matrix of the last dynamically parameterized feature processing block. It can be understood that the target image classification model in this embodiment includes: one feature extraction module and N dynamically parameterized feature processing blocks.
[0109] The dynamically parameterized feature processing block proposed in this embodiment compresses global context information into input-dependent dynamic weights. This achieves global visual modeling without explicitly computing attention weights. While maintaining linear computational complexity, the block achieves global modeling capability comparable to the Transformer on the ImageNet-1K classification task and significantly reduces computational overhead. Because it balances accuracy and efficiency, it is suitable for visual tasks such as high-resolution image classification, object detection, and semantic segmentation.
[0110] In a preferred embodiment, the first dynamic weight increment in the dynamic depthwise convolution module is obtained using the spatially adaptive strategy, and the second dynamic weight increment in the dynamic linear module is obtained using the bilateral activation strategy.
[0111] In one embodiment, the structure is as shown in Figure 2, a schematic diagram of a dynamically parameterized feature processing block according to an embodiment of the present invention. The target image classification model comprises N dynamically parameterized feature processing blocks. The input to the first block is the first input feature matrix encoded by the feature extraction module (for example, Embedded Patches). The structure of each block is shown in the dashed box. Each block includes a dynamic depthwise convolution module, a dynamic linear module, an activation function (for example, the GELU activation function), and a static linear module. The dynamic depthwise convolution module predicts the first dynamic weight increment with the spatially adaptive strategy, and the dynamic linear module predicts the second dynamic weight increment with the bilateral activation strategy.
[0112] In one embodiment, the dynamically parameterized feature processing block proposed in this embodiment is compared with representative models based on Transformer and Convolutional Neural Networks on the ImageNet-1K dataset. The results show that the Transformer-based DeiT-S has 22M parameters, 4.6 G FLOPs (floating-point operations), and a top-1 accuracy of 79.8%; the ConvNeXt-S based on Convolutional Neural Networks has 22M parameters, 4.3 G FLOPs, and a top-1 accuracy of 79.7%; and the model based on the dynamically parameterized feature processing block has 28M parameters, 4.4 G FLOPs, and a top-1 accuracy of 81.3%.
[0113] As can be seen, the model based on dynamically parameterized feature processing blocks achieves higher top-1 accuracy with a comparable computational budget. With 28M parameters and 4.4G FLOPs, it reaches 81.3% top-1 accuracy. This exceeds the Transformer-based DeiT-S (79.8%) and the CNN-based ConvNeXt-S (79.7%), although the model uses 6M more parameters than either. These results show that the proposed blocks can effectively combine the adaptive modeling capability of the Transformer with linear-complexity operations, achieving global modeling while maintaining linear time and memory complexity. The model achieves competitive accuracy across different architectural paradigms and is efficient, especially for high-resolution inputs.
[0114] In conjunction with any of the above embodiments, in one implementation, the present invention also provides an image classification method based on dynamic parameterization. In this method, in addition to the steps described above, step S91 may also be included, and step S11 may specifically include step S92: Step S91: Use the first feature processing block of the convolutional neural network as the first feature processing block of the target image classification model.
[0115] In this embodiment, the target image classification model includes N feature processing blocks, where N is an integer greater than 1, and each feature processing block is a convolutional network architecture. The first feature processing block in the convolutional neural network can be used as the first feature processing block in the target image classification model. Each feature processing block in the convolutional neural network is a typical convolutional network structure, such as each typical convolutional network structure including: a static deep convolutional module, a first static linear module, a GELU activation function, and a second static linear module.
[0116] Step S92: Starting from the second feature processing block of the convolutional neural network, every k feature processing blocks, perform the following step: replace the first static linear module and the static deep convolutional module of the feature processing block in the convolutional neural network with the dynamic linear module and the dynamic deep convolutional module, respectively, to obtain the target image classification model.
[0117] In this embodiment, keeping the feature extraction module in the convolutional neural network unchanged, starting from the second feature processing block of the convolutional neural network, every k feature processing blocks (k being an integer greater than 0), the following steps are performed: replacing the first static linear module of the feature processing block in the convolutional neural network with a dynamic linear module, and replacing the static deep convolutional module of the feature processing block with a dynamic deep convolutional module, to obtain the target image classification model. In this embodiment, for the second feature processing block, the (2+k+1)th feature processing block, the (2+k+1+k+1)th feature processing block, the (2+k+1+k+1)th feature processing block, and so on, the following steps are performed respectively: replacing the first static linear module of the feature processing block in the convolutional neural network with a dynamic linear module, and replacing the static deep convolutional module of the feature processing block with a dynamic deep convolutional module, to obtain the target image classification model.
[0118] In other words, this embodiment keeps the feature extraction module of the convolutional neural network unchanged, replaces the second feature processing block of the convolutional neural network with a dynamically parameterized feature processing block to be trained, and then replaces every (k+1)th block after it in the same way. The dynamically parameterized feature processing block to be trained includes a dynamic depthwise convolution module, a dynamic linear module, a GELU activation function, and a second static linear module. For example, if k=2, dynamic parameterization is applied once every three feature processing blocks starting from the second one, and the remaining blocks stay static: in the target image classification model to be trained, the first feature processing block is a feature processing block of the convolutional neural network (i.e., a normal convolutional network structure), the second is a dynamically parameterized feature processing block to be trained, the third and fourth are feature processing blocks of the convolutional neural network, the fifth is a dynamically parameterized feature processing block to be trained, and so on. The parameters of this target image classification model to be trained are then updated by training to obtain the target image classification model (i.e., the trained target image classification model).
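The replacement schedule of paragraphs [0116] to [0118] reduces to a small index calculation. This sketch reads "every k feature processing blocks" the way the k=2 example does (k unchanged blocks between consecutive dynamic blocks) and uses 1-based block numbers; the function names are hypothetical.

```python
def dynamic_block_indices(n_blocks, k):
    """1-based numbers of the blocks whose modules are made dynamic: 2, 2+(k+1), 2+2(k+1), ..."""
    if n_blocks <= 1 or k <= 0:
        raise ValueError("requires n_blocks > 1 and k > 0")
    return list(range(2, n_blocks + 1, k + 1))


def block_layout(n_blocks, k):
    """Label every block as 'dynamic' or 'static' according to the schedule."""
    dynamic = set(dynamic_block_indices(n_blocks, k))
    return ["dynamic" if i in dynamic else "static" for i in range(1, n_blocks + 1)]


print(block_layout(6, 2))
# ['static', 'dynamic', 'static', 'static', 'dynamic', 'static']
print(dynamic_block_indices(12, 2))  # [2, 5, 8, 11]
```

With k=2 and 12 blocks, 4 of the 12 blocks (one-third) become dynamic, which is consistent with the one-third setting described for the Table 2 ablation.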
[0119] In this embodiment, dynamic parameterization is integrated selectively: the second static linear module and some of the feature processing blocks remain static, which balances modeling capability against computational efficiency and keeps the parameter overhead under control while preserving image classification performance. Importantly, the dynamic parameters in this embodiment are predicted from global feature statistics, which compresses the global context into input-conditioned weights and achieves implicit global modeling through a plain forward pass. Because parameter generation is decoupled from the sequence length, the dynamically parameterized feature processing block maintains time and memory complexity that is strictly linear in the number of tokens and scales effectively to high-resolution inputs.
[0120] In one embodiment, as shown in Table 2, several strategies were designed for predicting the dynamic weight increments of the dynamic linear module and the dynamic depthwise convolution module. Table 2 presents ablation results for the dynamic weight prediction strategies of the linear and depthwise convolution modules. The baseline model uses static weights. To improve efficiency, dynamic parameterization was applied to only one-third of the modules, while the remaining layers stayed static. "Dynamic linear module 1+2" indicates that dynamic parameterization was applied to both the first and the second static linear modules. For the linear module, Table 2 reports the ImageNet-1K performance of the different dynamic weight prediction strategies: the bilateral activation strategy applied to the first linear module achieved the best accuracy-efficiency balance, reaching a top-1 accuracy of 76.4% with moderate parameter and floating-point operation (FLOP) overhead. Extending it to both linear modules further improved accuracy (76.7%), but at a higher cost and lower throughput, indicating diminishing returns when dynamic parameters are extended to further layers. For the depthwise convolution module, Table 2 likewise reports ImageNet-1K performance: the amplitude-direction decoupling strategy and the convolution strategy offer a slight accuracy improvement (0.2%), but have more parameters, higher FLOPs, and lower throughput. In contrast, the spatial adaptation strategy preserves spatial structure when predicting the convolution weights while remaining computationally efficient, providing the best practical compromise. Therefore, this embodiment combines the most effective and efficient strategies for the dynamic linear module and the dynamic depthwise convolution module.
Specifically, a bilateral activation strategy is applied to the first linear module (dynamic linear module), and a spatial adaptation strategy is applied to the dynamic depthwise convolution module, corresponding to "Dynamic Linear Module 1 + Dynamic Depthwise Convolution Module" in Table 2. This achieves the highest overall accuracy (76.8%) with only a moderate increase in computational cost, making it the most efficient and effective strategy in this study. The static convolutional neural network in Table 2 refers to a regular convolutional neural network (i.e., the convolutional neural network in the above embodiment).
[0121] Table 2 Ablation experimental data for dynamic weight prediction strategies of linear modules and depthwise convolutional modules.
[0122] In one embodiment, as shown in Figure 3, which is a schematic diagram comparing the computational efficiency of the target image classification model with that of the DeiT-T model according to an embodiment of the present invention, the throughput and per-image GPU memory usage of the two models were measured on an RTX 3090. Thanks to the linear time and memory complexity of the dynamically parameterized feature processing blocks, the target image classification model scales smoothly as the resolution increases. At a resolution of 1248×1248 (6084 tokens), the target image classification model achieved a 7.7× increase in throughput and a 91% reduction in memory usage compared with DeiT-T, demonstrating that the proposed target image classification model is suitable for high-resolution vision tasks.
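The token count quoted for 1248×1248 is consistent with 16×16 patches, the patch size DeiT-T uses: (1248/16)² = 78² = 6084. The short calculation below assumes that patch size and takes 224×224 (the usual ImageNet input) as the baseline. It counts only the entries of one self-attention score matrix (per head, per layer) to show why a quadratic mechanism becomes expensive at this resolution while a cost proportional to the token count grows far more slowly. It does not reproduce the measured 7.7× and 91% figures, which come from Figure 3.

```python
def num_tokens(resolution, patch=16):
    """Number of non-overlapping patch tokens for a square image (patch size is an assumption)."""
    if resolution % patch:
        raise ValueError("resolution must be a multiple of the patch size")
    return (resolution // patch) ** 2


base, high = num_tokens(224), num_tokens(1248)
print(base, high)                 # 196 6084
print(high * high)                # 37015056 attention-score entries per head and layer
print(round(high / base, 1))      # 31.0  growth of a cost linear in the token count
print(round((high / base) ** 2))  # 964   growth of a cost quadratic in the token count
```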
[0123] In one embodiment, as shown in Figure 4, which is a schematic diagram comparing the dynamic linear module and the dynamic depthwise convolution module according to an embodiment of the present invention, the relative strength of the dynamic parameters exhibits a clear depth dependence across the network. For the dynamic linear module, the ratio r remains stable at nearly 1 at all depths, indicating that the channel-mixing transformation is consistently modulated by input-conditioned updates. In contrast, the dynamic depthwise convolution module shows a markedly larger ratio in deeper layers, suggesting that the strength of the spatially adaptive transformation increases at higher semantic levels. This indicates that the dynamic depthwise convolution module plays a more important role in deep feature extraction, while the dynamic linear module provides a stable, global channel-level modulation mechanism. Overall, the two modules are complementary and act synergistically along the depth of the network.
[0124] In summary, this invention re-examines the design philosophy of the Transformer model, which centers on the self-attention mechanism, and interprets self-attention as a dynamic MLP with input-conditioned parameters. Based on this perspective, this invention proposes a convolutional architecture with dynamically parameterized feature processing blocks, which achieves global modeling by dynamically predicting linear and depthwise convolution weights. This design keeps the complexity linear in the input size while implicitly integrating the global context, thus avoiding explicit attention. Extensive experiments on the ImageNet-1K dataset demonstrate that the dynamically parameterized feature processing blocks achieve performance competitive with, or even superior to, state-of-the-art convolutional neural networks and vision Transformer models, while reducing computational cost and memory footprint. Beyond these practical efficiency advantages, this invention also emphasizes that dynamic parameterization offers a promising and broadly applicable approach for building scalable and highly expressive architectures.
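The reinterpretation above, self-attention as an MLP whose weights are generated from the input, can be checked numerically. The NumPy sketch below (single head, no masking; projection matrices and function names are illustrative) computes standard scaled dot-product attention, and then the same output as a two-step "dynamic linear layer": first generate an N×N token-mixing weight from the input, then apply it as an ordinary matrix product. The generated weight grows with the token count N, which is the quadratic cost the invention avoids by predicting fixed-size (channel- or kernel-shaped) weights from global statistics instead.

```python
import numpy as np


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Standard single-head scaled dot-product attention; x: (N, C)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def attention_as_dynamic_linear(x, wq, wk, wv):
    """Same computation, phrased as a linear map whose weight depends on x."""
    mixing_weight = softmax((x @ wq) @ (x @ wk).T / np.sqrt(wq.shape[-1]))  # (N, N)
    return mixing_weight @ (x @ wv), mixing_weight.shape


rng = np.random.default_rng(0)
C, d = 8, 4
wq, wk, wv = (rng.normal(size=(C, d)) for _ in range(3))
for n in (16, 64):
    x = rng.normal(size=(n, C))
    out, shape = attention_as_dynamic_linear(x, wq, wk, wv)
    print(n, shape, np.allclose(out, self_attention(x, wq, wk, wv)))
# 16 (16, 16) True
# 64 (64, 64) True
```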
[0125] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of the present invention are not limited to the described order of actions, because according to the embodiments of the present invention, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily essential to the embodiments of the present invention.
[0126] The terms "first," "second," etc., used in the specification and claims of this invention are used to distinguish similar objects and are not used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of the invention can be implemented in orders other than those illustrated or described herein. Furthermore, in the specification and claims, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.
[0127] Based on the same inventive concept, another embodiment of the present invention provides an electronic device, as shown in Figure 5, which is a schematic diagram of an electronic device according to an embodiment of the present invention. The electronic device includes a memory, a processor, and a program or instructions stored in the memory and executable on the processor. When the program or instructions are executed by the processor, they implement the steps of the image classification method based on dynamic parameterization described in any of the above embodiments of the present invention.
[0128] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. Furthermore, it should be noted that the scope of the methods and apparatuses in the embodiments of the present invention is not limited to performing functions in the order shown or discussed, but may also include performing functions substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
[0129] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of the present invention.
[0130] The embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of the present invention without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of the present invention.
Claims
1. An image classification method based on dynamic parameterization, characterized in that the method includes: obtaining a target image classification model by replacing the first static linear module and the static depthwise convolution module in a convolutional neural network with a dynamic linear module and a dynamic depthwise convolution module, respectively; inputting an image captured by an unmanned aerial vehicle (UAV) into the feature extraction module of the target image classification model to obtain a first input feature matrix; inputting the first input feature matrix into the dynamic depthwise convolution module to obtain a first dynamic weight increment, obtaining a first dynamic weight based on the first dynamic weight increment and the static weight of the dynamic depthwise convolution module, and processing the first input feature matrix based on the first dynamic weight and a first bias of the dynamic depthwise convolution module to obtain a first output feature matrix; obtaining a second input feature matrix based on the first input feature matrix and the first output feature matrix; inputting the second input feature matrix into the dynamic linear module to obtain a second dynamic weight increment, obtaining a second dynamic weight based on the second dynamic weight increment and the static weight of the dynamic linear module, and processing the second input feature matrix based on the second dynamic weight and a second bias of the dynamic linear module to obtain a second output feature matrix; and obtaining, based on the second output feature matrix, a classification result for the image captured by the UAV.
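The data flow of claim 1 can be sketched in NumPy as below. The sketch assumes channels-last (H, W, C) features, that each dynamic weight is the static weight plus its predicted increment, and that the second input feature matrix is the residual sum of the first input and first output feature matrices (the claim itself only says "based on"). The increment predictors are passed in as functions, so any of the strategies of claims 2 to 6 could be plugged in; the toy predictors used here, and all function names, are hypothetical. The classification step that follows the second output feature matrix is omitted.

```python
import numpy as np


def depthwise_conv2d(x, w, b):
    """Stride-1 'same' depthwise convolution; x: (H, W, C), w: (K, K, C), b: (C,)."""
    k = w.shape[0]
    pad = k // 2
    h, wd, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + h, j:j + wd, :] * w[i, j, :]
    return out + b


def dynamic_forward(x1, p, predict_dw_delta, predict_lin_delta):
    """Claim-1 style pass: dynamic depthwise conv, then dynamic linear."""
    w_dw = p["dw_w"] + predict_dw_delta(x1)        # first dynamic weight = static + increment
    y1 = depthwise_conv2d(x1, w_dw, p["dw_b"])     # first output feature matrix
    x2 = x1 + y1                                   # assumed: residual combination
    w_lin = p["lin_w"] + predict_lin_delta(x2)     # second dynamic weight = static + increment
    y2 = x2 @ w_lin + p["lin_b"]                   # second output feature matrix
    return x2, y2


rng = np.random.default_rng(0)
H, W, C, K, D = 8, 8, 4, 3, 16
p = {"dw_w": rng.normal(size=(K, K, C)), "dw_b": np.zeros(C),
     "lin_w": rng.normal(size=(C, D)), "lin_b": np.zeros(D)}


def toy_dw_delta(x):
    # hypothetical predictor: per-channel global mean broadcast over the kernel window
    return 0.1 * np.broadcast_to(np.tanh(x.mean(axis=(0, 1))), (K, K, C))


def toy_lin_delta(x):
    # hypothetical predictor: per-channel global mean spread over the output channels
    return 0.1 * np.outer(np.tanh(x.mean(axis=(0, 1))), np.ones(D))


x1 = rng.normal(size=(H, W, C))
x2, y2 = dynamic_forward(x1, p, toy_dw_delta, toy_lin_delta)
print(y2.shape)  # (8, 8, 16)
```

With predictors that return zero, the pass reduces to the static depthwise convolution followed by the static linear module, which makes the "static weight plus increment" design easy to verify.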
2. The image classification method based on dynamic parameterization according to claim 1, characterized in that inputting the first input feature matrix into the dynamic depthwise convolution module to obtain the first dynamic weight increment includes: performing adaptive average pooling on the first input feature matrix through the pooling layer of the dynamic depthwise convolution module to obtain a downsampled feature matrix; convolving the downsampled feature matrix through the first convolutional layer of a first multilayer perceptron in the dynamic depthwise convolution module to obtain a first convolution result; processing the first convolution result with the GELU activation function of the first multilayer perceptron to obtain a first intermediate result; and convolving the first intermediate result through the second convolutional layer of the first multilayer perceptron to obtain the first dynamic weight increment, wherein both the first and the second convolutional layers use 1×1 convolution kernels.
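Claim 2's increment predictor (the spatial adaptation strategy in Table 2) can be sketched as follows. Assumptions not stated in the claim: the adaptive average pooling outputs a K×K grid equal to the depthwise kernel size, so the increment has the kernel's (K, K, C) shape; the 1×1 convolutions are per-position matrix products over channels with a hidden width `hidden`; GELU uses the tanh approximation. Pooling bins follow the usual adaptive-pooling rule (floor start, ceiling end). All names are hypothetical.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def adaptive_avg_pool2d(x, out_size):
    """Average (H, W, C) into (out_size, out_size, C) bins [floor(i*H/s), ceil((i+1)*H/s))."""
    h, w, c = x.shape
    out = np.empty((out_size, out_size, c))
    for i in range(out_size):
        r0, r1 = (i * h) // out_size, -((-(i + 1) * h) // out_size)   # floor, ceil
        for j in range(out_size):
            c0, c1 = (j * w) // out_size, -((-(j + 1) * w) // out_size)
            out[i, j] = x[r0:r1, c0:c1].mean(axis=(0, 1))
    return out


def spatial_adaptive_delta(x, w_a, b_a, w_b, b_b, k):
    """First dynamic weight increment, shaped like the (k, k, C) depthwise kernel."""
    z = adaptive_avg_pool2d(x, k)        # downsampled feature matrix, (k, k, C)
    z = z @ w_a + b_a                    # first 1x1 conv: C -> hidden
    z = gelu(z)                          # first intermediate result
    return z @ w_b + b_b                 # second 1x1 conv: hidden -> C


rng = np.random.default_rng(0)
C, K, hidden = 8, 7, 16
w_a, b_a = 0.1 * rng.normal(size=(C, hidden)), np.zeros(hidden)
w_b, b_b = 0.1 * rng.normal(size=(hidden, C)), np.zeros(C)
for size in (28, 56):
    delta = spatial_adaptive_delta(rng.normal(size=(size, size, C)), w_a, b_a, w_b, b_b, K)
    print(size, delta.shape)  # (7, 7, 8) for both input sizes
```

Because the MLP only sees a K×K summary, its cost beyond the single pooling pass does not depend on the input resolution.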
3. The image classification method based on dynamic parameterization according to claim 1, characterized in that inputting the first input feature matrix into the dynamic depthwise convolution module to obtain the first dynamic weight increment includes: performing adaptive average pooling on the first input feature matrix through the pooling layer of the dynamic depthwise convolution module to obtain a downsampled feature matrix; processing the first input feature matrix using a Sigmoid activation function, a global average pooling layer, and preset parameters to obtain an activation intermediate processing result; processing the downsampled feature matrix with a second multilayer perceptron in the dynamic depthwise convolution module to obtain an intermediate processing result; and obtaining the first dynamic weight increment based on the intermediate processing result and the activation intermediate processing result.
4. The image classification method based on dynamic parameterization according to claim 1, characterized in that inputting the first input feature matrix into the dynamic depthwise convolution module to obtain the first dynamic weight increment includes: processing the first input feature matrix with a convolutional network in the dynamic depthwise convolution module to obtain an intermediate convolution result, the convolutional network including two 3×3 convolutional layers and a GELU activation function; and performing adaptive average pooling on the intermediate convolution result through the adaptive average pooling layer of the dynamic depthwise convolution module to obtain the first dynamic weight increment.
5. The image classification method based on dynamic parameterization according to claim 1, characterized in that inputting the second input feature matrix into the dynamic linear module to obtain the second dynamic weight increment includes: obtaining a correlation matrix by multiplying the transpose of the second input feature matrix by the second input feature matrix; processing the correlation matrix with a first parameter to obtain a third intermediate result, and processing the third intermediate result with a second parameter to obtain a fourth intermediate result; processing the fourth intermediate result with a SiLU activation function to obtain a fifth intermediate result; and processing the fifth intermediate result with a third parameter and a fourth parameter to obtain the second dynamic weight increment; or determining the fourth intermediate result or the fifth intermediate result as the second dynamic weight increment.
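Claim 5 builds the increment from a correlation matrix, which can be sketched as below. The shape choices are assumptions: the second input feature matrix is flattened to N tokens × C channels, the correlation matrix is XᵀX divided by N, the first and second parameters form a low-rank C→r→D map, and the third and fourth parameters are a D→r→D map applied to the fifth intermediate result. The hypothetical `mode` argument selects the claim's alternative of using the fourth or fifth intermediate result directly. Forming XᵀX costs O(N·C²), linear in the token count, and everything after it is independent of N, matching the decoupling of parameter generation from sequence length described in paragraph [0119].

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def correlation_delta(x2, p1, p2, p3=None, p4=None, mode="full"):
    """Second dynamic weight increment from the channel correlation of x2 (N tokens x C)."""
    g = x2.T @ x2 / x2.shape[0]   # (C, C) correlation matrix; dividing by N is an assumption
    t3 = g @ p1                   # third intermediate result, (C, r)
    t4 = t3 @ p2                  # fourth intermediate result, (C, D)
    if mode == "fourth":
        return t4
    t5 = silu(t4)                 # fifth intermediate result
    if mode == "fifth":
        return t5
    return t5 @ p3 @ p4           # third and fourth parameters, back to (C, D)


rng = np.random.default_rng(0)
C, r, D = 8, 4, 32
p1, p2 = 0.1 * rng.normal(size=(C, r)), 0.1 * rng.normal(size=(r, D))
p3, p4 = 0.1 * rng.normal(size=(D, r)), 0.1 * rng.normal(size=(r, D))
for n_tokens in (196, 6084):
    x2 = rng.normal(size=(n_tokens, C))
    print(n_tokens, correlation_delta(x2, p1, p2, p3, p4).shape)  # (8, 32) both times
```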
6. The image classification method based on dynamic parameterization according to claim 1, characterized in that inputting the second input feature matrix into the dynamic linear module to obtain the second dynamic weight increment includes: pooling the second input feature matrix through the pooling layer of the dynamic linear module to obtain a pooled feature matrix; processing the transpose of the pooled feature matrix using a first parameter, a SiLU activation function, and a second parameter to obtain a sixth intermediate result; processing the pooled feature matrix using a third parameter, the SiLU activation function, and a fourth parameter to obtain a seventh intermediate result; and obtaining the second dynamic weight increment based on the sixth intermediate result and the seventh intermediate result.
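One possible reading of claim 6 (the bilateral activation strategy in Table 2) is sketched below; every shape choice is an assumption. The pooled feature matrix P has m = s·s pooled positions × C channels; the transposed branch SiLU(PᵀA₁)A₂ yields a C × m sixth intermediate result, the direct branch SiLU(PB₁)B₂ yields an m × D seventh intermediate result, and the two are combined by a matrix product over the pooled positions (scaled by 1/m) to give a C × D increment. The claim says only that the increment is obtained "based on" the two results, so this combination, the parameter names `a1`, `a2`, `b1`, `b2`, and the uniform-bin pooling are all hypothetical.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def avg_pool_tokens(x, s):
    """Average an (H, W, C) map over an s x s grid and flatten to (s*s, C); H, W divisible by s."""
    h, w, c = x.shape
    if h % s or w % s:
        raise ValueError("H and W must be divisible by s in this sketch")
    return x.reshape(s, h // s, s, w // s, c).mean(axis=(1, 3)).reshape(s * s, c)


def bilateral_delta(x2, s, a1, a2, b1, b2):
    """Second dynamic weight increment (C, D) from two activated branches."""
    p = avg_pool_tokens(x2, s)             # pooled feature matrix, (m, C)
    sixth = silu(p.T @ a1) @ a2            # transposed branch: (C, m) -> (C, m)
    seventh = silu(p @ b1) @ b2            # direct branch: (m, C) -> (m, D)
    return sixth @ seventh / p.shape[0]    # assumed combination over pooled positions


rng = np.random.default_rng(0)
C, D, s, hidden = 8, 32, 4, 16
m = s * s
a1, a2 = 0.1 * rng.normal(size=(m, hidden)), 0.1 * rng.normal(size=(hidden, m))
b1, b2 = 0.1 * rng.normal(size=(C, hidden)), 0.1 * rng.normal(size=(hidden, D))
for size in (56, 112):
    delta = bilateral_delta(rng.normal(size=(size, size, C)), s, a1, a2, b1, b2)
    print(size, delta.shape)  # (8, 32) for both input sizes
```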
7. The image classification method based on dynamic parameterization according to any one of claims 1 to 6, characterized in that the target image classification model includes N dynamically parameterized feature processing blocks, N being an integer greater than 1; apart from replacing the first static linear module of the convolutional neural network with the dynamic linear module, the second static linear module and the feature extraction module of the convolutional neural network remain unchanged; a dynamically parameterized feature processing block includes a GELU activation function, the dynamic linear module, the dynamic depthwise convolution module, and the second static linear module; and obtaining, based on the second output feature matrix, the classification result for the image captured by the UAV includes: processing the second output feature matrix with the GELU activation function of the dynamically parameterized feature processing block to obtain a third input feature matrix; inputting the third input feature matrix into the second static linear module to obtain a third output feature matrix; obtaining, based on the second input feature matrix and the third output feature matrix, the output feature matrix of the dynamically parameterized feature processing block, which serves as the input feature matrix of the next dynamically parameterized feature processing block; and obtaining the classification result for the image captured by the UAV based on the output feature matrix of the last dynamically parameterized feature processing block.
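The remainder of the block in claim 7, and the final classification step, can be sketched as below. Assumptions: features are flattened to N tokens × C channels, the block output is the residual sum of the second input feature matrix and the third output feature matrix (the claim says only "based on"), and the classifier is a global average over tokens followed by a linear layer, a common choice that the claims do not specify. Function names and shapes are hypothetical; in the full model, Y2 would come from the dynamic linear module of claim 1.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def block_tail(x2, y2, w_s2, b_s2):
    """Finish one dynamically parameterized block; x2: (N, C), y2: (N, D)."""
    x3 = gelu(y2)              # third input feature matrix
    y3 = x3 @ w_s2 + b_s2      # second static linear module: D -> C (third output)
    return x2 + y3             # assumed residual; feeds the next block


def classify(features, w_head, b_head):
    """Assumed head: global average over tokens, then a linear classifier."""
    logits = features.mean(axis=0) @ w_head + b_head
    return int(np.argmax(logits))


rng = np.random.default_rng(0)
N, C, D, num_classes = 49, 8, 32, 10
x2 = rng.normal(size=(N, C))
y2 = rng.normal(size=(N, D))           # stand-in for the dynamic linear output
w_s2, b_s2 = 0.1 * rng.normal(size=(D, C)), np.zeros(C)
block_out = block_tail(x2, y2, w_s2, b_s2)
label = classify(block_out, rng.normal(size=(C, num_classes)), np.zeros(num_classes))
print(block_out.shape, 0 <= label < num_classes)  # (49, 8) True
```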
8. The image classification method based on dynamic parameterization according to any one of claims 1 to 6, characterized in that the target image classification model includes N feature processing blocks, and the method further includes: using the first feature processing block of the convolutional neural network as the first feature processing block of the target image classification model; and obtaining the target image classification model by replacing the first static linear module and the static depthwise convolution module in the convolutional neural network with the dynamic linear module and the dynamic depthwise convolution module, respectively, includes: starting from the second feature processing block of the convolutional neural network, and every k feature processing blocks thereafter, replacing the first static linear module and the static depthwise convolution module of the feature processing block with the dynamic linear module and the dynamic depthwise convolution module, respectively, to obtain the target image classification model, k being an integer greater than 0.
9. An electronic device, characterized in that it includes a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the image classification method based on dynamic parameterization according to any one of claims 1 to 8.