Method and apparatus for image processing and model training

By inserting a channel tuning module into the transformer layer of the pre-trained model and training only the parameters of the channel tuning module, the overfitting problem of the pre-trained model on small datasets is solved, and the accuracy of the image processing model is improved.

CN115908969B · Active · Publication Date: 2026-06-26 · ALIBABA (CHINA) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ALIBABA (CHINA) CO LTD
Filing Date
2022-11-01
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In domains with limited data, using large-scale pre-trained models for image processing tasks can easily lead to overfitting, resulting in low model accuracy.

Method used

A channel tuning module is inserted into the transformer layer of the pre-trained model. The module only transforms the features of the target channel in the intermediate feature map. The parameters of the channel tuning module are trained using a small dataset, while keeping the original parameters of the pre-trained model unchanged.

Benefits of technology

This reduces the number of trainable parameters, avoids overfitting, and improves the accuracy of image processing models in specific tasks.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides an image processing and model training method and device. When a pre-trained model is applied to a specific image processing task, an image processing model is obtained by inserting a channel tuning module into a transformer layer of the pre-trained model. The channel tuning module transforms the features of at least one target channel in the intermediate feature map extracted by the transformer layer. During model training, channels containing richer features in the intermediate feature map are selected as target channels based on the dataset of the current image processing task, and the parameters of the channel tuning module are trained using that dataset while the original parameters of the pre-trained model remain unchanged. This greatly reduces the number of trainable parameters, helps prevent overfitting, and improves the accuracy of the image processing model.

Description

Technical Field

[0001] This application relates to computer technology, and more particularly to a method and apparatus for image processing and model training.

Background Technology

[0002] Large-scale pre-trained models perform exceptionally well on various computer vision tasks, such as image classification, image segmentation, and image detection. Pre-training on large public datasets enables these models to learn rich visual representations, exhibiting robustness in both low- and high-level visual representations. These models can then be used in downstream image processing tasks to improve their performance.

[0003] When a pre-trained model is applied to an image processing task, it is often necessary to train all of its parameters on a large-scale labeled dataset so that the trained model is suitable for the specific task.

[0004] However, in some fields (such as medicine and remote sensing), there is a lack of available data, or the available datasets are small due to data sensitivity. For image processing tasks in these fields, training a pre-trained model with a small dataset can easily lead to overfitting, resulting in low model accuracy when applied to image processing tasks.

Summary of the Invention

[0005] This application provides a method and apparatus for image processing and model training, which solves the problem that when training a pre-trained model using a dataset in a specific image processing task, overfitting can easily occur, resulting in low accuracy of the model when applied to the image processing task.

[0006] In a first aspect, this application provides an image processing model training method, including:

[0007] A dataset for the image processing task and an image processing model to be trained are obtained, wherein the image processing model is obtained by inserting a channel tuning module into the transformer layer of a pre-trained model.

[0008] The sample images in the dataset are input into the image processing model. The transformer layer extracts features from the sample images. The channel tuning module inserted in the transformer layer transforms the features of at least one target channel in the intermediate feature map extracted by the transformer layer.

[0009] The image processing result is determined based on the features finally output by the transformer layer;

[0010] Based on the image processing results and the annotation information of the sample images, the parameters of the channel tuning module in the image processing model are trained to obtain a trained image processing model. The trained image processing model is used to process the input image to obtain the image processing results.

[0011] In a second aspect, this application provides a training method for a remote sensing image processing model, comprising:

[0012] Receive a remote sensing image dataset sent by a user device, wherein the remote sensing image dataset contains multiple remote sensing images and annotation information of the remote sensing images;

[0013] The image processing model to be trained is obtained by inserting a channel tuning module into the transformer layer of a pre-trained model.

[0014] The remote sensing image is input into the image processing model, and features are extracted from the remote sensing image through the transformer layer. The channel tuning module inserted in the transformer layer transforms the features of at least one target channel in the intermediate feature map extracted by the transformer layer.

[0015] Based on the features finally output by the transformer layer, the image processing result of the remote sensing image is determined;

[0016] Based on the image processing results and annotation information of the remote sensing image, the parameters of the channel tuning module in the image processing model are trained to obtain the trained image processing model;

[0017] The model parameters of the trained image processing model are output to the user device.

[0018] In a third aspect, this application provides an image processing method, including:

[0019] Obtain the image to be processed;

[0020] The image is input into a trained image processing model, and features are extracted from the image through the transformer layer of the image processing model. The channel tuning module inserted in the transformer layer transforms the features of at least one target channel in the intermediate feature map extracted by the transformer layer.

[0021] The image processing result of the image is determined based on the features finally output by the transformer layer;

[0022] Output the image processing results.

[0023] In a fourth aspect, this application provides an image processing model training system, comprising:

[0024] The edge device is used to construct a dataset for the image processing task and send the dataset for the image processing task to the cloud-side device;

[0025] The cloud-side device is used to receive the dataset for the image processing task and obtain an image processing model to be trained, wherein the image processing model is obtained by inserting a channel tuning module into the transformer layer of a pre-trained model. Sample images from the dataset are input into the image processing model, and the transformer layer extracts features from the sample images. The channel tuning module inserted in the transformer layer transforms the features of at least one target channel in the intermediate feature map extracted by the transformer layer. The image processing result is determined based on the features finally output by the transformer layer. The parameters of the channel tuning module in the image processing model are trained based on the image processing result and the annotation information of the sample images to obtain a trained image processing model.

[0026] The cloud-side device is also used to send the model parameters of the trained image processing model to the edge device.

[0027] In a fifth aspect, this application provides an image processing model training apparatus, comprising:

[0028] The data acquisition unit is used to acquire the dataset for the image processing task and the image processing model to be trained. The image processing model is obtained by inserting a channel tuning module into the transformer layer of the pre-trained model.

[0029] An image processing unit is configured to input sample images from the dataset into the image processing model, extract features from the sample images through the transformer layer, and transform the features of at least one target channel in the intermediate feature map extracted by the transformer layer through a channel tuning module inserted in the transformer layer; and determine the image processing result based on the features finally output by the transformer layer.

[0030] The parameter training unit is used to train the parameters of the channel tuning module in the image processing model based on the image processing results and the annotation information of the sample images, so as to obtain a trained image processing model. The trained image processing model is used to process the input image to obtain the image processing results.

[0031] In a sixth aspect, this application provides an image processing apparatus, comprising:

[0032] The image acquisition unit is used to acquire the image to be processed.

[0033] An image processing unit is configured to input the image into a trained image processing model, extract features from the image through a transformer layer of the image processing model, and transform the features of at least one target channel in the intermediate feature map extracted by the transformer layer through a channel tuning module inserted in the transformer layer; and determine the image processing result of the image based on the features finally output by the transformer layer.

[0034] The processing result output unit is used to output the image processing result.

[0035] In a seventh aspect, this application provides an electronic device, including: a processor, and a memory communicatively connected to the processor;

[0036] The memory stores computer-executable instructions;

[0037] The processor executes the computer-executable instructions stored in the memory to implement the method described in any of the above aspects.

[0038] In an eighth aspect, this application provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the method described in any of the preceding aspects.

[0039] In a ninth aspect, this application provides a computer program product, including a computer program that, when executed by a processor, implements the method described in any of the above aspects.

[0040] The image processing and model training method and apparatus provided in this application obtain an image processing model to be trained by inserting a channel tuning module into the transformer layer of a pre-trained model. The channel tuning module is used to transform the features of at least one target channel in the intermediate feature map extracted by the transformer layer. When transferring the pre-trained model to a specific downstream image processing task, the parameters of the channel tuning module in the image processing model are trained using the dataset of the specific image processing task, while keeping the original parameters of the pre-trained model unchanged. This can greatly reduce the number of trainable parameters, avoid overfitting of large-scale models to small training sets, and thus improve the accuracy of the model when applied to specific image processing tasks.

Description of the Drawings

[0041] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0042] Figure 1 is a schematic diagram of an example network architecture to which this application applies;

[0043] Figure 2 is a schematic diagram of an example system architecture to which this application applies;

[0044] Figure 3 is a flowchart of an image processing model training method provided in an exemplary embodiment of this application;

[0045] Figure 4 is a flowchart for transforming features of a target channel in an intermediate feature map, provided in an exemplary embodiment of this application;

[0046] Figure 5 is a framework diagram for transforming features of a target channel in an intermediate feature map, provided in an exemplary embodiment of this application;

[0047] Figure 6 is an example diagram showing the insertion position of the channel tuning module provided in an exemplary embodiment of this application;

[0048] Figure 7 is an example diagram showing the insertion position of the channel tuning module provided in another exemplary embodiment of this application;

[0049] Figure 8 is a flowchart of an image processing model training method provided in another exemplary embodiment of this application;

[0050] Figure 9 is a flowchart of a training method for a remote sensing image processing model provided in an exemplary embodiment of this application;

[0051] Figure 10 is a flowchart of an image processing method provided in an exemplary embodiment of this application;

[0052] Figure 11 is a schematic diagram of an image processing model training system provided in an exemplary embodiment of this application;

[0053] Figure 12 is a schematic structural diagram of an image processing model training apparatus provided in an exemplary embodiment of this application;

[0054] Figure 13 is a schematic structural diagram of an image processing apparatus provided in an exemplary embodiment of this application;

[0055] Figure 14 is a schematic structural diagram of an electronic device provided in an exemplary embodiment of this application.

[0056] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.

Detailed Description

[0057] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0058] First, the terms used in this application are explained:

[0059] Transformer layer: refers to the transformer layer in a neural network model. Each transformer layer contains two sub-layers: a first sub-layer and a second sub-layer. The output of the first sub-layer serves as the input to the second sub-layer. Taking the Vision Transformer (ViT) model as an example, the backbone network of the ViT model contains 12 transformer layers, and each transformer layer contains two sub-layers.

[0060] For image processing tasks with small datasets, training all parameters of a pre-trained model on the small dataset is prone to overfitting, leading to low model accuracy when the model is applied to the task. This application provides an image processing model training method that generates an image processing model applicable to downstream image processing tasks by inserting a channel tuning module into the transformer layer of the pre-trained model. The channel tuning module transforms the features of at least one target channel in the intermediate feature map extracted by the transformer layer. During model training, the dataset for the image processing task is acquired, and sample images from the dataset are input into the image processing model to obtain the image processing result. Specifically, features are extracted from the sample images through the transformer layer, and the channel tuning module inserted in the transformer layer transforms the features of at least one target channel in the intermediate feature map extracted by the transformer layer; the image processing result is determined based on the features finally output by the transformer layer. The parameters of the channel tuning module in the image processing model are then trained based on the image processing result and the annotation information of the sample images to obtain the trained image processing model. During training, the original parameters of the pre-trained model remain unchanged; only the parameters of the inserted channel tuning module are updated. Compared to training all the parameters of the pre-trained model, this significantly reduces the number of trainable parameters, which addresses the overfitting that occurs when training large-scale pre-trained models on small datasets and improves the accuracy of the image processing model. The image processing model training method provided in this application can thus achieve optimal accuracy while training only a very small number of parameters.

[0061] Figure 1 is a schematic diagram of an example network architecture to which this application applies. As shown in Figure 1, the network architecture includes servers and electronic devices.

[0062] The server can be a server cluster deployed in the cloud or a local device with computing capabilities. This server stores pre-trained models that have been pre-trained on large-scale datasets, such as the pre-trained ViT model. The server also stores image processing models obtained by inserting channel tuning modules into the transformer layers of the pre-trained models, and can acquire datasets for specific image processing tasks, which are used to train the image processing models. When training the image processing model, the server keeps the original parameters of the pre-trained model fixed and trains the parameters of the channel tuning modules in the image processing model based on the dataset of the specific image processing task. After training, an image processing model for performing the current image processing task is obtained. Furthermore, the server can send the model parameters of the obtained image processing model to a designated electronic device.

[0063] The electronic device can be a client device that requests an image processing model from a server to perform specific image processing tasks. Specifically, it can be a computing device deployed locally by the user or a server deployed in the cloud.

[0064] The electronic device provides a dataset for a specific image processing task to the server and receives an image processing model trained by the server based on that dataset. Based on the trained image processing model, the electronic device can then provide the functionality to perform image processing tasks.

[0065] For example, in the field of remote sensing, there is limited availability of labeled data. Taking remote sensing image recognition as an example, the server trains an image processing model based on the dataset for the remote sensing image recognition task. This model is obtained by inserting a channel tuning module into the transformer layer of a pre-trained model. During training, only the parameters of the inserted channel tuning module are updated, while the original parameters of the pre-trained model remain unchanged. After training, a target model for performing remote sensing image recognition tasks is obtained. The trained target model can be deployed to a local server or another cloud server to provide the function of performing remote sensing image recognition tasks. When a remote sensing image recognition task needs to be performed, a device with the corresponding target model acquires the remote sensing image to be recognized, inputs the remote sensing image into the trained target model for remote sensing image recognition, obtains the recognition result, and outputs the recognition result, or applies the recognition result to other functional modules.

[0066] For example, Figure 2 is a schematic diagram of an example system architecture to which this application applies. As shown in Figure 2, the system architecture includes cloud-side devices, edge devices, and data production devices. The cloud-side devices communicate with the edge devices via edge-cloud links, and each edge device communicates with multiple data production devices.

[0067] Cloud-side devices can be central cloud devices in a distributed cloud architecture, while edge devices are edge cloud devices in the same architecture. Data production devices include various terminal devices, including but not limited to smartphones, laptops, tablets, and smart home appliances.

[0068] Data production devices are responsible for producing, collecting, and uploading various types of data. Edge devices collect data from the data production devices within their coverage area and preprocess it to obtain high-value data (critical information). Edge devices can then upload both raw and high-value data to the cloud-side devices via the edge-cloud link. In addition to synchronizing data from edge devices, cloud-side devices are responsible for integrating data from different edge devices, performing data calculations according to preset rules, and synchronizing the results to the different edge devices.

[0069] Cloud-side devices offer superior computing and storage capabilities but are located relatively far from users, while edge devices are deployed over a wider area and are closer to users. Edge devices are an extension of cloud-side devices, allowing the computing power of cloud-side devices to be pushed down to the edge. Through integrated and collaborative management of the cloud and edge, business needs that cannot be met under a centralized cloud computing model can be addressed.

[0070] Based on the system architecture shown in Figure 2, in this embodiment the edge devices are responsible for collecting various types of data from data production devices within their coverage area, preprocessing the data to construct a dataset for image processing tasks, and uploading the dataset to the cloud-side devices. The cloud-side devices receive the image processing task datasets sent by the edge devices and integrate the datasets of the same image processing task from different edge devices to form a larger dataset. The cloud-side devices train the image processing model based on the integrated dataset. Furthermore, the cloud-side devices can distribute the model parameters of the trained image processing model to the edge devices, or deploy the image processing model according to a preset method.
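As a rough illustration of the integration step on the cloud side, the sketch below groups samples uploaded by several edge devices by task. The `Sample` record, the `merge_by_task` helper, and the task names are hypothetical; the description does not specify a data format or a merging rule.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    task: str      # hypothetical task identifier
    image_id: str  # stand-in for the image data itself
    label: str     # annotation information


def merge_by_task(uploads):
    """Merge the per-edge-device uploads into one dataset per task."""
    merged = defaultdict(list)
    for edge_samples in uploads:
        for sample in edge_samples:
            merged[sample.task].append(sample)
    return dict(merged)


edge_a = [Sample("land_cover", "a1", "forest"), Sample("land_cover", "a2", "water")]
edge_b = [Sample("land_cover", "b1", "urban"), Sample("ship_detection", "b2", "ship")]
datasets = merge_by_task([edge_a, edge_b])
```

Each value in `datasets` would then serve as the training set for one task-specific image processing model.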

[0071] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.

[0072] Figure 3 is a flowchart of an image processing model training method provided in an exemplary embodiment of this application. The method provided in this embodiment is executed by the server in the network architecture shown in Figure 1, or by the cloud-side device in the system architecture shown in Figure 2. As shown in Figure 3, the specific steps of the method are as follows:

[0073] Step S301: Obtain the dataset for the image processing task and the image processing model to be trained. The image processing model is obtained by inserting a channel tuning module into the transformer layer of the pre-trained model.

[0074] This embodiment can be applied to various image processing tasks such as image classification, image recognition, image segmentation, and image detection. When applied to different image processing tasks, the channel tuning module in the image processing model can be trained based on the dataset of the specific image processing task to obtain an image processing model suitable for the specific image processing task.

[0075] The image processing task dataset refers to the training dataset used to train the image processing model to obtain a model suitable for the current image processing task.

[0076] For example, for two different image classification tasks, the dataset for each task can be obtained separately, and the channel tuning module in the image processing model can be trained on each dataset to obtain a model specific to each task. The models for the different tasks share the same pre-trained parameters and differ only in the parameters of their channel tuning modules.

[0077] In this embodiment, when transferring the pre-trained model to a specific image processing task, a channel tuning module is inserted into the transformer layer of the pre-trained model. This module transforms the features of at least one target channel in the intermediate feature map extracted by the transformer layer, resulting in the image processing model to be trained. During the training of the image processing model, keeping the original parameters of the pre-trained model unchanged and only adjusting the parameters of the added channel tuning module significantly reduces the number of trainable parameters, avoiding overfitting of a large-scale model to a small training set.

[0078] Specifically, the backbone network of a pre-trained model typically includes multiple transformer layers. The image processing model to be trained can be obtained by inserting channel tuning modules into one or more transformer layers.

[0079] Preferably, the image processing model to be trained can be obtained by inserting a channel tuning module in each transformer layer to improve the expressive power of the trained image processing model.

[0080] For example, the pre-trained model can be a pre-trained ViT model. The ViT model has a large number of parameters, and when it is transferred to downstream tasks with small datasets, fine-tuning all parameters of the pre-trained ViT model can lead to overfitting. The backbone network of the ViT model contains 12 transformer layers; a channel tuning module is inserted into each of them to obtain the image processing model to be trained.

[0081] In addition, this embodiment does not specify the insertion position of the channel tuning module in the transformer layer.

[0082] Step S302: Input the sample images in the dataset into the image processing model, extract features from the sample images through the transformer layer, and transform the features of at least one target channel in the intermediate feature map extracted by the transformer layer through the channel tuning module inserted in the transformer layer.

[0083] Step S303: Determine the image processing result based on the features finally output by the transformer layer.

[0084] When training an image processing model, sample images from the dataset are input into the model for image processing. During this process, the transformer layer is used to extract features from the sample images. The image processing result of the sample images can be determined based on the features finally output by the transformer layer.

[0085] In this embodiment, since a channel tuning module is inserted into the transformer layer, the feature extraction performed on the sample image by the transformer layer proceeds as follows: the channel tuning module transforms the features of at least one target channel in the intermediate feature map generated at its insertion position in the transformer layer, obtaining intermediate transformed features. The intermediate transformed features are then input into the part of the model after the insertion position for further processing until the image processing result is obtained.

[0086] The features finally output by the transformer layer can be the features output by the last transformer layer in the image processing model. These features have been transformed by the channel tuning module in one or more transformer layers.
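A minimal sketch of this forward pass through one layer, assuming NumPy and a feature map with the channel dimension last. The channel tuning module is modeled here as a single K×K linear map on the selected channels and is placed between the two sub-layers; both choices, the stand-in sub-layer functions, and all names are assumptions for illustration, since the text above fixes neither the form of the transformation nor the insertion position.

```python
import numpy as np


def channel_tune(x, target_idx, weight):
    """Transform only the target channels of x (channels on the last axis)."""
    out = x.copy()
    # (..., K) @ (K, K): mix the selected channels, leave all others untouched.
    out[..., target_idx] = x[..., target_idx] @ weight
    return out


def layer_forward(x, first_sublayer, second_sublayer, target_idx, weight):
    """Two-sub-layer block with the module inserted between the sub-layers."""
    h = first_sublayer(x)                    # frozen pre-trained computation
    h = channel_tune(h, target_idx, weight)  # trainable module at the insertion position
    return second_sublayer(h)                # frozen pre-trained computation


rng = np.random.default_rng(0)
batch, tokens, channels, k = 2, 5, 8, 3
x = rng.standard_normal((batch, tokens, channels))
target_idx = np.array([1, 4, 6])             # hypothetical target channels
weight = rng.standard_normal((k, k))

y = channel_tune(x, target_idx, weight)
```

With an identity matrix as `weight`, the layer reproduces the output of the unmodified pre-trained layer, so in this sketch the module only changes behavior through what it learns.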

[0087] Step S304: Train the parameters of the channel tuning module in the image processing model based on the image processing results and the annotation information of the sample images to obtain the trained image processing model. The trained image processing model is used to process the input image to obtain the image processing results.

[0088] After obtaining the image processing results of the sample images, the loss can be calculated based on the image processing results and the annotation information of the sample images, and the parameters of the channel tuning module in the image processing model can be adjusted based on the loss. After multiple iterations, training is complete and the trained image processing model is obtained.
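The update rule can be sketched on a toy problem, assuming NumPy, a mean squared error loss, plain gradient descent, and a channel tuning module modeled as one K×K matrix. The frozen readout vector stands in for the pre-trained parameters; the loss, the optimizer, the data, and every name below are illustrative assumptions rather than the configuration used in this application.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, channels = 64, 6
target_idx = np.array([0, 2, 5])   # hypothetical target channels
k = len(target_idx)

# Stand-in for the pre-trained parameters: never updated below.
frozen_w = np.array([0.5, -1.0, 0.8, 0.3, -0.6, 1.0])
frozen_before = frozen_w.copy()


def predict(x, tuner):
    h = x.copy()
    h[:, target_idx] = x[:, target_idx] @ tuner  # channel tuning on target channels only
    return h @ frozen_w                          # frozen pre-trained readout


x = rng.standard_normal((n_samples, channels))    # toy features of the sample images
labels = predict(x, rng.standard_normal((k, k)))  # toy annotation information

tuner = np.eye(k)                                 # the only trainable parameters
learning_rate = 0.05
losses = []
for _ in range(200):
    err = predict(x, tuner) - labels
    losses.append(float(np.mean(err ** 2)))       # loss from results vs. annotations
    # Gradient of the loss with respect to the K x K matrix only.
    grad = (2.0 / n_samples) * np.outer(x[:, target_idx].T @ err, frozen_w[target_idx])
    tuner -= learning_rate * grad

initial_loss, final_loss = losses[0], losses[-1]
```

Only `tuner` is ever written to inside the loop, which mirrors the statement that the original parameters of the pre-trained model stay unchanged during training.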

[0089] The trained image processing model is used to perform the current image processing task. Specifically, the image to be processed is input into the trained image processing model, which then performs image processing on the input image to obtain the image processing result.

[0090] In this embodiment, an image processing model to be trained is obtained by inserting a channel tuning module into the transformer layer of the pre-trained model. The channel tuning module is used to transform the features of at least one target channel in the intermediate feature map extracted by the transformer layer. When transferring the pre-trained model to a specific downstream image processing task, the parameters of the channel tuning module in the image processing model are trained using the dataset of the specific image processing task, while keeping the original parameters of the pre-trained model unchanged. This can greatly reduce the number of trainable parameters and avoid the overfitting problem that is common in large-scale pre-trained models due to excessive fine-tuning, thereby improving the accuracy of the model when applied to specific image processing tasks.

[0091] In addition, directly adjusting only some of the channels in each transformer layer would destroy the integrity of the model and cause model degradation. This embodiment therefore uses an additional channel tuning module to transform the features of the important target channels, which preserves the integrity of the original pre-trained model, avoids its degradation, and achieves accuracy equal to or better than existing techniques with very few trainable parameters.

[0092] Referring to Figure 4, in an optional embodiment, for any transformer layer with an inserted channel tuning module, the channel tuning module inserted in the transformer layer transforms the features of at least one target channel in the intermediate feature map extracted by the transformer layer. Specifically, this can be achieved through the following steps:

[0093] Step S3021: Based on the insertion position of the channel tuning module in the transformer layer, extract the features of the target channel from the intermediate feature map obtained at the insertion position in the transformer layer.

[0094] In this embodiment, the number of target channels is denoted by K, a preset hyperparameter. The same K can be used for all datasets, and its value can be determined and pre-configured based on experimental data on public datasets. For example, experimental results on a certain public dataset show that transforming the features of only 32 target channels already achieves significant performance, with only 0.01M trainable parameters. Model performance increases with K, but the improvement of K=192 over K=96 is very small, while the number of parameters is four times larger. Balancing effectiveness and efficiency, the number of target channels can be taken in the range [32, 96]; for example, it can be set to 96 by default, in which case the features of 96 target channels are extracted in this step as the original features to be transformed in this transformer layer.
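As a rough sanity check on the figures quoted above, the following sketch counts trainable parameters under assumptions that are not stated in the source: each channel tuning module is taken to be a single K-to-K linear map with a bias term, and one module is assumed to be inserted in each of 12 transformer layers. The function name is hypothetical.

```python
def channel_tuning_param_count(k: int, num_layers: int = 12, bias: bool = True) -> int:
    """Count trainable parameters if every layer holds one K-to-K linear map (assumed design)."""
    per_layer = k * k + (k if bias else 0)  # weight matrix plus optional bias vector
    return per_layer * num_layers


if __name__ == "__main__":
    for k in (32, 96, 192):
        n = channel_tuning_param_count(k)
        print(f"K={k:>3}: {n} parameters ({n / 1e6:.2f}M)")
```

Under these assumptions, K=32 gives roughly 0.01M parameters, and K=192 gives close to four times as many parameters as K=96, which is consistent with the trend described in the paragraph above.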

[0095] Optionally, the selection of K target channels can be done by randomly selecting K channels as target channels.

[0096] Optionally, the selection of the K target channels can be based on the dataset of the current image processing task. The importance of features in each channel can be analyzed, and the feature weights of the channels can be determined. The richer and more prominent the features in a channel, the more important the channel, and the greater its feature weight. Based on the feature weights of the channels, the K most important channels can be selected as target channels.

[0097] In this step, the intermediate feature map generated at the insertion position of the channel tuning module in the transformer layer is obtained. The dimension of this intermediate feature map can be represented as B×L×C, where B is the number of sample images; L is the number of tokens in the transformer layer of the pre-trained model, that is, the number of blocks (patches) of the input image; and C is the number of channels in the transformer layer.

[0098] Furthermore, K target channel features are extracted from the intermediate feature map of B×L×C as the original features to be transformed, which can be represented as B×L×K.

[0099] Step S3022: The extracted features are linearly mapped by the channel tuning module to obtain the mapped features, and the mapped features are fused with the extracted features to obtain the fused features.

[0100] The channel tuning module performs a linear mapping on the extracted features of the K target channels to obtain the mapped features. The mapped features are then fused with the extracted features of the K target channels (the original features before transformation) to obtain the fused features. The dimension of the fused features is still B×L×K.

[0101] Optionally, when fusing the mapped features with the extracted features to obtain the fused features, the mapped features can be added to the extracted features to obtain the fused features.

[0102] Optionally, when fusing the mapped features with the extracted features to obtain the fused features, the mapped features and the extracted features can be weighted and summed according to preset weight coefficients to obtain the fused features. The preset weight coefficients include a first weight coefficient for the mapped features and a second weight coefficient for the extracted original features. The values of the first and second weight coefficients can be set and adjusted empirically, and are not specifically limited here.
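The two fusion options described in [0101] and [0102] can be sketched as follows. This is only an illustration using NumPy; the function names and the default coefficient values of 0.5 are placeholders chosen here, since the source leaves the coefficients to be set empirically.

```python
import numpy as np


def fuse_add(mapped: np.ndarray, extracted: np.ndarray) -> np.ndarray:
    """Fusion by element-wise addition of mapped and extracted features."""
    return mapped + extracted


def fuse_weighted(
    mapped: np.ndarray,
    extracted: np.ndarray,
    w_mapped: float = 0.5,      # first weight coefficient (placeholder value)
    w_extracted: float = 0.5,   # second weight coefficient (placeholder value)
) -> np.ndarray:
    """Fusion by weighted sum with preset coefficients."""
    return w_mapped * mapped + w_extracted * extracted
```

With both coefficients set to 1, the weighted sum reduces to plain addition, so the first option can be seen as a special case of the second. In both cases the fused features keep the B×L×K shape of the inputs.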

[0103] Step S3023: Replace the features of the target channel in the intermediate feature map with the fused features to obtain the intermediate transformation features.

[0104] The dimension of the fused features obtained in step S3022 above is still B×L×K. The fused features are split according to the channels to obtain the transformed features of K target channels. The features of the target channels in the intermediate feature map are replaced with the corresponding transformed features, so that the fused features are updated in the intermediate feature map to obtain the intermediate transformed features.

[0105] Step S3024: Input the intermediate transformation features into the part after the insertion position in the image processing model, and perform subsequent image processing to determine the image processing result of the sample image.

[0106] For example, Figure 5 is a framework diagram for transforming the features of the target channels in an intermediate feature map, provided by an exemplary embodiment of this application. As shown in Figure 5, the intermediate feature map obtained at the insertion position in the transformer layer has a dimension of B×L×C. K target channels are selected based on the channel feature weights, and the features of these K target channels are extracted from the intermediate feature map. The extracted original features have a dimension of B×L×K. The channel tuning module performs a linear mapping on the extracted original features to obtain mapped features, which also have a dimension of B×L×K. The mapped features are then fused with the extracted original features to obtain fused features, which also have a dimension of B×L×K. Finally, the fused features replace the features of the K target channels in the intermediate feature map, resulting in the transformed intermediate features.

[0107] In this embodiment, the channel tuning module inserted into the transformer layer is a linear mapping layer with very few trainable parameters. When transforming the features of the K target channels extracted from the intermediate feature map, the linear mapping layer linearly maps the extracted original features, and the mapped features are then fused with the extracted original features. The fused features replace the features of the target channels in the intermediate feature map, yielding the transformed intermediate features. This greatly reduces the number of trainable parameters and makes the final output features of the transformer layer rich and salient, so that the image processing model has good expressive power on the current image processing task and its accuracy is improved.
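Steps S3021 to S3023 can be illustrated with a small NumPy sketch. It assumes additive fusion, a K×K weight matrix with a bias, and a simple random initialisation; none of these details are fixed by the source, and the class name `ChannelTuning` is hypothetical.

```python
import numpy as np


class ChannelTuning:
    """Hypothetical sketch of a channel tuning module: a linear map over K target channels."""

    def __init__(self, target_channels, seed: int = 0):
        self.idx = np.asarray(target_channels, dtype=np.int64)  # indices of the K target channels
        k = self.idx.size
        rng = np.random.default_rng(seed)
        # Trainable parameters; the initialisation scheme is an assumption of this sketch.
        self.weight = rng.normal(scale=0.02, size=(k, k))
        self.bias = np.zeros(k)

    def __call__(self, feature_map: np.ndarray) -> np.ndarray:
        extracted = feature_map[..., self.idx]        # S3021: B x L x K original features
        mapped = extracted @ self.weight + self.bias  # S3022: linear mapping, still B x L x K
        fused = mapped + extracted                    # S3022: fusion by addition
        out = feature_map.copy()
        out[..., self.idx] = fused                    # S3023: replace the target channels
        return out                                    # S3024: passed on to the later layers
```

For a feature map of shape B×L×C, the output has the same shape, and only the K selected channels differ from the input; all other channels pass through unchanged.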

[0108] Based on any of the above embodiments, the transformer layer in the backbone network of the pre-trained model typically includes a first sub-layer and a second sub-layer, with the output of the first sub-layer serving as the input of the second sub-layer.

[0109] Taking the ViT model as an example, the first sub-layer of the transformer layer in the backbone network of the ViT model is the Multi-Head Self-Attention (MHSA) module, which includes two sub-layers: the MHSA layer and the LayerNorm. The second sub-layer is the Multilayer Perceptron (MLP) module, which includes two sub-layers: the MLP and the LayerNorm.

[0110] Optionally, the channel tuning module is inserted in the conversion layer between the first sub-layer and the second sub-layer, that is, after the first sub-layer and before the second sub-layer of the conversion layer.

[0111] For example, taking the ViT model as an example, as shown in Figure 6, the channel tuning module is inserted after the multi-head self-attention (MHSA) layer and before the normalization layer (LayerNorm) in the multilayer perceptron module, that is, between the first sub-layer and the second sub-layer.

[0112] Furthermore, in step S3021 above, features of the target channel are extracted from the intermediate feature map output by the first sub-layer of the conversion layer according to the insertion position of the channel tuning module in the conversion layer.

[0113] In step S3024 above, the intermediate transformed features obtained after the transformation are input into the second sub-layer after the insertion position.

[0114] Optionally, the channel tuning module is inserted in the conversion layer after the second sub-layer.

[0115] For example, taking the ViT model as an example, as shown in Figure 7, the channel tuning module is inserted after the normalization layer (LayerNorm) of the multilayer perceptron module, that is, after the second sub-layer.

[0116] Furthermore, in step S3021 above, features of the target channel are extracted from the intermediate feature map output by the second sub-layer of the conversion layer according to the insertion position of the channel tuning module in the conversion layer.

[0117] In step S3024 above, the intermediate transformed features obtained after the transformation are input into the next layer (transformer layer or other layer) after the insertion position.

[0118] Optionally, the channel tuning module can be inserted in the first sub-layer, such as between the normalization layer (LayerNorm) and the multi-head self-attention layer in the first sub-layer of the ViT model; or, the channel tuning module can be inserted in the second sub-layer, such as between the normalization layer (LayerNorm) and the multilayer perceptron (MLP) in the second sub-layer of the ViT model. In this embodiment, the channel tuning module can be inserted at any position in the transformation layer, and no specific limitation is made here.

[0119] In this embodiment, inserting a channel tuning module between the first and second sub-layers of the transformation layer is a preferred approach. After collecting long-term dependencies with the multi-head self-attention (MHSA) module, the features contain more significant and important channels, which can better adapt to downstream image processing tasks, such as image classification, image segmentation, image detection, and image recognition, so that the model has better performance when applied to specific downstream image processing tasks.
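The two main insertion positions discussed above can be sketched as a wrapper around a single transformer layer. This is a simplified illustration: the two sub-layers and the channel tuning module are passed in as plain callables, residual connections are assumed to live inside the sub-layers, and the function and parameter names are hypothetical.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
SubLayer = Callable[[T], T]


def run_transformer_layer(
    x: T,
    first_sublayer: SubLayer,
    second_sublayer: SubLayer,
    channel_tuning: Optional[SubLayer] = None,
    position: str = "between",
) -> T:
    """Run one transformer layer with an optional channel tuning module.

    position="between": after the first sub-layer and before the second sub-layer.
    position="after":   after the second sub-layer.
    """
    if position not in ("between", "after"):
        raise ValueError("position must be 'between' or 'after'")
    h = first_sublayer(x)                       # e.g. the MHSA sub-layer
    if channel_tuning is not None and position == "between":
        h = channel_tuning(h)                   # transform target channels of the MHSA output
    h = second_sublayer(h)                      # e.g. the MLP sub-layer
    if channel_tuning is not None and position == "after":
        h = channel_tuning(h)                   # transform target channels of the layer output
    return h
```

The default of `"between"` mirrors the preferred placement described in [0119]; other positions inside a sub-layer would need the sub-layer itself to be split, which this sketch does not attempt.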

[0120] Based on any of the above embodiments, when selecting the K target channels whose features are transformed by the channel tuning module, K channels can be randomly selected from among the channels of the intermediate feature map extracted by the transformer layer as the target channels of that layer, which can also give the model good performance.

[0121] In one optional embodiment, when selecting K target channels for any conversion layer, the feature weights of each channel in the conversion layer of the pre-trained model applied to the image processing task can be determined based on the dataset of the image processing task; based on the feature weights of each channel in the conversion layer, a preset number (K) of channels are selected as the target channels of the conversion layer. The preset number K is greater than or equal to 1.

[0122] The feature weights of each channel can be determined by analyzing the importance of features in each channel based on the dataset of the current image processing task. The richer and more obvious the features in a channel, the more important the channel is, and the greater the feature weight of the channel.

[0123] Specifically, the feature weights of each channel in the conversion layer are sorted, and the K channels with the largest feature weights are selected as the target channels of the conversion layer. For example, the feature weights of each channel in the conversion layer are sorted in descending order, and the top K channels are selected as the target channels of the conversion layer based on the sorting result.

[0124] In this embodiment, the importance of each channel in each transformation layer can be analyzed based on the current dataset. K important channels are selected as target channels in each transformation layer. The K target channels selected in different transformation layers can be different. Different target channels are used when applied to different datasets, which can improve the performance of the model.
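The top-K selection described in [0123] amounts to a descending sort over the per-channel feature weights. A minimal sketch in plain Python follows; the function name is hypothetical, and channel indices are 0-based here, whereas the text counts channels from 1.

```python
from typing import List, Sequence


def select_target_channels(feature_weights: Sequence[float], k: int) -> List[int]:
    """Return the indices of the K channels with the largest feature weights."""
    if not 1 <= k <= len(feature_weights):
        raise ValueError("k must satisfy 1 <= k <= number of channels")
    # Sort channel indices by weight in descending order; ties keep the lower index first.
    order = sorted(range(len(feature_weights)), key=lambda i: -feature_weights[i])
    return order[:k]


if __name__ == "__main__":
    # Different layers can end up with different target channels.
    weights_per_layer = {1: [0.1, 0.9, 0.4, 0.7], 2: [0.8, 0.2, 0.6, 0.3]}
    for layer, weights in weights_per_layer.items():
        print(layer, select_target_channels(weights, k=2))
```

Calling the function once per transformer layer, with that layer's own weights, gives the per-layer target channels mentioned in [0124].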

[0125] Specifically, based on the dataset of the image processing task, the feature weights of each channel in the transformation layer of the pre-trained model are determined when applied to the image processing task. This can be achieved in the following way:

[0126] The sample images in the dataset are input into the pre-trained model, and features are extracted from the sample images through the transformer layer of the pre-trained model. According to the insertion position of the channel tuning module within the transformer layer of the image processing model, the intermediate feature map extracted by the transformer layer of the pre-trained model at that position is obtained. The L2 normalized value of the features of each channel in this intermediate feature map is used as the feature weight of that channel in the transformer layer.

[0127] For example, taking the ViT model as the pre-trained model, the backbone network of the ViT model contains 12 transformer layers. Sample images from the dataset are input into the ViT model. Based on the insertion position of the channel tuning module, the features generated at that insertion position are extracted from each transformer layer as intermediate feature maps. Let $l$ denote any transformer layer; the intermediate feature map extracted by that layer can be written as $f^{l}$, where the superscript $l$ indicates the transformer layer and can take any integer value in the range $[1, 12]$. The dimension of $f^{l}$ is $B \times L \times C$, where $B$ is the number of sample images in the dataset; $L$ is the number of tokens in the transformer layer of the pre-trained model, i.e., the number of blocks of the input image; and $C$ is the number of channels in the transformer layer. Let $f_{i}^{l}$ denote the features of channel $i$ in the intermediate feature map $f^{l}$, where $i$ denotes any channel and can take any integer value in the interval $[1, C]$. To eliminate the influence of image offset, the value $\lVert f_{i}^{l} \rVert_{2}$ obtained by performing L2 normalization on the features of channel $i$ in $f^{l}$ is used as the feature weight of channel $i$. The feature weights of the channels are concatenated in order into a vector, $\mathrm{Concat}\left(\lVert f_{1}^{l} \rVert_{2}, \lVert f_{2}^{l} \rVert_{2}, \ldots, \lVert f_{C}^{l} \rVert_{2}\right)$, where $\lVert f_{i}^{l} \rVert_{2}$ denotes the value of $f_{i}^{l}$ after L2 normalization and $\mathrm{Concat}(\cdot)$ concatenates the L2 normalized values of the individual channels. The dimension of this vector is $1 \times C$. The channels are sorted according to their feature weights, and the K channels with the highest feature weights are selected as the target channels of transformer layer $l$.
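A minimal NumPy sketch of the per-channel weight computation follows. It assumes that the weight of channel i is the L2 norm of all B×L values of that channel in the intermediate feature map; the source does not spell out the reduction axes, so this is one plausible reading, and the function name is hypothetical.

```python
import numpy as np


def channel_feature_weights(feature_map: np.ndarray) -> np.ndarray:
    """Map a B x L x C intermediate feature map to a 1 x C vector of channel weights."""
    if feature_map.ndim != 3:
        raise ValueError("expected a feature map of shape (B, L, C)")
    b, l, c = feature_map.shape
    flat = feature_map.reshape(b * l, c)                 # one column per channel
    return np.linalg.norm(flat, axis=0, keepdims=True)   # L2 norm per channel, shape 1 x C
```

The resulting 1×C vector can then be sorted in descending order to pick the K target channels of the layer.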

[0128] The specific implementation method for obtaining the intermediate feature map is the same as the method for obtaining the intermediate feature map from the insertion position in the transformation layer in S3021 above, and will not be repeated here.

[0129] In an optional embodiment, feature weights for each channel in the conversion layer of the pre-trained model can be determined based on a public dataset when applied to that dataset. A predetermined number (K) of channels are selected as target channels for the conversion layer based on these feature weights. When applied to downstream image processing tasks, the target channels determined based on the public dataset are used.

[0130] However, because the importance of each channel varies from dataset to dataset in practical applications, it is recommended to analyze the importance of each channel on the dataset of the specific image processing task, select the target channels accordingly, and train the image processing model on that basis. This can improve the accuracy of the model when applied to the specific image processing task.

[0131] In one optional embodiment, the image processing task is an image classification task, and the annotation information of the sample images is the category information of the sample images. Based on the dataset of the image processing task, the feature weights of each channel in the transformation layer of the pre-trained model applied to the image processing task are determined, which can be implemented in the following way:

[0132] Based on the category information of the sample images in the dataset, the sample images of each category are input into the pre-trained model, and features are extracted from the sample images through the transformer layer of the pre-trained model. According to the insertion position of the channel tuning module within the transformer layer of the image processing model, the intermediate feature map extracted by the transformer layer of the pre-trained model at that position is obtained. The L2 normalized value of the features of each channel in this intermediate feature map is used as the feature weight of that channel for the category. The mean of the per-category feature weights of each channel is then calculated and used as the feature weight of that channel in the transformer layer.

[0133] For example, taking the ViT model as the pre-trained model, the backbone network of the ViT model contains 12 transformer layers. When applied to an image classification task, the dataset contains sample images of multiple categories. Taking the Caltech101 dataset as an example, it contains 101 classes and 1000 images. The sample images are grouped according to the classes in the dataset, with each group containing sample images of the same class. Let $M$ denote the number of classes in the dataset, and let $N_{c}$ denote any group, corresponding to one category, where $c$ can be any integer in the interval $[1, M]$. Taking the Caltech101 dataset as an example, $M = 101$, and $c$ can take any integer in the range $[1, 101]$, so $N_{c}$ can represent any category. Let $B_{N_{c}}$ denote the number of sample images in group $N_{c}$. For each group $N_{c}$, the sample images of the group are input into the ViT model. Based on the insertion position of the channel tuning module, the features generated at that insertion position are extracted from each transformer layer as intermediate feature maps. Let $l$ denote any transformer layer; the intermediate feature map extracted by that layer can be written as $f_{N_{c}}^{l}$, where the superscript $l$ indicates the transformer layer and can take any integer in the range $[1, 12]$, and the subscript $N_{c}$ indicates the corresponding group (category). The dimension of $f_{N_{c}}^{l}$ is $B_{N_{c}} \times L \times C$, where $B_{N_{c}}$ is the number of sample images in group $N_{c}$; $L$ is the number of tokens in the transformer layer of the pre-trained model, i.e., the number of blocks of the input image; and $C$ is the number of channels in the transformer layer. Let $f_{N_{c},i}^{l}$ denote the features of channel $i$ in the intermediate feature map $f_{N_{c}}^{l}$, where $i$ denotes any channel and can take any integer value in the interval $[1, C]$. To eliminate the influence of image offset, the value $\lVert f_{N_{c},i}^{l} \rVert_{2}$ obtained by performing L2 normalization on the features of channel $i$ in $f_{N_{c}}^{l}$ is used as the feature weight of channel $i$ for category $N_{c}$. The feature weights of the channels are concatenated in order into a vector, $\mathrm{Concat}\left(\lVert f_{N_{c},1}^{l} \rVert_{2}, \ldots, \lVert f_{N_{c},C}^{l} \rVert_{2}\right)$, where $\mathrm{Concat}(\cdot)$ concatenates the L2 normalized values of the individual channels; the dimension of this vector is $1 \times C$. The final feature weight of each channel is determined by calculating the mean of the feature weights over the categories, which can be expressed as $Z^{l} = \frac{1}{M} \sum_{c=1}^{M} \mathrm{Concat}\left(\lVert f_{N_{c},1}^{l} \rVert_{2}, \ldots, \lVert f_{N_{c},C}^{l} \rVert_{2}\right)$, where $Z^{l}$ represents the final feature weights of the channels. The channels are sorted according to $Z^{l}$, and the K channels with the highest feature weights in $Z^{l}$ are selected as the target channels of transformer layer $l$.

[0134] In this embodiment, when applied to image classification tasks, the feature weights of each channel are estimated by taking into account the fact that the dataset contains multiple categories. When analyzing the importance of channels, the influence of the categories is fully considered, rather than estimating channel importance over the dataset as a whole. This allows important channels to be selected more accurately as target channels. Training the model on this basis improves its accuracy and performance when applied to image classification tasks.
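The class-aware weighting can be sketched as follows: compute one weight vector per category, then average over categories. As with the sketch for the dataset-level weights, the L2 norm is assumed to be taken over all values of a channel within a group, and the function name is hypothetical.

```python
import numpy as np


def class_aware_channel_weights(feature_map: np.ndarray, labels) -> np.ndarray:
    """Average per-category channel weights into a single 1 x C vector.

    feature_map: B x L x C features taken at the insertion position of one layer.
    labels:      length-B sequence of category ids, one per sample image.
    """
    labels = np.asarray(labels)
    if labels.shape[0] != feature_map.shape[0]:
        raise ValueError("labels must contain one entry per sample image")
    per_class = []
    for cls in np.unique(labels):
        group = feature_map[labels == cls]                 # samples of one category
        flat = group.reshape(-1, group.shape[-1])          # (group size * L) x C
        per_class.append(np.linalg.norm(flat, axis=0))     # weight of each channel for this category
    return np.mean(per_class, axis=0, keepdims=True)       # mean over categories, shape 1 x C
```

Because every category contributes one vector to the mean regardless of how many images it has, categories with few samples are not drowned out by larger ones, which is the effect the paragraph above describes.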

[0135] Figure 8 is a flowchart of an image processing model training method provided by an exemplary embodiment of this application. The method provided in this embodiment is executed by the server in the network architecture shown in Figure 1. As shown in Figure 8, the specific steps of this method are as follows:

[0136] Step S801: Receive the dataset of the image processing task sent by the user equipment.

[0137] In practical applications, when a user wants to obtain an image processing model for performing a specific image processing task, they can acquire the dataset for the current image processing task through their user device and upload it to the server. The server receives the dataset for the image processing task sent by the user device and, through subsequent steps S802-S805, trains the image processing model based on the dataset for the current image processing task.

[0138] Step S802: Obtain the image processing model to be trained. The image processing model is obtained by inserting a channel tuning module into the conversion layer of the pre-trained model.

[0139] Step S803: Input the sample images in the dataset into the image processing model, extract features from the sample images through the transformation layer, and transform the features of at least one target channel in the intermediate feature map extracted by the transformation layer through the channel tuning module inserted in the transformation layer.

[0140] Step S804: Determine the image processing result based on the features of the final output of the conversion layer.

[0141] Step S805: Train the parameters of the channel tuning module in the image processing model based on the image processing results and the annotation information of the sample images to obtain the trained image processing model.

[0142] Step S806: Output the model parameters of the trained image processing model to the user device.

[0143] In this embodiment, the specific implementation of steps S802-S805 is described in the model training process related to steps S301-S304 in the previous embodiment, and will not be repeated here.
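The idea behind steps S803 to S805, training only the channel tuning parameters while the pre-trained parameters stay fixed, can be illustrated with a toy NumPy example. Everything here is a stand-in chosen for illustration: random features replace the backbone output, a fixed random matrix `head` plays the role of the frozen pre-trained parameters, the loss is mean squared error, and the gradients are written out by hand. A real implementation would typically use an automatic differentiation framework.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, K, OUT = 16, 8, 3, 2

features = rng.normal(size=(B, C))     # stand-in for features from the frozen backbone
targets = rng.normal(size=(B, OUT))    # stand-in for the annotation information
head = rng.normal(size=(C, OUT))       # frozen pre-trained parameters (never updated)
head_before = head.copy()

idx = np.array([1, 4, 6])              # K target channels, assumed already selected
weight = np.zeros((K, K))              # trainable parameters of the channel tuning module
bias = np.zeros(K)


def forward(weight, bias):
    extracted = features[:, idx]
    fused = extracted + extracted @ weight + bias   # linear mapping plus additive fusion
    tuned = features.copy()
    tuned[:, idx] = fused                           # replace the target channels
    return extracted, tuned @ head


def mse(pred):
    return float(np.mean((pred - targets) ** 2))


initial_loss = mse(forward(weight, bias)[1])
lr = 0.005
for _ in range(300):
    extracted, pred = forward(weight, bias)
    grad_pred = 2.0 * (pred - targets) / pred.size   # dLoss/dpred
    grad_fused = grad_pred @ head[idx].T             # gradient reaching the K target channels
    weight -= lr * (extracted.T @ grad_fused)        # update the module weight only
    bias -= lr * grad_fused.sum(axis=0)              # update the module bias only
final_loss = mse(forward(weight, bias)[1])

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Only `weight` and `bias` change during the loop; `head` is read but never written, mirroring the requirement that the original parameters of the pre-trained model remain unchanged.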

[0144] This embodiment provides a system architecture for image processing model training methods in practical applications.

[0145] The image processing model training method provided in this application can be applied to fields such as medicine and remote sensing, where it is difficult to obtain a large amount of training data, and can be used for image processing tasks on medical images, remote sensing images, and the like. Figure 9 is a flowchart of a training method for a remote sensing image processing model provided by an exemplary embodiment of this application. The method provided in this embodiment is executed by the server in the network architecture shown in Figure 1. As shown in Figure 9, the specific steps of this method are as follows:

[0146] Step S901: Receive the remote sensing image dataset sent by the user equipment. The remote sensing image dataset contains multiple remote sensing images and their annotation information.

[0147] In this embodiment, taking the application in the field of remote sensing as an example, when a user wants to obtain an image processing model for performing a remote sensing image processing task, they can obtain the remote sensing image dataset for the current image processing task through the user device and upload the remote sensing image dataset to the server. The remote sensing image dataset contains multiple remote sensing images and their annotation information.

[0148] The server receives the remote sensing image dataset sent by the user equipment and, through subsequent steps S902-S905, trains an image processing model based on the dataset of the current image processing task.

[0149] Step S902: Obtain the image processing model to be trained. The image processing model is obtained by inserting a channel tuning module into the conversion layer of the pre-trained model.

[0150] Step S903: Input the remote sensing image into the image processing model, extract features from the remote sensing image through the transformation layer, and transform the features of at least one target channel in the intermediate feature map extracted by the transformation layer through the channel tuning module inserted in the transformation layer.

[0151] Step S904: Determine the image processing result of the remote sensing image based on the characteristics of the final output of the conversion layer.

[0152] Step S905: Based on the image processing results and annotation information of the remote sensing image, train the parameters of the channel tuning module in the image processing model to obtain the trained image processing model.

[0153] Step S906: Output the model parameters of the trained image processing model to the user device.

[0154] In this embodiment, the specific implementation of steps S902-S905 is described in the model training process related to steps S301-S304 in the previous embodiment, and will not be repeated here.

[0155] This embodiment provides a system architecture for applying image processing model training methods to remote sensing image processing tasks.

[0156] Figure 10 is a flowchart of an image processing method provided by an exemplary embodiment of this application. The method provided in this embodiment is executed by the electronic device in the network architecture shown in Figure 1 that is responsible for performing image processing tasks using the trained image processing model. As shown in Figure 10, the specific steps of this method are as follows:

[0157] Step S1001: Obtain the image to be processed.

[0158] Step S1002: Input the image into the trained image processing model, extract features from the image through the transformation layer of the image processing model, and transform the features of at least one target channel in the intermediate feature map extracted by the transformation layer through the channel tuning module inserted in the transformation layer.

[0159] In this step, the image features are extracted through the transformation layer of the image processing model, and the features of at least one target channel in the intermediate feature map extracted by the transformation layer are transformed through the channel tuning module inserted in the transformation layer. The specific implementation method is similar to the process of processing the sample image in step S302 above. Please refer to the relevant content in the above embodiment for details, which will not be repeated here.

[0160] Specifically, the image processing model can be a model that performs any one of the image processing tasks, such as image classification, image recognition, image segmentation, or image detection.

[0161] Step S1003: Determine the image processing result of the image based on the features of the final output of the conversion layer.

[0162] In this embodiment, a channel tuning module is inserted into the transformer layer of the image processing model. After the image to be processed is input into the image processing model, and while features are being extracted from the image through the transformer layer, the channel tuning module transforms the features of at least one target channel in the intermediate feature map generated at its insertion position, yielding intermediate transformed features. The intermediate transformed features are then input into the part of the model after the insertion position for further processing until the image processing result is obtained.

[0163] The final output feature of the transformation layer can be the feature output by the last transformer layer in the image processing model. This feature has been transformed by the channel tuning module in one or more transformer layers.

[0164] Step S1004: Output the image processing results.

[0165] The method in this embodiment can improve the accuracy of image processing.

[0166] Figure 11 is a schematic diagram of an image processing model training system provided by an exemplary embodiment of this application. As shown in Figure 11, the image processing model training system includes an end-side device 1101 and a cloud-side device 1102 communicatively connected to the end-side device 1101. The end-side device 1101 constructs multiple sets of training data to form a dataset for the image processing task and uploads the dataset to the cloud-side device 1102. The cloud-side device 1102 trains the image processing model based on the dataset and, when the model's loss function converges, sends the trained model parameters to the end-side device 1101.

[0167] Specifically, the end-side device 1101 is used to construct a dataset for the image processing task and send the dataset to the cloud-side device 1102.

[0168] The cloud-side device 1102 is used for: receiving a dataset for an image processing task; obtaining an image processing model to be trained, which is obtained by inserting a channel tuning module into the transformation layer of a pre-trained model; inputting sample images from the dataset into the image processing model; extracting features from the sample images through the transformation layer; and transforming the features of at least one target channel in the intermediate feature map extracted by the transformation layer through the channel tuning module inserted in the transformation layer; determining the image processing result based on the features finally output by the transformation layer; and training the parameters of the channel tuning module in the image processing model based on the image processing result and the annotation information of the sample images to obtain a trained image processing model.

[0169] The cloud-side device 1102 is also used to send the model parameters of the trained image processing model to the end-side device 1101.

[0170] Optionally, the cloud-side device 1102 can also deploy the trained image processing model in a preset manner, such as deploying it to a cloud server.

[0171] In this embodiment, the end-side device 1101 can be an edge cloud device deployed at the network edge on various network platforms. It is responsible for collecting the various types of data generated by terminal devices within its coverage area, preprocessing the data, constructing a dataset for the image processing task, and uploading the dataset to the cloud-side device. The end-side device 1101 can be a server-side device such as a conventional server, cloud server, or server array. Terminal devices include, but are not limited to, desktop computers, laptops, and smartphones.

[0172] The cloud-side device 1102 can be a central cloud device deployed in the network center on various network platforms, or a server-side device such as a conventional server, cloud server, or server array. The cloud-side device receives image processing task datasets sent by the end-side devices and integrates datasets of the same image processing task from different end-side devices to form a larger dataset. The cloud-side device trains its image processing model based on this integrated large dataset.

[0173] In this embodiment, the cloud-side device 1102 trains the image processing model based on the dataset of the image processing task to obtain the trained image processing model. For details, please refer to the relevant content of steps S301-S304 in the above method embodiment, which will not be repeated here.

[0174] Figure 12 is a schematic diagram of an image processing model training apparatus provided in an example embodiment of this application. The apparatus provided in this embodiment is used to perform the above-described image processing model training method. As shown in Figure 12, the image processing model training apparatus 120 includes: a data acquisition unit 1201, an image processing unit 1202, and a parameter training unit 1203.

[0175] The data acquisition unit 1201 is used to acquire the dataset for the image processing task and the image processing model to be trained. The image processing model is obtained by inserting a channel tuning module into the transformation layer of the pre-trained model.

[0176] The image processing unit 1202 is used to input sample images from the dataset into the image processing model, extract features from the sample images through the transformation layer, and transform the features of at least one target channel in the intermediate feature map extracted by the transformation layer through the channel tuning module inserted in the transformation layer; and determine the image processing result based on the features finally output by the transformation layer.

[0177] The parameter training unit 1203 is used to train the parameters of the channel tuning module in the image processing model based on the image processing results and the annotation information of the sample images, so as to obtain the trained image processing model. The trained image processing model is used to process the input image to obtain the image processing results.
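The division of labor among the three units can be illustrated with a toy training loop in which only the channel tuning parameters are updated. This is a sketch under stated assumptions, not the implementation of this application: the frozen matrix `head` stands in for the pre-trained parameters, the features `x` and `labels` are random placeholders, the loss is softmax cross-entropy, the fusion is plain addition, and the `(K, K)` shape of the linear mapping is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, K, CLASSES = 32, 8, 3, 4           # samples, channels, target channels, classes
target = np.array([1, 4, 6])             # hypothetical target channel indices

x = rng.normal(size=(N, C))              # placeholder intermediate features
labels = rng.integers(0, CLASSES, size=N)
head = rng.normal(size=(C, CLASSES))     # stand-in for frozen pre-trained parameters
head_before = head.copy()

W = np.zeros((K, K))                     # trainable channel tuning parameters
b = np.zeros(K)


def forward(W, b):
    feats = x.copy()
    sel = x[:, target]
    feats[:, target] = sel + sel @ W + b  # linear mapping fused by addition
    logits = feats @ head
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), labels]).mean()
    return probs, loss


losses = []
lr = 0.01
for _ in range(300):
    probs, loss = forward(W, b)
    losses.append(loss)
    grad_logits = probs.copy()
    grad_logits[np.arange(N), labels] -= 1.0
    grad_logits /= N
    grad_feats = grad_logits @ head.T     # gradient passes through the frozen head
    g = grad_feats[:, target]             # only target channels depend on W and b
    W -= lr * (x[:, target].T @ g)        # update tuning parameters only
    b -= lr * g.sum(axis=0)
```

In this sketch `head` is never written to, so the number of trainable values is `K * K + K` rather than the full parameter count of the model.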

[0178] In an optional embodiment, when implementing the transformation processing of features of at least one target channel in the intermediate feature map extracted by the transformation layer through the channel tuning module inserted in the transformation layer, the image processing unit 1202 is further configured to:

[0179] Based on the insertion position of the channel tuning module in the transformation layer, features of the target channel are extracted from the intermediate feature map obtained at the insertion position in the transformation layer. The extracted features are linearly mapped by the channel tuning module to obtain mapped features, and the mapped features are fused with the extracted features to obtain fused features. The features of the target channel in the intermediate feature map are replaced with the fused features to obtain intermediate transformation features. The intermediate transformation features are input into the part after the insertion position in the image processing model for subsequent image processing to determine the image processing result of the sample image.
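The extract, map, fuse, and replace steps described above can be sketched in NumPy as follows. The function name `channel_tuning`, the `(tokens, channels)` layout of the feature map, and the `(K, K)` shape of the linear mapping are assumptions made for illustration; fusion by addition is one of the options described in this application.

```python
import numpy as np


def channel_tuning(feature_map, target_channels, weight, bias):
    """Transform only the target channels of an intermediate feature map.

    feature_map: array of shape (tokens, channels)
    target_channels: indices of the K target channels
    weight, bias: linear mapping parameters of shape (K, K) and (K,)
    """
    extracted = feature_map[:, target_channels]  # extract target-channel features
    mapped = extracted @ weight + bias           # linear mapping
    fused = mapped + extracted                   # fuse mapped and extracted features
    out = feature_map.copy()                     # leave the input untouched
    out[:, target_channels] = fused              # replace the target channels
    return out


feature_map = np.arange(12, dtype=float).reshape(3, 4)
tuned = channel_tuning(
    feature_map,
    target_channels=[1, 3],
    weight=0.5 * np.eye(2),
    bias=np.array([10.0, 20.0]),
)
```

Channels 0 and 2 pass through unchanged; only channels 1 and 3 are modified, and the result is what the part of the model after the insertion position would receive.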

[0180] In an optional embodiment, when fusing the mapped features with the extracted features to obtain the fused features, the image processing unit 1202 is further configured to:

[0181] The mapped features are added to the extracted features to obtain the fused features;

[0182] or,

[0183] Based on the preset weight coefficients, the mapped features and the extracted features are weighted and summed to obtain the fused features.
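The two fusion options can be written as a single helper. The function name `fuse` and the convention of passing the preset weight coefficients as a pair are hypothetical choices for this sketch.

```python
import numpy as np


def fuse(mapped, extracted, weights=None):
    """Fuse mapped and extracted features.

    With weights=None the features are simply added; otherwise `weights` is a
    pair of preset coefficients (w_mapped, w_extracted) for a weighted sum.
    """
    if weights is None:
        return mapped + extracted
    w_mapped, w_extracted = weights
    return w_mapped * mapped + w_extracted * extracted


mapped = np.array([1.0, 2.0])
extracted = np.array([3.0, 4.0])
added = fuse(mapped, extracted)
weighted = fuse(mapped, extracted, weights=(0.25, 0.75))
```

Plain addition is the special case of the weighted sum in which both coefficients equal 1.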

[0184] In an optional embodiment, the transformation layer includes a first sub-layer and a second sub-layer, the output of the first sub-layer is the input of the second sub-layer, and the channel tuning module is inserted in the transformation layer between the first sub-layer and the second sub-layer.

[0185] When extracting features of the target channel from the intermediate feature map obtained at the insertion position in the transformation layer, the image processing unit 1202 is further configured to: extract features of the target channel from the intermediate feature map output by the first sub-layer of the transformation layer.

[0186] In an optional embodiment, the transformation layer includes a first sub-layer and a second sub-layer, the output of the first sub-layer is the input of the second sub-layer, and the channel tuning module is inserted in the transformation layer after the second sub-layer.

[0187] When extracting features of the target channel from the intermediate feature map obtained at the insertion position in the transformation layer, the image processing unit 1202 is also used to: extract features of the target channel from the intermediate feature map output from the second sub-layer.
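The two insertion positions can be compared with a small sketch. The sub-layers and the tuning step are passed in as plain callables, and the names `run_layer`, `"between"`, and `"after"` are hypothetical. The toy sub-layers below are placeholders, not real transformer sub-layers, and details such as residual connections are omitted.

```python
import numpy as np


def run_layer(x, first_sublayer, second_sublayer, tuning, position):
    """Run one layer with the tuning step at the given insertion position.

    position: "between" (after the first sub-layer, before the second)
              or "after" (after the second sub-layer).
    """
    if position not in ("between", "after"):
        raise ValueError("position must be 'between' or 'after'")
    h = first_sublayer(x)
    if position == "between":
        h = tuning(h)          # operates on the first sub-layer's output
    h = second_sublayer(h)
    if position == "after":
        h = tuning(h)          # operates on the second sub-layer's output
    return h


def tune_channel_0(h):
    """Toy tuning step: shift channel 0 only."""
    out = h.copy()
    out[:, 0] += 10.0
    return out


x = np.zeros((2, 3))
between = run_layer(x, lambda t: t + 1.0, lambda t: t * 2.0, tune_channel_0, "between")
after = run_layer(x, lambda t: t + 1.0, lambda t: t * 2.0, tune_channel_0, "after")
```

With the module between the sub-layers, the second sub-layer also processes the tuned channel; with the module after the second sub-layer, the tuned features are the layer output directly.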

[0188] In an optional embodiment, the image processing model training apparatus 120 further includes a channel selection unit. The channel selection unit is used for:

[0189] Determining, based on the dataset of the image processing task, the feature weight of each channel in the transformation layer of the pre-trained model when applied to the image processing task; and selecting, based on the feature weights of the channels in the transformation layer, a preset number of channels as the target channels of the transformation layer, the preset number being greater than or equal to 1.

[0190] In an optional embodiment, when determining the feature weights of each channel in the transformation layer of the pre-trained model applied to the image processing task based on the dataset of the image processing task, the channel selection unit is further configured to:

[0191] The sample images in the dataset are input into the pre-trained model, and features are extracted from the sample images through the transformation layer of the pre-trained model. According to the insertion position of the channel tuning module in the transformation layer of the image processing model, the intermediate feature map extracted by the transformation layer of the pre-trained model is obtained. The L2 normalized value of the features of each channel in the intermediate feature map extracted by the transformation layer is used as the feature weight of that channel in the transformation layer.
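The weighting and selection steps can be sketched as follows. Two assumptions are made here and should be read as such: the "L2 normalized value" of a channel is interpreted as the L2 norm of that channel's features over all samples and tokens, and the channels with the largest weights are the ones selected. The function names and the `(samples, tokens, channels)` layout are hypothetical.

```python
import numpy as np


def channel_weights(feature_maps):
    """Per-channel L2 norm of intermediate features (assumed interpretation).

    feature_maps: array of shape (samples, tokens, channels)
    returns: array of shape (channels,)
    """
    flat = feature_maps.reshape(-1, feature_maps.shape[-1])
    return np.linalg.norm(flat, axis=0)


def select_target_channels(weights, preset_number):
    """Pick `preset_number` channels with the largest weights (assumed rule)."""
    if preset_number < 1:
        raise ValueError("preset_number must be greater than or equal to 1")
    order = np.argsort(weights)[::-1]
    return np.sort(order[:preset_number])


# Toy feature maps whose channels differ only in scale.
scale = np.array([0.1, 3.0, 1.0, 2.0])
feature_maps = np.ones((2, 2, 4)) * scale
weights = channel_weights(feature_maps)
targets = select_target_channels(weights, preset_number=2)
```

In this toy input, channels 1 and 3 have the largest norms and are chosen as the target channels.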

[0192] In one optional embodiment, the image processing task is an image classification task, and the annotation information of the sample image is the category information of the sample image.

[0193] In determining the feature weights of each channel in the transformation layer of a pre-trained model applied to an image processing task based on the dataset of the image processing task, the channel selection unit is also used for:

[0194] Based on the category information of the sample images in the dataset, the sample images of each category are input into the pre-trained model, and features are extracted from the sample images through the transformation layer of the pre-trained model. According to the insertion position of the channel tuning module in the transformation layer of the image processing model, the intermediate feature map extracted by the transformation layer of the pre-trained model is obtained. The L2 normalized value of the features of each channel in the intermediate feature map extracted by the transformation layer is used as the feature weight of that channel corresponding to that category. For each channel in the transformation layer, the mean of its feature weights over all categories is calculated and used as the feature weight of that channel.
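The per-category variant can be sketched in the same style. As before, the "L2 normalized value" is interpreted as an L2 norm, which is an assumption, and the function name and array layout are hypothetical.

```python
import numpy as np


def classwise_channel_weights(feature_maps, labels):
    """Mean over categories of the per-category channel weights.

    feature_maps: array of shape (samples, tokens, channels)
    labels: category of each sample, length `samples`
    """
    labels = np.asarray(labels)
    per_class = []
    for category in np.unique(labels):
        feats = feature_maps[labels == category]
        flat = feats.reshape(-1, feats.shape[-1])
        per_class.append(np.linalg.norm(flat, axis=0))  # weights for this category
    return np.mean(per_class, axis=0)                   # average across categories


# Two samples of category 0 activate channel 0; one sample of category 1
# activates channel 1.
feature_maps = np.array([[[3.0, 0.0]], [[4.0, 0.0]], [[0.0, 2.0]]])
labels = [0, 0, 1]
weights = classwise_channel_weights(feature_maps, labels)
```

Computing the weights per category before averaging gives each category the same influence on the result, regardless of how many samples it has.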

[0195] In an optional embodiment, after obtaining the trained image processing model, the method further includes: acquiring the image to be processed in the image processing task, inputting the image to be processed into the trained image processing model for image processing, and obtaining the image processing result.

[0196] In an optional embodiment, after obtaining the trained image processing model, the method further includes: outputting the model parameters of the trained image processing model to the end-side device.

[0197] The apparatus provided in this embodiment can be used to execute the image processing model training method based on any of the above embodiments. The specific functions and technical effects that can be achieved will not be described in detail here.

[0198] Figure 13 is a schematic diagram of an image processing apparatus provided in an example embodiment of this application. The apparatus provided in this embodiment is used to perform the above-described image processing method. As shown in Figure 13, the image processing apparatus 130 includes: an image acquisition unit 1301, an image processing unit 1302, and a processing result output unit 1303.

[0199] The image acquisition unit 1301 is used to acquire the image to be processed.

[0200] The image processing unit 1302 is used to input the image into the trained image processing model, extract features from the image through the transformation layer of the image processing model, and transform the features of at least one target channel in the intermediate feature map extracted by the transformation layer through the channel tuning module inserted in the transformation layer; and determine the image processing result of the image based on the features finally output by the transformation layer.

[0201] The processing result output unit 1303 is used to output the image processing results.

[0202] The apparatus provided in this embodiment can be used to execute the image processing method based on any of the above embodiments. The specific functions and technical effects that can be achieved will not be described in detail here.

[0203] Figure 14 is a schematic diagram of the structure of an electronic device provided in an example embodiment of this application. As shown in Figure 14, the electronic device 140 includes a processor 1401 and a memory 1402 communicatively connected to the processor 1401, the memory 1402 storing computer-executable instructions.

[0204] The processor executes the computer-executable instructions stored in the memory to implement the solution provided in any of the above method embodiments. The specific functions and technical effects that can be achieved will not be elaborated here.

[0205] This application also provides a computer-readable storage medium storing computer-executable instructions. When executed by a processor, the computer-executable instructions are used to implement the solution provided in any of the above method embodiments. The specific functions and technical effects to be achieved are not described here.

[0206] This application also provides a computer program product, which includes a computer program stored in a readable storage medium. At least one processor of the electronic device can read the computer program from the readable storage medium. The at least one processor executes the computer program to cause the electronic device to perform the solution provided in any of the above method embodiments. The specific functions and technical effects that can be achieved are not described here.

[0207] Furthermore, in some of the processes described in the above embodiments and accompanying drawings, multiple operations appear in a specific order. However, it should be clearly understood that these operations may not be executed in the order they appear herein, or may be executed in parallel. The sequence numbers are merely used to distinguish different operations, and the sequence number itself does not represent any execution order. Additionally, these processes may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should be noted that the descriptions such as "first," "second," etc., in this document are used to distinguish different messages, devices, modules, etc., and do not represent a sequential order, nor do they limit "first" and "second" to different types. "Multiple" means two or more, unless otherwise explicitly specified.

[0208] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this application are indicated by the following claims.

[0209] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims

1. A method for training an image processing model, characterized in that it comprises: obtaining a dataset for an image processing task and an image processing model to be trained, wherein the image processing model is obtained by inserting a channel tuning module into the transformation layer of a pre-trained model, the transformation layer is a transformer layer, and the channel tuning module is a linear mapping layer; inputting sample images from the dataset into the image processing model, extracting features from the sample images through the transformation layer, and transforming, through the channel tuning module inserted in the transformation layer, the features of at least one target channel in the intermediate feature map extracted by the transformation layer; determining the image processing result based on the features finally output by the transformation layer; and training the parameters of the channel tuning module in the image processing model based on the image processing result and the annotation information of the sample images to obtain a trained image processing model, wherein the trained image processing model is used to process an input image to obtain an image processing result; wherein, after obtaining the dataset for the image processing task, the method further comprises: inputting the sample images in the dataset into the pre-trained model and extracting features from the sample images through the transformation layer of the pre-trained model; obtaining the intermediate feature map extracted by the transformation layer of the pre-trained model according to the insertion position of the channel tuning module in the transformation layer of the image processing model; using the L2 normalized value of the features of each channel in the intermediate feature map extracted by the transformation layer as the feature weight of each channel in the transformation layer; and selecting, based on the feature weights of each channel in the transformation layer, a preset number of channels as the target channels of the transformation layer, wherein the preset number is greater than or equal to 1.

2. The method according to claim 1, characterized in that transforming, through the channel tuning module inserted in the transformation layer, the features of at least one target channel in the intermediate feature map extracted by the transformation layer comprises: extracting, based on the insertion position of the channel tuning module in the transformation layer, the features of the target channel from the intermediate feature map obtained at the insertion position in the transformation layer; linearly mapping the extracted features through the channel tuning module to obtain mapped features, and fusing the mapped features with the extracted features to obtain fused features; replacing the features of the target channel in the intermediate feature map with the fused features to obtain intermediate transformation features; and inputting the intermediate transformation features into the portion of the image processing model after the insertion position for subsequent image processing to determine the image processing result of the sample image.

3. The method according to claim 2, characterized in that fusing the mapped features with the extracted features to obtain the fused features comprises: adding the mapped features to the extracted features to obtain the fused features; or performing, based on preset weight coefficients, a weighted summation of the mapped features and the extracted features to obtain the fused features.

4. The method according to claim 2, characterized in that the transformation layer comprises a first sub-layer and a second sub-layer, with the output of the first sub-layer serving as the input to the second sub-layer, and the channel tuning module is inserted in the transformation layer between the first sub-layer and the second sub-layer; extracting the features of the target channel from the intermediate feature map obtained at the insertion position in the transformation layer comprises: extracting the features of the target channel from the intermediate feature map output by the first sub-layer of the transformation layer.

5. The method according to claim 2, characterized in that the transformation layer comprises a first sub-layer and a second sub-layer, with the output of the first sub-layer serving as the input to the second sub-layer, and the channel tuning module is inserted in the transformation layer after the second sub-layer; extracting the features of the target channel from the intermediate feature map obtained at the insertion position in the transformation layer comprises: extracting the features of the target channel from the intermediate feature map output by the second sub-layer.

6. The method according to claim 1, characterized in that the image processing task is an image classification task, and the annotation information of the sample images is the category information of the sample images; the step of inputting the sample images in the dataset into the pre-trained model and extracting features from the sample images through the transformation layer of the pre-trained model comprises: based on the category information of the sample images in the dataset, inputting the sample images of each category into the pre-trained model and extracting features from the sample images through the transformation layer of the pre-trained model; accordingly, the step of using the L2 normalized value of the features of each channel in the intermediate feature map extracted by the transformation layer as the feature weight of each channel in the transformation layer comprises: using the L2 normalized value of the features of each channel in the intermediate feature map extracted by the transformation layer as the feature weight of each channel in the transformation layer corresponding to that category; and calculating, for each channel in the transformation layer, the mean of the feature weights corresponding to each category and using the mean as the feature weight of that channel in the transformation layer.

7. The method according to any one of claims 1-5, characterized in that, after obtaining the trained image processing model, the method further comprises: acquiring the image to be processed in the image processing task, and inputting the image to be processed into the trained image processing model for image processing to obtain the image processing result; or outputting the model parameters of the trained image processing model to the end-side device.

8. A training method for a remote sensing image processing model, characterized in that it comprises: receiving a remote sensing image dataset sent by a user equipment, wherein the remote sensing image dataset contains multiple remote sensing images and annotation information of the remote sensing images; obtaining an image processing model to be trained, wherein the image processing model is obtained by inserting a channel tuning module into the transformation layer of a pre-trained model, the transformation layer is a transformer layer, and the channel tuning module is a linear mapping layer; inputting the remote sensing image into the image processing model, extracting features from the remote sensing image through the transformation layer, and transforming, through the channel tuning module inserted in the transformation layer, the features of at least one target channel in the intermediate feature map extracted by the transformation layer; determining the image processing result of the remote sensing image based on the features finally output by the transformation layer; training the parameters of the channel tuning module in the image processing model based on the image processing result and the annotation information of the remote sensing image to obtain the trained image processing model; and outputting the model parameters of the trained image processing model to the user equipment; wherein, after receiving the remote sensing image dataset sent by the user equipment, the method further comprises: inputting the remote sensing images in the remote sensing image dataset into the pre-trained model and extracting features from the remote sensing images through the transformation layer of the pre-trained model; obtaining the intermediate feature map extracted by the transformation layer of the pre-trained model according to the insertion position of the channel tuning module in the transformation layer of the image processing model; using the L2 normalized value of the features of each channel in the intermediate feature map extracted by the transformation layer as the feature weight of each channel in the transformation layer; and selecting, based on the feature weights of each channel in the transformation layer, a preset number of channels as the target channels of the transformation layer, wherein the preset number is greater than or equal to 1.

9. An image processing method, characterized in that it comprises: obtaining an image to be processed; inputting the image into a trained image processing model, extracting features from the image through the transformation layer of the image processing model, and transforming, through the channel tuning module inserted in the transformation layer, the features of at least one target channel in the intermediate feature map extracted by the transformation layer, wherein the image processing model is obtained based on the image processing model training method according to any one of claims 1-7; determining the image processing result of the image based on the features finally output by the transformation layer; and outputting the image processing result.

10. An image processing model training system, characterized in that it comprises: an end-side device, used to construct a dataset for an image processing task and send the dataset for the image processing task to a cloud-side device; and the cloud-side device, used to receive the dataset for the image processing task; obtain an image processing model to be trained, wherein the image processing model is obtained by inserting a channel tuning module into the transformation layer of a pre-trained model; input sample images from the dataset into the image processing model, extract features from the sample images through the transformation layer, and transform, through the channel tuning module inserted in the transformation layer, the features of at least one target channel in the intermediate feature map extracted by the transformation layer; determine the image processing result based on the features finally output by the transformation layer; and train the parameters of the channel tuning module in the image processing model based on the image processing result and the annotation information of the sample images to obtain the trained image processing model; wherein the transformation layer is a transformer layer and the channel tuning module is a linear mapping layer; the cloud-side device is also used to send the model parameters of the trained image processing model to the end-side device; the cloud-side device is further configured to input sample images from the dataset into the pre-trained model and extract features from the sample images through the transformation layer of the pre-trained model; obtain the intermediate feature map extracted by the transformation layer of the pre-trained model according to the insertion position of the channel tuning module in the transformation layer of the image processing model; use the L2 normalized value of the features of each channel in the intermediate feature map extracted by the transformation layer as the feature weight of each channel in the transformation layer; and select a preset number of channels as the target channels of the transformation layer according to the feature weights of each channel in the transformation layer, wherein the preset number is greater than or equal to 1.

11. An electronic device, characterized in that it comprises: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the method as described in any one of claims 1-9.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions, which, when executed by a processor, are used to implement the method as described in any one of claims 1-9.