An image processing method, apparatus and device

By acquiring image features during image processing, performing a first convolutional process, and determining a spatial mask, and then combining the spatial mask with a second convolutional process, the problem of redundant computation in deep convolutional neural networks is solved, achieving efficient image processing.

CN116883675B · Active · Publication Date: 2026-06-26 · CHINA MOBILE COMM LTD RES INST +2

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHINA MOBILE COMM LTD RES INST
Filing Date: 2022-03-28
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing deep convolutional neural networks suffer from resource waste and redundant computation during image recognition because they perform convolution operations at the same spatial resolution on different spatial locations of the same image.

Method used

Image features are acquired and passed through a first convolution process, and a spatial mask is determined for each sub-image feature produced during that process. A second convolution process then combines the image features with the spatial masks, and the outputs of the two processes are fused, so that redundant calculations are avoided.

Benefits of technology

Performing local convolution only on the regions selected by the spatial mask, and then recombining the results, improves the efficiency of image processing and avoids wasting resources on redundant computation.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides an image processing method, device and equipment, and relates to the technical field of communication. The method comprises the following steps: acquiring a first image feature; performing first convolution processing according to the first image feature to obtain a second image feature; determining a corresponding spatial mask according to each sub-image feature in the first convolution processing; performing second convolution processing according to the first image feature and the spatial mask to obtain a third image feature; and performing feature fusion according to the second image feature and the third image feature to obtain a target image feature. The scheme of the application solves the problem of resource waste in the image processing process.

Description

Technical Field

[0001] This invention relates to the field of communication technology, and in particular to an image processing method, apparatus, and device.

Background Technology

[0002] Existing image and video content recognition technologies primarily rely on deep convolutional neural networks. However, current deep convolutional neural networks represent different spatial locations of the same image using the same spatial resolution and assign the same convolution operations.

[0003] Thus, there is a lot of redundant computation in the image recognition process, resulting in a waste of resources.

Summary of the Invention

[0004] The purpose of this invention is to provide an image processing method, apparatus, and device to avoid resource waste during the image processing process.

[0005] To achieve the above objectives, embodiments of the present invention provide an image processing method applied to an image processing device, comprising:

[0006] Obtain the first image features;

[0007] The first image features are processed by a first convolution to obtain the second image features;

[0008] Based on the features of each sub-image in the first convolutional process, determine the corresponding spatial mask;

[0009] A second convolution process is performed based on the first image features and the spatial mask to obtain the third image features;

[0010] The target image features are obtained by fusing the second image features and the third image features.

[0011] Optionally, the step of performing a first convolution process based on the first image features to obtain second image features includes:

[0012] The first image feature is downsampled to obtain the fourth image feature;

[0013] The fourth image feature is subjected to multiple convolution operations to obtain the second image feature.

[0014] Optionally, the sub-image features include the fourth image features and the result of each convolution operation in the multiple convolution operations.

[0015] Optionally, determining the corresponding spatial mask based on each sub-image feature in the first convolutional processing includes:

[0016] Adaptive average pooling and convolution operations are performed on the sub-image features to obtain the initial mask;

[0017] The initial mask is binarized to obtain the spatial mask of the sub-image features.

[0018] Optionally, the step of performing a second convolution process based on the first image features and the spatial mask to obtain the third image features includes:

[0019] The channel dimension of the first image feature is divided into G groups, and the spatial dimension is divided into K×K image blocks, resulting in G×K×K image blocks; where G and K are integers greater than or equal to 1.

[0020] Based on the corresponding spatial mask, multiple convolution operations are performed on each channel group for the G×K×K image blocks to obtain G calculation results;

[0021] The G calculation results are concatenated along the channel dimension to obtain the third image feature.

[0022] Optionally, the step of fusing features based on the second image features and the third image features to obtain target image features includes:

[0023] Based on the second image features, determine channel attention;

[0024] The target image features are obtained based on the second image features, the third image features, and the channel attention.

[0025] Optionally, obtaining the target image features based on the second image features, the third image features, and the channel attention includes:

[0026] The target image feature y is calculated through a formula in which $y_{base}$ represents the second image feature, $y_{refine}$ represents the third image feature, and α represents the channel attention.

[0027] Optionally, after fusing features based on the second image features and the third image features to obtain the target image features, the method further includes:

[0028] The target image features are downsampled to obtain image features with the target spatial resolution.

[0029] Optionally, the image processing device includes a first part, a second part, and a third part;

[0030] The first part is used to perform a first convolution process on the first image features, and the first part includes L-1 concatenated first convolution modules; the second part is used to perform region selection on the sub-image features output by the L-1 first convolution modules, and the second part includes L-1 region selection modules, each of which is connected to a corresponding first convolution module; the third part is used to perform a second convolution process on the first image features, and the third part includes L-1 concatenated second convolution modules, each of which is connected to a corresponding region selection module.

[0031] L is an integer greater than or equal to 2.

[0032] Optionally, the method further includes:

[0033] The network is trained based on the cross-entropy loss of classification and pre-set region selection parameters; wherein, the region selection parameters represent the proportion of regions selected by the region selection module.

[0034] Optionally, the network training based on classification-based cross-entropy loss and pre-set region selection parameters includes:

[0035] Network training is conducted through the loss function $L = L_{cls} + \lambda L_{sparse}$;

[0036] where L represents the network loss; $L_{cls}$ represents the cross-entropy loss for classification; λ represents the region selection parameter; N represents the number of region selection modules; B represents the number of training samples; n represents the n-th region selection module; $r_{b,n}$ represents the proportion of elements equal to 1 in the spatial mask obtained by the n-th region selection module for the b-th training sample; and t represents the region selection parameter.
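The loss above can be sketched as follows. This is only an illustration under stated assumptions: the text does not reproduce the definition of $L_{sparse}$, so the mean squared gap between $r_{b,n}$ and t used here is an assumed form, and treating t as the target selection ratio and λ as the weight of the sparsity term is an interpretation. The function names are hypothetical. Only the computation of $r_{b,n}$ (the fraction of ones in each mask) follows directly from the description.

```python
import numpy as np

def selection_ratios(masks):
    """masks: list of N arrays of 0/1, each shaped (B, G, K, K).

    Returns r with shape (B, N), where r[b, n] is the fraction of ones in the
    mask produced by the n-th region selection module for the b-th sample.
    """
    return np.stack([m.reshape(m.shape[0], -1).mean(axis=1) for m in masks], axis=1)

def sparsity_loss(r, t):
    """ASSUMED form of L_sparse: mean squared gap between r[b, n] and t."""
    return float(np.mean((r - t) ** 2))

def total_loss(l_cls, r, t, lam):
    """L = L_cls + lambda * L_sparse, with L_sparse as assumed above."""
    return l_cls + lam * sparsity_loss(r, t)

# Two samples (B = 2), two region selection modules (N = 2).
masks = [np.ones((2, 1, 2, 2)), np.zeros((2, 1, 2, 2))]
r = selection_ratios(masks)          # module 0 selects everything, module 1 nothing
loss = total_loss(l_cls=1.0, r=r, t=0.5, lam=2.0)
print(r, loss)
```

Under this assumed form, the sparsity term is zero only when every module selects exactly the fraction t of regions, which is one plausible way to steer the selected proportion during training.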

[0037] To achieve the above objectives, embodiments of the present invention provide an image processing apparatus, comprising:

[0038] The acquisition module is used to acquire the first image features;

[0039] The first processing module is used to perform a first convolution process based on the first image features to obtain the second image features;

[0040] The second processing module is used to determine the corresponding spatial mask based on each sub-image feature in the first convolutional processing;

[0041] The third processing module is used to perform a second convolution process based on the first image features and the spatial mask to obtain the third image features.

[0042] The fourth processing module is used to perform feature fusion based on the second image features and the third image features to obtain the target image features.

[0043] Optionally, the first processing module includes:

[0044] The first processing submodule is used to downsample the first image features to obtain the fourth image features;

[0045] The second processing submodule is used to perform multiple convolution operations on the fourth image features to obtain the second image features.

[0046] Optionally, the sub-image features include the fourth image features and the result of each convolution operation in the multiple convolution operations.

[0047] Optionally, the second processing module includes:

[0048] The third processing submodule is used to perform adaptive average pooling and convolution operations on the sub-image features to obtain an initial mask;

[0049] The fourth processing submodule is used to binarize the initial mask to obtain the spatial mask of the sub-image features.

[0050] Optionally, the third processing module includes:

[0051] The fifth processing submodule is used to divide the channel dimension of the first image feature into G groups and the spatial dimension into K×K image blocks to obtain G×K×K image blocks; where G and K are integers greater than or equal to 1.

[0052] The sixth processing submodule is used to perform multiple convolution operations on each channel group of the G×K×K image blocks based on the corresponding spatial mask to obtain G calculation results;

[0053] The seventh processing submodule is used to concatenate the G calculation results along the channel dimension to obtain the third image feature.

[0054] Optionally, the fourth processing module includes:

[0055] The eighth processing submodule is used to determine channel attention based on the second image features;

[0056] The ninth processing submodule is used to obtain the target image features based on the second image features, the third image features, and the channel attention.

[0057] Optionally, the ninth processing submodule is further configured to:

[0058] The target image feature y is calculated through a formula in which $y_{base}$ represents the second image feature, $y_{refine}$ represents the third image feature, and α represents the channel attention.

[0059] Optionally, the device further includes:

[0060] The fifth processing module is used to downsample the target image features to obtain image features with the target spatial resolution.

[0061] Optionally, the image processing device includes a first part, a second part, and a third part;

[0062] The first part is used to perform a first convolution process on the first image features, and the first part includes L-1 concatenated first convolution modules; the second part is used to perform region selection on the sub-image features output by the L-1 first convolution modules, and the second part includes L-1 region selection modules, each of which is connected to a corresponding first convolution module; the third part is used to perform a second convolution process on the first image features, and the third part includes L-1 concatenated second convolution modules, each of which is connected to a corresponding region selection module.

[0063] L is an integer greater than or equal to 2.

[0064] Optionally, the device further includes:

[0065] The sixth processing module is used to train the network based on the cross-entropy loss of classification and the pre-set region selection parameters; wherein the region selection parameters represent the proportion of regions selected by the region selection module.

[0066] Optionally, the sixth processing module is further configured to:

[0067] Network training is conducted through the loss function $L = L_{cls} + \lambda L_{sparse}$;

[0068] where L represents the network loss; $L_{cls}$ represents the cross-entropy loss for classification; λ represents the region selection parameter; N represents the number of region selection modules; B represents the number of training samples; n represents the n-th region selection module; $r_{b,n}$ represents the proportion of elements equal to 1 in the spatial mask obtained by the n-th region selection module for the b-th training sample; and t represents the region selection parameter.

[0069] To achieve the above objectives, embodiments of the present invention provide an image processing apparatus, including a processor, the processor being used for:

[0070] Obtain the first image features;

[0071] The first image features are processed by a first convolution to obtain the second image features;

[0072] Based on the features of each sub-image in the first convolutional process, determine the corresponding spatial mask;

[0073] A second convolution process is performed based on the first image features and the spatial mask to obtain the third image features;

[0074] The target image features are obtained by fusing the second image features and the third image features.

[0075] Optionally, the processor is further configured to:

[0076] The first image feature is downsampled to obtain the fourth image feature;

[0077] The fourth image feature is subjected to multiple convolution operations to obtain the second image feature.

[0078] Optionally, the sub-image features include the fourth image features and the result of each convolution operation in the multiple convolution operations.

[0079] Optionally, the processor is further configured to:

[0080] Adaptive average pooling and convolution operations are performed on the sub-image features to obtain the initial mask;

[0081] The initial mask is binarized to obtain the spatial mask of the sub-image features.

[0082] Optionally, the processor is further configured to:

[0083] The channel dimension of the first image feature is divided into G groups, and the spatial dimension is divided into K×K image blocks, resulting in G×K×K image blocks; where G and K are integers greater than or equal to 1.

[0084] Based on the corresponding spatial mask, multiple convolution operations are performed on each channel group for the G×K×K image blocks to obtain G calculation results;

[0085] The G calculation results are concatenated along the channel dimension to obtain the third image feature.

[0086] Optionally, the processor is further configured to:

[0087] Based on the second image features, determine channel attention;

[0088] The target image features are obtained based on the second image features, the third image features, and the channel attention.

[0089] Optionally, the processor is further configured to:

[0090] The target image feature y is calculated through a formula in which $y_{base}$ represents the second image feature, $y_{refine}$ represents the third image feature, and α represents the channel attention.

[0091] Optionally, the processor is further configured to:

[0092] The target image features are downsampled to obtain image features with the target spatial resolution.

[0093] Optionally, the image processing device includes a first part, a second part, and a third part;

[0094] The first part is used to perform a first convolution process on the first image features, and the first part includes L-1 concatenated first convolution modules; the second part is used to perform region selection on the sub-image features output by the L-1 first convolution modules, and the second part includes L-1 region selection modules, each of which is connected to a corresponding first convolution module; the third part is used to perform a second convolution process on the first image features, and the third part includes L-1 concatenated second convolution modules, each of which is connected to a corresponding region selection module.

[0095] L is an integer greater than or equal to 2.

[0096] Optionally, the processor is further configured to:

[0097] The network is trained based on the cross-entropy loss of classification and pre-set region selection parameters; wherein, the region selection parameters represent the proportion of regions selected by the region selection module.

[0098] Optionally, the processor is further configured to:

[0099] Network training is conducted through the loss function $L = L_{cls} + \lambda L_{sparse}$;

[0100] where L represents the network loss; $L_{cls}$ represents the cross-entropy loss for classification; λ represents the region selection parameter; N represents the number of region selection modules; B represents the number of training samples; n represents the n-th region selection module; $r_{b,n}$ represents the proportion of elements equal to 1 in the spatial mask obtained by the n-th region selection module for the b-th training sample; and t represents the region selection parameter.

[0101] To achieve the above objectives, embodiments of the present invention provide an image processing device, including a transceiver, a processor, a memory, and a program or instructions stored in the memory and executable on the processor; when the processor executes the program or instructions, it implements the image processing method described above.

[0102] To achieve the above objectives, embodiments of the present invention provide a readable storage medium having a program or instructions stored thereon, which, when executed by a processor, implement the steps in the image processing method described above.

[0103] The beneficial effects of the above-described technical solution of the present invention are as follows:

[0104] The method of this invention, after obtaining the first image feature, first performs a first convolution process based on the first image feature to obtain a second image feature. Then, based on each sub-image feature in the first convolution process, a spatial mask corresponding to each sub-image feature is determined. The first image feature and the spatial mask are then combined to perform a second convolution process to obtain a third image feature. Finally, feature fusion is performed based on the obtained second and third image features to obtain the desired target image feature for subsequent processing. In this way, two branches are used: one performs efficient feature extraction, and the other uses the spatial mask to perform only local convolution before recombination. This greatly improves the efficiency of image processing and avoids the resource waste caused by redundant calculations.

Attached Figure Description

[0105] Figure 1 is a flowchart of an image processing method according to an embodiment of the present invention;

[0106] Figure 2 shows the module structure of the image processing device according to an embodiment of the present invention;

[0107] Figure 3 is a structural diagram of the image processing device according to an embodiment of the present invention;

[0108] Figure 4 is a structural diagram of an image processing device according to another embodiment of the present invention.

Detailed Implementation

[0109] To make the technical problems, technical solutions and advantages of the present invention clearer, a detailed description will be given below in conjunction with the accompanying drawings and specific embodiments.

[0110] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the invention. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification do not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments.

[0111] In various embodiments of the present invention, it should be understood that the sequence number of each process described below does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0112] In addition, the terms "system" and "network" are often used interchangeably in this article.

[0113] In the embodiments provided in this application, it should be understood that "B corresponding to A" means that B is associated with A, and B can be determined based on A. However, it should also be understood that determining B based on A does not mean determining B solely based on A; B can also be determined based on A and / or other information.

[0114] As shown in Figure 1, an image processing method according to an embodiment of the present invention is applied to an image processing device and comprises:

[0115] Step 101: Obtain the first image features.

[0116] Step 102: Perform a first convolution process based on the first image features to obtain the second image features.

[0117] Step 103: Determine the corresponding spatial mask based on the features of each sub-image in the first convolutional processing;

[0118] Step 104: Perform a second convolution process based on the first image features and the spatial mask to obtain the third image features;

[0119] Step 105: Perform feature fusion based on the second image features and the third image features to obtain the target image features.

[0120] Here, the first image feature is the image feature extracted after preliminary processing of the current image to be processed (such as the image to be identified).

[0121] Thus, following the steps described above, the method of this embodiment of the invention, after obtaining the first image feature, first performs a first convolution process based on the first image feature to obtain a second image feature, and determines the spatial mask corresponding to each sub-image feature in the first convolution process. It then combines the first image feature and the spatial mask to perform a second convolution process to obtain a third image feature. Finally, feature fusion is performed based on the obtained second and third image features to obtain the desired target image feature for subsequent processing. In this way, two branches are used: one performs efficient feature extraction, and the other uses the spatial mask to perform only local convolution before the results are reassembled. This greatly improves the efficiency of image processing and avoids the resource waste caused by redundant calculations.

[0122] Optionally, in this embodiment, the image processing device includes a first part, a second part, and a third part;

[0123] The first part is used to perform a first convolution process on the first image features, and the first part includes L-1 concatenated first convolution modules; the second part is used to perform region selection on the sub-image features output by the L-1 first convolution modules, and the second part includes L-1 region selection modules, each of which is connected to a corresponding first convolution module; the third part is used to perform a second convolution process on the first image features, and the third part includes L-1 concatenated second convolution modules, each of which is connected to a corresponding region selection module.

[0124] L is an integer greater than or equal to 2.

[0125] Thus, after the first image feature is input into the first part, the second image feature is output. After the first image feature is input into the first part, the L-1 first convolutional modules of the first part also output sub-image features. The region selection module corresponding to each first convolutional module in the second part can perform region selection on the corresponding sub-image feature and then input it into the corresponding second convolutional module in the third part. After the first image feature is input into the third part, the third image feature is output.

[0126] The image processing device further includes a fourth part for fusing the second and third image features to obtain the target image features. Therefore, the image processing device can complete one stage of convolution operation using the first, second, third, and fourth parts.

[0127] It should also be understood that image processing includes multiple stages to obtain image features at a predetermined spatial scale. These multiple stages can be configured as the first, second, third, and fourth parts described above to perform steps 101-105.

[0128] Optionally, step 102 includes:

[0129] The first image feature is downsampled to obtain the fourth image feature;

[0130] The fourth image feature is subjected to multiple convolution operations to obtain the second image feature.

[0131] The first convolution process downsamples the first image features to obtain a fourth image feature with a lower spatial resolution. Then, multiple convolution operations are performed based on the obtained fourth image feature to obtain the second image feature. Of course, to ensure that the second image feature recovers the same spatial resolution as the first image feature, the fourth image feature is upsampled after multiple convolution operations.

[0132] Among the L-1 first convolutional modules connected in series, the first module downsamples the first image feature to obtain the fourth image feature; the remaining L-2 modules then perform convolution calculations, and the result is upsampled to obtain the second image feature.

[0133] For example, the first image feature $x \in \mathbb{R}^{C \times H \times W}$ is input into the first of the first convolutional modules, where C represents the number of channels, H represents the height of the feature in the spatial dimension, and W represents the width of the feature in the spatial dimension. This module consists of convolutional modules with a stride of s (s = 2) from a static network (such as ResNet). For example, it can be composed of two 3×3 convolutional modules, or of three cascaded convolutional modules (1×1, 3×3, and 1×1), thereby reducing the first image feature x to a spatial scale of $\frac{H}{s} \times \frac{W}{s}$ and yielding the fourth image feature. The remaining L-2 first convolutional modules are all convolutional modules with a stride of 1. After the fourth image feature passes through the remaining L-2 first convolutional modules, it is upsampled by a factor of s to obtain the output of the low-resolution branch, which is the second image feature $y_{base}$. At this point, the spatial resolution of the second image feature is consistent with that of the input first image feature, namely H×W.
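The low-resolution branch described above can be sketched in NumPy as follows. This is a minimal illustration, not the patented network: average pooling stands in for the stride-s convolutional module, each stride-1 module is reduced to a 1×1 convolution followed by ReLU, and upsampling is nearest-neighbour. All function names and these simplifications are assumptions made for the example.

```python
import numpy as np

def downsample(x, s):
    """Average-pool a (C, H, W) feature by a factor s (stand-in for a stride-s conv)."""
    c, h, w = x.shape
    return x.reshape(c, h // s, s, w // s, s).mean(axis=(2, 4))

def conv1x1(x, weight):
    """1x1 convolution: mixes channels at every position. weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def upsample(x, s):
    """Nearest-neighbour upsampling of a (C, H, W) feature by a factor s."""
    return x.repeat(s, axis=1).repeat(s, axis=2)

def base_branch(x, weights, s=2):
    """Downsample, run the remaining stride-1 modules, then upsample back to H x W.

    Returns y_base and the list of sub-image features (one per module).
    """
    x4 = downsample(x, s)                 # fourth image feature, (C, H/s, W/s)
    sub_features = [x4]
    feat = x4
    for wt in weights:                    # the remaining L-2 stride-1 modules
        feat = np.maximum(conv1x1(feat, wt), 0.0)   # ReLU is an assumption
        sub_features.append(feat)
    return upsample(feat, s), sub_features

rng = np.random.default_rng(0)
C, H, W, L = 8, 16, 16, 4
x = rng.standard_normal((C, H, W))
weights = [rng.standard_normal((C, C)) * 0.1 for _ in range(L - 2)]
y_base, sub_features = base_branch(x, weights)
print(y_base.shape, len(sub_features), sub_features[0].shape)
# prints: (8, 16, 16) 3 (8, 8, 8)
```

With L = 4 there are L-1 = 3 modules and therefore three sub-image features, all at the reduced resolution, while $y_{base}$ is restored to H×W.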

[0134] In addition, in this embodiment, the sub-image features are the outputs of each first convolutional module. Therefore, optionally, in this embodiment, the sub-image features include the fourth image features and the result of each convolution operation in the multiple convolution operations.

[0135] That is, the sub-image features in the first convolutional processing include the fourth image feature output by the first of the first convolutional modules, and the convolution operation results output by the other first convolutional modules. Specifically, the output $x_l$ of the l-th (l = 1, 2, ..., L-1) first convolutional module has a spatial resolution of $\frac{H}{s} \times \frac{W}{s}$.

[0136] Optionally, step 103 includes:

[0137] Adaptive average pooling and convolution operations are performed on the sub-image features to obtain the initial mask;

[0138] The initial mask is binarized to obtain the spatial mask of the sub-image features.

[0139] Here, for each sub-image feature, an initial mask is obtained through adaptive average pooling and convolution operations. This initial mask is then binarized to obtain a spatial mask for each sub-image feature. This spatial mask is used in the second convolution process to determine which locations of the input features need to be processed at high spatial resolution.

[0140] The region selection module in the second part can be composed of adaptive average pooling and two 1×1 convolutional layers. Specifically, after the output $x_l$ of the l-th first convolutional module is input to the corresponding l-th region selection module, the region selection module outputs the initial mask. Here G represents the number of groups into which the channel dimension of the first image feature is divided, and K×K represents the number of image blocks into which the feature is divided in the spatial dimension. After binarization, the spatial mask $M_l \in \{0,1\}^{G \times K \times K}$ is obtained. Each element determines whether the image region at spatial location (i, j) in the g-th (g = 1, 2, ..., G) channel group is selected in the second convolution process, where i = 1, 2, ..., K and j = 1, 2, ..., K; 0 indicates unselected and 1 indicates selected.

[0141] During the training phase, the initial mask output by the region selection module is binarized using the Gumbel Softmax reparameterization technique, which will not be elaborated here.
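A region selection module of the kind described (adaptive average pooling followed by two 1×1 convolutions and binarization) might look like the sketch below. It covers inference only, with binarization done by thresholding the initial mask at zero, which is an assumption; the Gumbel Softmax step used during training is not shown. The ReLU between the two 1×1 convolutions, the divisibility requirement in the pooling helper, and all names are likewise assumptions.

```python
import numpy as np

def adaptive_avg_pool(x, k):
    """Pool a (C, H, W) feature to (C, k, k); assumes H and W are divisible by k."""
    c, h, w = x.shape
    return x.reshape(c, k, h // k, k, w // k).mean(axis=(2, 4))

def region_selection(x_l, w1, w2, k):
    """Return (initial_mask, spatial_mask), both shaped (G, k, k)."""
    pooled = adaptive_avg_pool(x_l, k)                               # (C, k, k)
    hidden = np.maximum(np.einsum("oc,chw->ohw", w1, pooled), 0.0)   # first 1x1 conv
    initial_mask = np.einsum("oc,chw->ohw", w2, hidden)              # second 1x1 conv
    spatial_mask = (initial_mask > 0).astype(np.uint8)               # 1 = selected
    return initial_mask, spatial_mask

rng = np.random.default_rng(0)
C, G, K = 8, 2, 4
x_l = rng.standard_normal((C, 8, 8))      # a sub-image feature
w1 = rng.standard_normal((C, C))          # weights of the first 1x1 conv
w2 = rng.standard_normal((G, C))          # weights of the second 1x1 conv
initial_mask, M_l = region_selection(x_l, w1, w2, K)
print(M_l.shape, int(M_l.sum()), "of", M_l.size, "regions selected")
```

The second 1×1 convolution maps the C channels to G outputs, so the mask carries one decision per channel group and per K×K location.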

[0142] Optionally, in this embodiment, step 104 includes:

[0143] The channel dimension of the first image feature is divided into G groups, and the spatial dimension is divided into K×K image blocks, resulting in G×K×K image blocks; where G and K are integers greater than or equal to 1.

[0144] Based on the corresponding spatial mask, multiple convolution operations are performed on each channel group for the G×K×K image blocks to obtain G calculation results;

[0145] The G calculation results are concatenated along the channel dimension to obtain the third image feature.

[0146] In the second convolution process, the first image feature is divided into G groups in the channel dimension and K×K image blocks in the spatial dimension, resulting in G×K×K image blocks, each of shape $\frac{C}{G} \times \frac{H}{K} \times \frac{W}{K}$. The image blocks are then processed by the L-1 second convolutional modules, using the corresponding spatial masks, to obtain the outputs of these image blocks on the G channel groups. Finally, the third image feature is obtained by concatenating the outputs along the channel dimension.
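The partition into G×K×K image blocks, and the reassembly along the channel and spatial dimensions, can be expressed with reshapes and transposes. The sketch below assumes C is divisible by G and H and W are divisible by K; the function names are hypothetical.

```python
import numpy as np

def split_blocks(x, g, k):
    """Split a (C, H, W) feature into blocks shaped (g, k, k, C/g, H/k, W/k)."""
    c, h, w = x.shape
    blocks = x.reshape(g, c // g, k, h // k, k, w // k)
    return blocks.transpose(0, 2, 4, 1, 3, 5)

def merge_blocks(blocks):
    """Inverse of split_blocks: reassemble the blocks into a (C, H, W) feature."""
    g, k, _, cg, hk, wk = blocks.shape
    return blocks.transpose(0, 3, 1, 4, 2, 5).reshape(g * cg, k * hk, k * wk)

rng = np.random.default_rng(0)
C, H, W, G, K = 8, 16, 16, 2, 4
x = rng.standard_normal((C, H, W))
blocks = split_blocks(x, G, K)
print(blocks.shape)                               # (G, K, K, C/G, H/K, W/K)
print(np.array_equal(merge_blocks(blocks), x))    # the round trip restores x
```

Here blocks[g, i, j] is the image block of channel group g at spatial location (i, j), which is the unit that one element of the spatial mask switches on or off.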

[0147] Take the l-th second convolutional module as an example. In this module, for the g-th channel group, the positions of the elements equal to 1 in the spatial mask $M_l$ (obtained by binarizing the output of the l-th region selection module) determine the input of the convolution operation. Assume that $M_l$ contains $B_{l,g}$ ($B_{l,g} \le K \times K$) elements equal to 1; the input of the convolution operation is then $x_{l,g}$. The convolution kernel U of the convolution operation is likewise divided into G groups along the channel dimension. The weight U(g) of the g-th group is extracted and convolved with $x_{l,g}$, and the result is added to the skip connection of the input to obtain the g-th input $x_{l+1,g}$ of the (l+1)-th second convolutional module. Letting g = 1, 2, ..., G in turn yields all G groups of outputs (i.e., calculation results). Concatenating these G calculation results along the channel dimension gives the third image feature $y_{refine}$. $y_{refine}$ has the same shape as $y_{base}$; that is, the spatial resolution of the third image feature is the same as that of the second image feature.
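One second convolutional module, applied only to the selected image blocks, could be sketched as below. Several simplifications are assumed for illustration: the kernel U(g) is a 1×1 convolution rather than a larger kernel, blocks that are not selected pass through unchanged via the skip connection, and the features are already arranged as blocks shaped (G, K, K, C/G, h, w). The function name is hypothetical.

```python
import numpy as np

def masked_group_conv(blocks, mask, kernels):
    """Apply a per-group 1x1 convolution to the selected blocks only.

    blocks:  (G, K, K, C/G, h, w) image blocks
    mask:    (G, K, K) spatial mask of 0/1
    kernels: (G, C/G, C/G) one 1x1 kernel U(g) per channel group
    """
    out = blocks.copy()                       # skip connection: start from the input
    for g in range(blocks.shape[0]):
        sel = mask[g].astype(bool)            # (K, K) positions whose mask value is 1
        x_lg = blocks[g][sel]                 # (B_lg, C/G, h, w): the selected blocks
        if x_lg.shape[0] == 0:
            continue                          # nothing selected in this group
        conv = np.einsum("oc,bchw->bohw", kernels[g], x_lg)
        out[g][sel] = x_lg + conv             # convolution result plus skip connection
    return out

rng = np.random.default_rng(0)
G, K, cg, h, w = 2, 2, 3, 4, 4
blocks = rng.standard_normal((G, K, K, cg, h, w))
mask = np.zeros((G, K, K), dtype=np.uint8)
mask[0, 0, 1] = 1                             # select one block in group 0
mask[1, 1, 0] = 1                             # select one block in group 1
kernels = rng.standard_normal((G, cg, cg))
out = masked_group_conv(blocks, mask, kernels)
changed = np.any(out != blocks, axis=(3, 4, 5))
print(changed.astype(int))                    # only the selected blocks differ
```

Because the convolution runs on a gathered batch of $B_{l,g}$ blocks instead of the full feature map, its cost scales with the number of selected regions rather than with K×K.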

[0148] Optionally, step 105 includes:

[0149] Based on the second image features, determine channel attention;

[0150] The target image features are obtained based on the second image features, the third image features, and the channel attention.

[0151] In other words, channel attention must first be determined based on the second image features, and then channel attention is used to fuse the second and third image features to obtain the target image features.

[0152] The channel attention α ∈ [0,1]^C (where C is the number of channels of the second image feature) is generated by inputting the second image feature into a lightweight module. The lightweight module consists of a global average pooling layer, a multilayer perceptron (MLP), and a sigmoid activation function.
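A minimal sketch of such a lightweight module is given below. The MLP depth (two layers), the ReLU hidden activation, and all weight values are assumptions made for illustration; the source only names global average pooling, an MLP, and a sigmoid.

```python
import math


def channel_attention(y_base, W1, b1, W2, b2):
    """Global average pooling -> two-layer MLP -> sigmoid.

    y_base: C x H x W feature (nested lists).
    W1, b1: hidden layer (hidden x C weights, hidden biases).
    W2, b2: output layer (C x hidden weights, C biases).
    The two-layer MLP with ReLU is an assumption, not from the source.
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in y_base]
    # Hidden layer with ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled)) + b)
              for row, b in zip(W1, b1)]
    # Output layer followed by sigmoid, so every entry lies in [0, 1].
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


y_base = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 4.0]]]  # C = 2
W1, b1 = [[1.0, 1.0]], [0.0]            # hidden size 1
W2, b2 = [[1.0], [-1.0]], [0.0, 0.0]
alpha = channel_attention(y_base, W1, b1, W2, b2)
```

The result has one attention value per channel of the second image feature, matching α ∈ [0,1]^C.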

[0153] Optionally, obtaining the target image features based on the second image features, the third image features, and the channel attention includes:

[0154] The target image feature y is calculated through a formula (not reproduced in this text), where y_base represents the second image feature, y_refine represents the third image feature, and α represents the channel attention.

[0155] Thus, by using the generated channel attention α, the second and third image features are substituted into the above formula to obtain the target image features y.

[0156] Additionally, considering the processing in the next stage, optionally, in this embodiment, step 105 may be followed by:

[0157] The target image features are downsampled to obtain image features with the target spatial resolution.

[0158] Here, the target spatial resolution is determined by the processing requirements of the next stage.

[0159] Specifically, the target image features are input into a convolutional module to achieve feature map downsampling and output image features with target spatial resolution.
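As an illustration of downsampling with a convolutional module, the sketch below applies a strided convolution without padding. The kernel size, the stride of 2, and the averaging weights are assumptions chosen for the example; the source does not specify them.

```python
def conv2d_strided(x, weight, stride):
    """Strided 2-D convolution without padding.

    x:      C_in x H x W feature (nested lists).
    weight: C_out x C_in x k x k kernel.
    Returns a C_out x H_out x W_out feature, with
    H_out = (H - k) // stride + 1 and likewise for W_out.
    """
    c_in, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(weight[0][0])
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = []
    for w_o in weight:                      # one output channel at a time
        plane = []
        for i in range(out_h):
            row = []
            for j in range(out_w):
                acc = 0.0
                for c in range(c_in):
                    for di in range(k):
                        for dj in range(k):
                            acc += (w_o[c][di][dj]
                                    * x[c][i * stride + di][j * stride + dj])
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# One channel, 4 x 4 input with values 0..15; a 2 x 2 averaging kernel
# with stride 2 halves the spatial resolution.
x = [[[4 * h + w for w in range(4)] for h in range(4)]]
weight = [[[[0.25, 0.25], [0.25, 0.25]]]]
down = conv2d_strided(x, weight, stride=2)
```

Here the 4×4 input becomes a 2×2 output; in practice the stride and kernel would be chosen so that the output matches the target spatial resolution required by the next stage.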

[0160] It should be noted that image processing in a convolutional neural network comprises multiple stages, and each stage may use the method of this invention. Alternatively, the above process may be applied only to the first several stages, with the last stage remaining the same as in the original network.

[0161] It should also be understood that the method of this embodiment of the invention can be applied to training first, and then used after training. Therefore, optionally, this embodiment further includes:

[0162] The network is trained based on the cross-entropy loss of classification and pre-set region selection parameters; wherein, the region selection parameters represent the proportion of regions selected by the region selection module.

[0163] Optionally, the network training based on classification-based cross-entropy loss and pre-set region selection parameters includes:

[0164] Network training is conducted through the loss function L = L_cls + λ·L_sparse;

[0165] where L represents the network loss; L_cls represents the cross-entropy loss for classification; λ represents the region selection parameter; N represents the number of region selection modules; B represents the number of training samples; n denotes the n-th region selection module; b denotes the b-th training sample; r_{b,n} represents the proportion of elements equal to 1 in the spatial mask obtained by the n-th region selection module for the b-th training sample; and t represents the region selection parameter.

[0166] Here, λ = 0.1 to 0.5, t = 0.3 to 0.7.

[0167] Assume that the first three stages of the convolutional neural network employ the method of this embodiment, and that these three stages have a total of N region selection modules. During training on B training samples, for the b-th training sample, the proportion of elements equal to 1 in the spatial mask generated by the n-th region selection module is r_{b,n}. This means that the n-th second convolutional module in the second convolutional processing performs convolution on a proportion r_{b,n} of the regions of the image feature. During training, the network is trained according to the loss function L = L_cls + λ·L_sparse.
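The combined loss can be sketched as below. The form of L_sparse is not reproduced in this text, so the sketch assumes, purely for illustration, the mean squared deviation of r_{b,n} from t over all B samples and N region selection modules; the function names and the sample values are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def total_loss(batch_logits, labels, ratios, lam, t):
    """L = L_cls + lam * L_sparse.

    ratios[b][n] is r_{b,n}: the share of 1-elements in the mask of the
    n-th region selection module for the b-th training sample.
    ASSUMPTION: L_sparse is the mean of (r_{b,n} - t)^2 over b and n;
    the source does not give its exact form.
    """
    B = len(labels)
    N = len(ratios[0])
    l_cls = sum(cross_entropy(z, y)
                for z, y in zip(batch_logits, labels)) / B
    l_sparse = sum((r - t) ** 2 for row in ratios for r in row) / (N * B)
    return l_cls + lam * l_sparse


logits, labels = [[0.0, 0.0]], [0]      # B = 1, two classes
# Masks exactly at the target ratio t add no sparsity penalty ...
loss_on_target = total_loss(logits, labels, [[0.5, 0.5]], lam=0.4, t=0.5)
# ... while masks that select every region are penalised.
loss_off_target = total_loss(logits, labels, [[1.0, 1.0]], lam=0.4, t=0.5)
```

The values λ = 0.4 and t = 0.5 fall inside the ranges stated above (λ from 0.1 to 0.5, t from 0.3 to 0.7). Under the assumed form, the penalty pushes each module toward selecting a proportion t of the regions.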

[0168] The method of this invention involves a first convolutional process to extract features with low spatial resolution. Then, based on the outputs of each convolutional module in the first convolutional process, a spatial mask is generated using a region selection module. Further, the features are deformed and sampled based on the generated spatial mask, and a second convolutional process is used to perform convolution operations on the sampled feature regions. Finally, the outputs of the two branches are merged and fed into the next stage after passing through the last convolutional module.

[0169] In summary, after obtaining the first image feature, a first convolution process is performed based on the first image feature to obtain the second image feature. Then, based on each sub-image feature in the first convolution process, a spatial mask corresponding to each sub-image feature is determined. The first image feature and the spatial mask are then combined to perform a second convolution process to obtain the third image feature. Finally, the obtained second and third image features are fused to obtain the desired target image feature for subsequent processing. In this way, by using two branches, one performs efficient feature extraction and the other uses the spatial mask to perform only local convolution before recombination, the efficiency of image processing is greatly improved, and the waste of resources caused by redundant calculations is avoided.

[0170] As shown in Figure 2, an image processing apparatus according to an embodiment of the present invention includes:

[0171] The acquisition module 210 is used to acquire the first image features;

[0172] The first processing module 220 is used to perform a first convolution process based on the first image features to obtain the second image features;

[0173] The second processing module 230 is used to determine the corresponding spatial mask based on each sub-image feature in the first convolutional processing;

[0174] The third processing module 240 is used to perform a second convolution process based on the first image features and the spatial mask to obtain the third image features;

[0175] The fourth processing module 250 is used to perform feature fusion based on the second image features and the third image features to obtain target image features.

[0176] Optionally, the first processing module includes:

[0177] The first processing submodule is used to downsample the first image features to obtain the fourth image features;

[0178] The second processing submodule is used to perform multiple convolution operations on the fourth image features to obtain the second image features.

[0179] Optionally, the sub-image features include the fourth image features and the result of each convolution operation in the multiple convolution operations.

[0180] Optionally, the second processing module includes:

[0181] The third processing submodule is used to perform adaptive average pooling and convolution operations on the sub-image features to obtain an initial mask;

[0182] The fourth processing submodule is used to binarize the initial mask to obtain the spatial mask of the sub-image features.
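The two submodules above can be sketched together as a region selection step. The sketch assumes a 1×1 convolution that produces one logit plane per channel group, binarization by thresholding the logit at 0, and spatial sizes divisible by K; none of these details are given in the source, and the names `adaptive_avg_pool` and `region_select` are hypothetical.

```python
def adaptive_avg_pool(x, K):
    """Average-pool a C x H x W feature down to C x K x K.

    Assumption: H and W are divisible by K.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    bh, bw = H // K, W // K
    return [[[sum(x[c][h][w]
                  for h in range(i * bh, (i + 1) * bh)
                  for w in range(j * bw, (j + 1) * bw)) / (bh * bw)
              for j in range(K)]
             for i in range(K)]
            for c in range(C)]


def region_select(x, weight, bias, K):
    """Pooling + 1x1 convolution gives the initial mask; thresholding
    at 0 binarizes it (the threshold rule is an assumption).

    weight: G x C, bias: G. Returns a G x K x K binary spatial mask.
    """
    pooled = adaptive_avg_pool(x, K)
    C = len(pooled)
    mask = []
    for w_g, b_g in zip(weight, bias):
        plane = [[1 if sum(w_g[c] * pooled[c][i][j]
                           for c in range(C)) + b_g > 0 else 0
                  for j in range(K)]
                 for i in range(K)]
        mask.append(plane)
    return mask


# One channel, 4 x 4: only the top-left 2 x 2 block carries signal.
x = [[[4.0, 4.0, 0.0, 0.0],
      [4.0, 4.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0]]]
mask = region_select(x, weight=[[1.0]], bias=[-1.0], K=2)
# Proportion of 1-elements in the mask.
ratio = sum(v for plane in mask for row in plane for v in row) / (1 * 2 * 2)
```

In this toy case one of the four blocks is selected, so the proportion of 1-elements is 0.25. A hard threshold is not differentiable, so a trainable network would need some relaxation during training; the source text shown here does not say which one is used.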

[0183] Optionally, the third processing module includes:

[0184] The fifth processing submodule is used to divide the channel dimension of the first image feature into G groups and the spatial dimension into K×K image blocks to obtain G×K×K image blocks; where G and K are integers greater than or equal to 1.

[0185] The sixth processing submodule is used to perform multiple convolution operations on each channel group of the G×K×K image blocks based on the corresponding spatial mask to obtain G calculation results;

[0186] The seventh processing submodule is used to concatenate the G calculation results along the channel dimension to obtain the third image feature.

[0187] Optionally, the fourth processing module includes:

[0188] The eighth processing submodule is used to determine channel attention based on the second image features;

[0189] The ninth processing submodule is used to obtain the target image features based on the second image features, the third image features, and the channel attention.

[0190] Optionally, the ninth processing submodule is further configured to:

[0191] Calculate the target image feature y through a formula (not reproduced in this text), where y_base represents the second image feature, y_refine represents the third image feature, and α represents the channel attention.

[0192] Optionally, the device further includes:

[0193] The fifth processing module is used to downsample the target image features to obtain image features with the target spatial resolution.

[0194] Optionally, the image processing device includes a first part, a second part, and a third part;

[0195] The first part is used to perform a first convolution process on the first image features, and the first part includes L-1 concatenated first convolution modules; the second part is used to perform region selection on the sub-image features output by the L-1 first convolution modules, and the second part includes L-1 region selection modules, each of which is connected to a corresponding first convolution module; the third part is used to perform a second convolution process on the first image features, and the third part includes L-1 concatenated second convolution modules, each of which is connected to a corresponding region selection module.

[0196] L is an integer greater than or equal to 2.

[0197] Optionally, the device further includes:

[0198] The sixth processing module is used to train the network based on the cross-entropy loss of classification and the pre-set region selection parameters; wherein the region selection parameters represent the proportion of regions selected by the region selection module.

[0199] Optionally, the sixth processing module is further configured to:

[0200] Conduct network training through the loss function L = L_cls + λ·L_sparse;

[0201] where L represents the network loss; L_cls represents the cross-entropy loss for classification; λ represents the region selection parameter; N represents the number of region selection modules; B represents the number of training samples; n denotes the n-th region selection module; r_{b,n} represents the proportion of elements equal to 1 in the spatial mask obtained by the n-th region selection module for the b-th training sample; and t represents the region selection parameter.

[0202] After acquiring the first image feature, the device first performs a first convolution process based on the first image feature to obtain the second image feature. Then, based on each sub-image feature in the first convolution process, it determines the spatial mask corresponding to each sub-image feature. The first image feature and the spatial mask are then combined to perform a second convolution process to obtain the third image feature. Finally, the second and third image features are fused to obtain the desired target image feature for subsequent processing. In this way, by using two branches, one performs efficient feature extraction and the other uses a spatial mask to perform only local convolution before recombination, the efficiency of image processing is greatly improved, and resource waste caused by redundant calculations is avoided.

[0203] It should be noted that this device is an apparatus that applies the above-described method, and the implementation of the above-described method embodiments is applicable to this device and can achieve the same technical effect.

[0204] As shown in Figure 3, an image processing device 300 according to an embodiment of the present invention includes a processor 310.

[0205] The processor 310 is used for:

[0206] Obtain the first image features;

[0207] The first image features are processed by a first convolution to obtain the second image features;

[0208] Based on the features of each sub-image in the first convolutional process, determine the corresponding spatial mask;

[0209] A second convolution process is performed based on the first image features and the spatial mask to obtain the third image features;

[0210] The target image features are obtained by fusing the second image features and the third image features.

[0211] The image processing device 300 also includes a transceiver 320 for receiving and sending data under the control of the processor 310.

[0212] Optionally, the processor is further configured to:

[0213] The first image feature is downsampled to obtain the fourth image feature;

[0214] The fourth image feature is subjected to multiple convolution operations to obtain the second image feature.

[0215] Optionally, the sub-image features include the fourth image features and the result of each convolution operation in the multiple convolution operations.

[0216] Optionally, the processor is further configured to:

[0217] Adaptive average pooling and convolution operations are performed on the sub-image features to obtain the initial mask;

[0218] The initial mask is binarized to obtain the spatial mask of the sub-image features.

[0219] Optionally, the processor is further configured to:

[0220] The channel dimension of the first image feature is divided into G groups, and the spatial dimension is divided into K×K image blocks, resulting in G×K×K image blocks; where G and K are integers greater than or equal to 1.

[0221] Based on the corresponding spatial mask, multiple convolution operations are performed on each channel group for the G×K×K image blocks to obtain G calculation results;

[0222] The G calculation results are concatenated along the channel dimension to obtain the third image feature.

[0223] Optionally, the processor is further configured to:

[0224] Based on the second image features, determine channel attention;

[0225] The target image features are obtained based on the second image features, the third image features, and the channel attention.

[0226] Optionally, the processor is further configured to:

[0227] Calculate the target image feature y through a formula (not reproduced in this text), where y_base represents the second image feature, y_refine represents the third image feature, and α represents the channel attention.

[0228] Optionally, the processor is further configured to:

[0229] The target image features are downsampled to obtain image features with the target spatial resolution.

[0230] Optionally, the image processing device includes a first part, a second part, and a third part;

[0231] The first part is used to perform a first convolution process on the first image features, and the first part includes L-1 concatenated first convolution modules; the second part is used to perform region selection on the sub-image features output by the L-1 first convolution modules, and the second part includes L-1 region selection modules, each of which is connected to a corresponding first convolution module; the third part is used to perform a second convolution process on the first image features, and the third part includes L-1 concatenated second convolution modules, each of which is connected to a corresponding region selection module.

[0232] L is an integer greater than or equal to 2.

[0233] Optionally, the processor is further configured to:

[0234] The network is trained based on the cross-entropy loss of classification and pre-set region selection parameters; wherein, the region selection parameters represent the proportion of regions selected by the region selection module.

[0235] Optionally, the processor is further configured to:

[0236] Conduct network training through the loss function L = L_cls + λ·L_sparse;

[0237] where L represents the network loss; L_cls represents the cross-entropy loss for classification; λ represents the region selection parameter; N represents the number of region selection modules; B represents the number of training samples; n denotes the n-th region selection module; r_{b,n} represents the proportion of elements equal to 1 in the spatial mask obtained by the n-th region selection module for the b-th training sample; and t represents the region selection parameter.

[0238] After acquiring the first image feature, the image processing device first performs a first convolution process based on the first image feature to obtain the second image feature. Then, based on each sub-image feature in the first convolution process, it determines the spatial mask corresponding to each sub-image feature. The first image feature and the spatial mask are then combined to perform a second convolution process to obtain the third image feature. Finally, the second and third image features are fused to obtain the desired target image feature for subsequent processing. In this way, by using two branches, one performs efficient feature extraction and the other uses the spatial mask to perform only local convolution before recombination, the efficiency of image processing is greatly improved, and resource waste caused by redundant calculations is avoided.

[0239] In another embodiment, as shown in Figure 4, the image processing device of the present invention includes a transceiver 410, a processor 400, a memory 420, and a program or instructions stored in the memory 420 and executable on the processor 400; when the processor 400 executes the program or instructions, the above-described image processing method is implemented.

[0240] The transceiver 410 is used to receive and send data under the control of the processor 400.

[0241] In Figure 4, the bus architecture may include any number of interconnected buses and bridges, which link together various circuits including one or more processors, represented by the processor 400, and memory, represented by the memory 420. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The transceiver 410 may consist of multiple elements, including a transmitter and a receiver, providing a unit for communicating with various other devices over a transmission medium. The processor 400 is responsible for managing the bus architecture and general processing, and the memory 420 may store data used by the processor 400 during operation.

[0242] An embodiment of the present invention provides a readable storage medium storing a program or instructions. When the program or instructions are executed by a processor, they implement the steps in the image processing method described above and achieve the same technical effect. To avoid repetition, further details are omitted here.

[0243] The processor is the processor in the image processing device described in the above embodiments. The readable storage medium includes computer-readable storage media, such as computer read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk.

[0244] It should be further noted that many of the functional components described in this specification are referred to as modules in order to emphasize the independence of their implementation.

[0245] In embodiments of the invention, a module can be implemented in software so that it can be executed by various types of processors. For example, an identified executable code module may include one or more physical or logical blocks of computer instructions, which may be constructed as objects, procedures, or functions. Nevertheless, the executable code of the identified module does not need to be physically located together, but may include different instructions stored in different locations, which, when logically combined, constitute the module and achieve the module's intended purpose.

[0246] In practice, an executable code module can be a single instruction or many instructions, and can even be distributed across multiple different code segments, different programs, and across multiple memory devices. Similarly, operational data can be identified within the module and can be implemented in any suitable form and organized within any suitable type of data structure. This operational data can be collected as a single dataset or distributed across different locations (including different storage devices), and can exist, at least in part, solely as electronic signals within the system or network.

[0247] Where a module can be implemented in software, those skilled in the art can, given the current level of hardware technology and leaving cost aside, also build corresponding hardware circuits to achieve the same functions. These hardware circuits include conventional very-large-scale integration (VLSI) circuits or gate arrays, as well as existing semiconductors such as logic chips and transistors, or other discrete components. Modules can also be implemented using programmable hardware devices, such as field-programmable gate arrays, programmable array logic, and programmable logic devices.

[0248] The exemplary embodiments are described above with reference to the accompanying drawings. Many different forms and embodiments are feasible without departing from the spirit and teachings of the invention. Therefore, the invention should not be construed as being limited to the exemplary embodiments set forth herein. Rather, these exemplary embodiments are provided so that the disclosure is complete and conveys the scope of the invention to those skilled in the art. In the drawings, component dimensions and relative dimensions may be exaggerated for clarity. The terminology used herein is for the purpose of describing particular exemplary embodiments only and is not intended to be limiting. As used herein, unless clearly indicated otherwise, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well. It will be further understood that the terms “comprising” and/or “including”, when used in this specification, indicate the presence of the stated features, integers, steps, operations, components, and/or elements, but do not exclude the presence or addition of one or more other features, integers, steps, operations, components, elements, and/or groups thereof. Unless otherwise indicated, a stated range of values includes the upper and lower limits of the range and any subranges in between.

[0249] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims

1. An image processing method, characterized in that it is applied to an image processing device and comprises: acquiring a first image feature; performing a first convolution process based on the first image feature to obtain a second image feature; determining a corresponding spatial mask based on each sub-image feature in the first convolution process; performing a second convolution process based on the first image feature and the spatial mask to obtain a third image feature; and performing feature fusion based on the second image feature and the third image feature to obtain a target image feature; the method further comprises: conducting network training through the loss function L = L_cls + λ·L_sparse; where L represents the network loss; L_cls represents the cross-entropy loss for classification; λ represents the first region selection parameter; N represents the number of region selection modules; B represents the number of training samples; n denotes the n-th region selection module; r_{b,n} represents the proportion of elements equal to 1 in the spatial mask obtained by the n-th region selection module for the b-th training sample; and t represents the second region selection parameter; the first region selection parameter and the second region selection parameter represent the proportion of the region selected by the region selection module; and the region selection module is used to select regions of the sub-image features.

2. The method according to claim 1, characterized in that, The step of performing a first convolution process based on the first image features to obtain second image features includes: The first image feature is downsampled to obtain the fourth image feature; The fourth image feature is subjected to multiple convolution operations to obtain the second image feature.

3. The method according to claim 2, characterized in that, The sub-image features include the fourth image features and the result of each convolution operation in the multiple convolution operations.

4. The method according to claim 1 or 3, characterized in that, The step of determining the corresponding spatial mask based on each sub-image feature in the first convolutional processing includes: Adaptive average pooling and convolution operations are performed on the sub-image features to obtain the initial mask; The initial mask is binarized to obtain the spatial mask of the sub-image features.

5. The method according to claim 1, characterized in that, The step of performing a second convolution process based on the first image features and the spatial mask to obtain the third image features includes: The channel dimension of the first image feature is divided into G groups, and the spatial dimension is divided into K×K image blocks, resulting in G×K×K image blocks; where G and K are integers greater than or equal to 1. Based on the corresponding spatial mask, multiple convolution operations are performed on each channel group for the G×K×K image blocks to obtain G calculation results; The G calculation results are concatenated along the channel dimension to obtain the third image feature.

6. The method according to claim 1, characterized in that, The step of fusing features based on the second image features and the third image features to obtain target image features includes: Based on the second image features, determine channel attention; The target image features are obtained based on the second image features, the third image features, and the channel attention.

7. The method according to claim 6, characterized in that obtaining the target image feature based on the second image feature, the third image feature, and the channel attention includes: calculating the target image feature y through a formula (not reproduced in this text), where y_base represents the second image feature, y_refine represents the third image feature, and α represents the channel attention.

8. The method according to claim 1, characterized in that, After fusing features based on the second image features and the third image features to obtain the target image features, the method further includes: The target image features are downsampled to obtain image features with the target spatial resolution.

9. The method according to claim 1, characterized in that, The image processing device includes a first part, a second part, and a third part; The first part is used to perform a first convolution process on the first image features, and the first part includes L-1 concatenated first convolution modules; the second part is used to perform region selection on the sub-image features output by the L-1 first convolution modules, and the second part includes L-1 region selection modules, each of which is connected to a corresponding first convolution module; the third part is used to perform a second convolution process on the first image features, and the third part includes L-1 concatenated second convolution modules, each of which is connected to a corresponding region selection module. L is an integer greater than or equal to 2.

10. An image processing apparatus, characterized in that it includes: an acquisition module, used to acquire a first image feature; a first processing module, used to perform a first convolution process based on the first image feature to obtain a second image feature; a second processing module, used to determine a corresponding spatial mask based on each sub-image feature in the first convolution process; a third processing module, used to perform a second convolution process based on the first image feature and the spatial mask to obtain a third image feature; and a fourth processing module, used to perform feature fusion based on the second image feature and the third image feature to obtain a target image feature; the apparatus further includes: a sixth processing module, used to conduct network training through the loss function L = L_cls + λ·L_sparse; where L represents the network loss; L_cls represents the cross-entropy loss for classification; λ represents the first region selection parameter; N represents the number of region selection modules; B represents the number of training samples; n denotes the n-th region selection module; r_{b,n} represents the proportion of elements equal to 1 in the spatial mask obtained by the n-th region selection module for the b-th training sample; and t represents the second region selection parameter; the first region selection parameter and the second region selection parameter represent the proportion of the region selected by the region selection module; and the region selection module is used to select regions of the sub-image features.

11. An image processing device, characterized in that it includes a processor, the processor being used to: acquire a first image feature; perform a first convolution process based on the first image feature to obtain a second image feature; determine a corresponding spatial mask based on each sub-image feature in the first convolution process; perform a second convolution process based on the first image feature and the spatial mask to obtain a third image feature; and perform feature fusion based on the second image feature and the third image feature to obtain a target image feature; the processor is further used to conduct network training through the loss function L = L_cls + λ·L_sparse; where L represents the network loss; L_cls represents the cross-entropy loss for classification; λ represents the first region selection parameter; N represents the number of region selection modules; B represents the number of training samples; n denotes the n-th region selection module; r_{b,n} represents the proportion of elements equal to 1 in the spatial mask obtained by the n-th region selection module for the b-th training sample; and t represents the second region selection parameter; the first region selection parameter and the second region selection parameter represent the proportion of the region selected by the region selection module; and the region selection module is used to select regions of the sub-image features.

12. An image processing apparatus, comprising: A transceiver, a processor, a memory, and a program or instructions stored in the memory and executable on the processor; characterized in that, when the processor executes the program or instructions, it implements the image processing method as described in any one of claims 1-9.

13. A readable storage medium having a program or instructions stored thereon, characterized in that, When the program or instructions are executed by the processor, they implement the steps of the image processing method as described in any one of claims 1-9.