Salient object detection method based on a channel attention module

The proposed salient object detection method, based on a channel attention module, addresses the low resolution of deep feature maps in deep learning, improving the accuracy and edge detail of salient object detection and thereby the detection performance in complex scenes.

CN118072043B · Active · Publication Date: 2026-06-26 · SHANGHAI ULUCU ELECTRON TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI ULUCU ELECTRON TECH CO LTD
Filing Date
2024-03-04
Publication Date
2026-06-26

Abstract

The application discloses a salient object detection method based on a channel attention module and relates to the technical field of computer vision. It addresses the problem in the prior art that, because deep feature maps have low resolution, salient objects lack detailed information and sensitivity to salient object boundaries is low in complex scenes. The specific technical scheme is as follows: first, a backbone network extracts at least five feature maps from the image to be detected; the feature maps are divided into at least two levels according to their hierarchical order and passed, by level, into at least two feature enhancement modules for feature enhancement to obtain at least two feature level maps; then, the attention weight of each feature channel is obtained, and the at least two feature level maps are weighted according to these attention weights to obtain at least two weighted feature level maps; finally, the at least two weighted feature level maps are input into a feature fusion module for concatenation and fusion to generate a predicted saliency map. The application is used for salient object detection.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and image processing technology, and more specifically to a salient object detection method based on a channel attention module.

Background Technology

[0002] Salient object detection is essentially a region of interest (ROI) extraction algorithm. Its purpose is to simulate the human eye's visual attention mechanism, locating the most attention-grabbing regions in an image and discarding redundant and useless information. Then, it extracts the object contour pixel-by-pixel from these located regions. In recent years, with the development of computer vision technology, salient object detection has been widely applied to computer vision tasks such as visual tracking, image retrieval, and semantic segmentation.

[0003] Salient object detection is generally divided into traditional salient object detection and deep learning-based salient object detection. Traditional salient object detection typically relies on manually crafted saliency priors, such as low-level features like color, texture, and image gradients, to calculate object saliency. The drawbacks of this approach are that manually crafting features is time-consuming and detection efficiency is low. Furthermore, because low-level features lack semantic information, its generalization is poor and its detection performance is limited by the application scenario: it performs poorly in complex scenes with multiple targets, noise interference, and unclear background structures. In contrast, deep learning-based salient object detection methods effectively overcome the time consumption and poor generalization of traditional methods. They eliminate the need for manually crafted features, directly generating deep feature maps containing semantic information through deep neural networks. Therefore, deep learning-based salient object detection methods can quickly locate salient objects in relatively complex scenes without using any prior information. However, deep learning-based salient object detection methods also have shortcomings. In particular, the resolution of deep feature maps is relatively low, resulting in a lack of detailed information about salient objects and low sensitivity to salient object boundaries in complex scenes.

Summary of the Invention

[0004] This invention provides a salient object detection method based on a channel attention module. It addresses the problem that, in existing deep learning-based salient object detection methods, the low resolution of deep feature maps results in a lack of detailed information about salient objects and low sensitivity to salient object boundaries in complex scenes. The technical solution is as follows:

[0005] The present invention provides a salient target detection method based on a channel attention module, the method comprising:

[0006] The image to be detected is extracted into at least five feature maps using a backbone network, and the at least five feature maps are divided into at least two levels according to the hierarchical order.

[0007] The at least five feature maps are passed to at least two corresponding feature enhancement modules according to their levels to perform feature enhancement and obtain at least two feature level maps.

[0008] The attention weight for each feature channel is obtained based on the deepest feature map among the at least five feature maps.

[0009] The at least two feature hierarchy maps are weighted according to the attention weight of each feature channel to obtain at least two weighted feature hierarchy maps;

[0010] The at least two weighted feature hierarchy maps are input into the feature fusion module for splicing and fusion to generate a predicted saliency map.

[0011] The salient object detection method based on a channel attention module provided by this invention first extracts at least five layers of feature maps from the image to be detected through a backbone network, and divides these at least five layers of feature maps into at least two levels according to their hierarchical order. Then, the at least five layers of feature maps are fed, according to their levels, into at least two corresponding feature enhancement modules for feature enhancement to obtain at least two feature level maps. Attention weights for each feature channel are obtained based on the deepest feature map among the at least five layers of feature maps. The at least two feature level maps are weighted according to the attention weights of each feature channel to obtain at least two weighted feature level maps. Finally, the at least two weighted feature level maps are input into a feature fusion module for concatenation and fusion to generate a predicted saliency map. This invention employs a multi-level feature map fusion method, which retains local spatial information to a large extent while acquiring semantic information. The feature enhancement module performs multiple interactions on feature maps of different levels, making the fused features semantically richer and spatially more accurate. In addition, the channel attention module obtains the sensitivity of each feature channel in the feature map to salient objects and assigns each channel an attention weight according to that sensitivity, which effectively filters out redundant channels with low sensitivity to salient objects and further improves the performance of the entire detection network in predicting saliency maps. Finally, by raising the resolution of the high-level weighted feature map before fusion, the feature fusion module better represents the edge information of salient objects in the saliency map.

[0012] As a further aspect of the present invention: before extracting at least five feature maps from the image to be detected through the backbone network, the method further includes:

[0013] The image to be detected is preprocessed, specifically including adjusting the resolution and pixel values of the image to be detected.

[0014] As a further embodiment of the present invention: the backbone network is one of the following: Darknet network, VGG network, ResNet network, Inception network, and DenseNet network.

[0015] As a further aspect of the present invention: the at least five feature maps are respectively fed into at least two corresponding feature enhancement modules for feature enhancement to obtain at least two feature level maps, including:

[0016] The at least five layers of feature maps are fed into at least two corresponding feature enhancement modules according to their levels to obtain at least five layers of enhanced feature maps;

[0017] The at least five enhanced feature maps are spliced together according to their respective levels to obtain at least two feature level maps.

[0018] As a further aspect of the present invention: obtaining the attention weight of each feature channel based on the deepest feature map of the at least five feature maps includes:

[0019] The deepest feature map in the at least five feature maps is used to generate a global information feature map of at least one scale through spatial pyramid pooling or fast spatial pyramid pooling.

[0020] The global information feature maps at at least one scale are concatenated by channel to obtain a fused feature map;

[0021] The attention weights for each feature channel are obtained based on the fused feature map.

[0022] As a further aspect of the present invention: obtaining the attention weights for each feature channel based on the fused feature map includes:

[0023] Calculate the feature mean of each channel based on the fused feature map;

[0024] Calculate the feature mean vector of each channel based on the feature mean of each channel;

[0025] The attention weight for each feature channel is calculated based on the feature mean vector of the channel.

[0026] As a further aspect of the present invention: the calculation of the feature mean of each channel based on the fused feature map is performed according to a first formula, the first formula including:

[0027] $$sq_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i,j)$$

[0028] where $sq_c$ represents the feature mean of the c-th channel in the fused feature map, c = 1, 2, ..., n, and n represents the number of channels in the feature map; H represents the height of the fused feature map; W represents the width of the fused feature map; $u_c$ represents the feature of the c-th channel in the fused feature map; and $u_c(i,j)$ represents the feature value of $u_c$ at coordinate (i,j);

[0029] The step of calculating the feature mean vector of each channel based on the feature mean of each channel is performed according to a second formula, which includes:

[0030] $$V_{sq} = [sq_1, sq_2, \ldots, sq_n]$$

[0031] where $V_{sq}$ represents the feature mean vector of the channels; $sq_1$ represents the feature mean of the first channel in the fused feature map; $sq_2$ represents the feature mean of the second channel in the fused feature map; and $sq_n$ represents the feature mean of the n-th channel in the fused feature map.
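As a small worked illustration of the first and second formulas, consider a fused feature map with n = 2 channels and H = W = 2. The numbers below are invented purely for illustration and do not come from the patent.

```latex
u_1 = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad
u_2 = \begin{pmatrix} 0 & 0 \\ 2 & 2 \end{pmatrix}

sq_1 = \frac{1}{2 \times 2}\,(1 + 2 + 3 + 4) = 2.5, \qquad
sq_2 = \frac{1}{2 \times 2}\,(0 + 0 + 2 + 2) = 1

V_{sq} = [sq_1, sq_2] = [2.5,\ 1]
```

Each channel is thus reduced to a single number, so the feature mean vector has exactly one entry per channel regardless of H and W.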

[0032] As a further aspect of the present invention: the step of calculating the attention weight of each feature channel based on the feature mean vector of the channel specifically includes:

[0033] $$V_{aw} = \mathrm{Softmax}(W_2\,\delta(W_1 V_{sq}))$$

[0034] where

[0035] $$\delta(x) = \max(x, 0)$$

[0036] $V_{aw}$ represents the attention weights of the feature channels after recalibration and normalization to the [0, 1] interval; $W_1$ and $W_2$ are the parameter matrices of the two fully connected layers; and δ represents the ReLU activation function.
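For reference, the normalized exponential function (Softmax) has the standard definition below. The symbol z, standing for the output of the second fully connected layer, is introduced here only for illustration and does not appear in the patent; the sketch also assumes that this output has one entry per feature channel.

```latex
z = W_2\,\delta(W_1 V_{sq}), \qquad
\mathrm{Softmax}(z)_c = \frac{e^{z_c}}{\sum_{k=1}^{n} e^{z_k}}, \quad c = 1, \dots, n
```

Under this definition every weight lies strictly between 0 and 1 and the n weights sum to 1, so increasing the weight of one channel necessarily decreases the weights of the others.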

[0037] As a further aspect of the present invention: the step of weighting the at least two feature hierarchy maps according to the attention weight of each feature channel to obtain at least two weighted feature hierarchy maps includes:

[0038] Multiply the at least two feature hierarchy maps by the attention weight of each feature channel to obtain at least two weighted feature hierarchy maps.

[0039] As a further aspect of the present invention: the step of inputting the at least two weighted feature hierarchy maps into the feature fusion module to generate a predicted saliency map includes:

[0040] After upsampling the higher-level maps in the at least two weighted feature hierarchy maps, the predicted saliency map is generated by splicing and fusing the at least two weighted feature hierarchy maps.

[0041] According to a second aspect of the present invention, a salient target detection device based on a channel attention module is provided. The salient target detection device based on a channel attention module includes a processor and a memory. The memory stores at least one computer instruction, which is loaded and executed by the processor to implement the steps performed in the salient target detection method based on a channel attention module as described above.

[0042] According to a third aspect of the present invention, a computer-readable storage medium is provided, the storage medium storing at least one computer instruction, the instruction being loaded and executed by a processor to perform the steps performed in the salient target detection method based on the channel attention module described in any of the preceding claims.

[0043] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit the invention.

Brief Description of the Drawings

[0044] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.

[0045] Figure 1 is a flowchart of a salient object detection method based on a channel attention module provided in an embodiment of the present invention;

[0046] Figure 2 is a framework diagram of the salient object detection method based on a channel attention module provided in an embodiment of the present invention;

[0047] Figure 3 is a structural diagram of the backbone network in the salient object detection method based on a channel attention module provided in an embodiment of the present invention;

[0048] Figure 4 is a structural diagram of the feature enhancement module in the salient object detection method based on a channel attention module provided in an embodiment of the present invention;

[0049] Figure 5 is a structural diagram of the channel attention module in the salient object detection method based on a channel attention module provided in an embodiment of the present invention;

[0050] Figure 6 is a structural diagram of the feature fusion module in the salient object detection method based on a channel attention module provided in an embodiment of the present invention.

Detailed Implementation

[0051] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems and methods consistent with some aspects of the invention as detailed in the appended claims.

[0052] As shown in Figure 1 and Figure 2, the salient object detection method based on a channel attention module provided in this embodiment of the invention includes the following steps:

[0053] Step 101: Extract the image to be detected into at least five feature maps using a backbone network, and divide the at least five feature maps into at least two levels according to the hierarchical order.

[0054] In practical use, for an input image, the multi-layer feature map output by the backbone network can be divided into two levels, low and high, or into three levels, low, medium and high.

[0055] In one embodiment, before extracting at least five feature maps from the image to be detected via a backbone network, the method further includes:

[0056] Preprocessing the image to be detected includes adjusting its resolution and pixel values.

[0057] In practical use, the image to be detected needs to be preprocessed. The resolution is scaled to 640*640 by linear interpolation, and the pixel value is divided by 255 to shrink to between 0 and 1.
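The preprocessing described above can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: the function names, the (H, W, C) array layout, the pixel-centre alignment convention, and the random demo image are all assumptions made for the example.

```python
import numpy as np

def resize_bilinear(image, out_h, out_w):
    """Resize an (H, W, C) array with bilinear (linear) interpolation."""
    in_h, in_w = image.shape[:2]
    image = image.astype(np.float64)
    # Map each output pixel centre back to input coordinates (assumed convention).
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]  # vertical blend weights
    wx = (xs - x0)[None, :, None]  # horizontal blend weights
    top = image[y0][:, x0] * (1 - wx) + image[y0][:, x1] * wx
    bottom = image[y1][:, x0] * (1 - wx) + image[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def preprocess(image_uint8, size=640):
    """Scale to size x size and map pixel values from [0, 255] to [0, 1]."""
    return resize_bilinear(image_uint8, size, size) / 255.0

# Hypothetical input: a random 480 x 720 RGB image.
demo = np.random.default_rng(0).integers(0, 256, size=(480, 720, 3), dtype=np.uint8)
model_input = preprocess(demo)
print(model_input.shape)
```

In practice an image library's resize routine would normally replace `resize_bilinear`; it is written out here only so the example is self-contained.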

[0058] In one embodiment, the backbone network is one of the following: Darknet network, VGG network, ResNet network, Inception network, and DenseNet network.

[0059] In this embodiment, the backbone network is illustrated using the Darknet network as an example. The feature maps output by the five residual modules of the Darknet network are extracted from the preprocessed image. As shown in Figure 3, feature maps F1 to F5 are extracted from the five residual modules of the backbone network. F1, F2, and F3 are used as low-level feature maps, and F4 and F5 are used as high-level feature maps.
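The division into levels can be pictured with the short sketch below. The helper `split_levels` is hypothetical, and the three-level split shown at the end is only one possible grouping, since the text does not state which maps would form the medium level.

```python
def split_levels(feature_maps, level_sizes):
    """Partition an ordered list of feature maps into consecutive levels."""
    if sum(level_sizes) != len(feature_maps):
        raise ValueError("level sizes must cover every feature map exactly once")
    levels, start = [], 0
    for size in level_sizes:
        levels.append(feature_maps[start:start + size])
        start += size
    return levels

# Placeholder names stand in for the actual feature map arrays.
names = ["F1", "F2", "F3", "F4", "F5"]

# Two levels, as in this embodiment: F1-F3 are low-level, F4-F5 are high-level.
low, high = split_levels(names, [3, 2])

# Three levels (hypothetical grouping, chosen only for illustration).
low3, mid3, high3 = split_levels(names, [2, 2, 1])
print(low, high)
```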

[0060] Step 102: Input at least five feature maps into at least two corresponding feature enhancement modules according to their levels to perform feature enhancement and obtain at least two feature level maps.

[0061] Continuing with the example above, in this embodiment, the low-level feature map and the high-level feature map are input into two feature enhancement modules, FEM_1 and FEM_2, respectively, to enhance the saliency of the feature maps and obtain low and high feature level maps. Correspondingly, if the multi-layer feature map output by the backbone network is divided into three levels—low, medium, and high—the number of feature enhancement modules increases to three, resulting in low, medium, and high feature level maps.

[0062] In one embodiment, passing at least five feature maps hierarchically into at least two corresponding feature enhancement modules for feature enhancement to obtain at least two feature hierarchy maps includes:

[0063] At least five layers of feature maps are passed to at least two corresponding feature enhancement modules according to their levels to obtain at least five layers of enhanced feature maps;

[0064] At least five layers of enhanced feature maps are stitched together according to their respective levels to obtain at least two feature level maps.

[0065] In this embodiment, the feature enhancement module first uses convolution operations to give the input multi-layer feature maps a consistent number of channels. Then, it connects the transmission subnetworks of the feature map layers in parallel and repeatedly adjusts the resolution of each layer through upsampling and downsampling operations. This repeated interaction and fusion between the multi-layer feature maps further enriches the semantic and detail information of salient objects in the feature maps. Finally, the feature maps generated through multiple interactions are adjusted to a uniform resolution, concatenated by channel, and passed through a 1*1 convolution for dimensionality reduction, resulting in the enhanced feature map. The upsampling operation in the feature enhancement module can use linear interpolation or nearest-neighbor interpolation; as an alternative to concatenation followed by dimensionality reduction, the multi-layer feature maps can also be added together element-wise.

[0066] As shown in Figure 4, the input feature maps undergo two interactions in the feature enhancement module. The interaction process for low-level feature maps is as follows: the upper-level feature size remains unchanged, the middle-level feature is upsampled by a factor of 2, and the lower-level feature is upsampled by a factor of 4. The three are then added together to obtain a new upper-level feature. Similarly, the middle-level feature size remains unchanged, the upper-level feature undergoes a 3x3 convolution with a stride of 2, the lower-level feature is upsampled by a factor of 2, and the three are then added together to obtain a new middle-level feature. Finally, the lower-level feature size remains unchanged, the middle-level feature undergoes a 3x3 convolution with a stride of 2, and the upper-level feature undergoes two 3x3 convolutions with a stride of 2. The three are then added together to obtain a new lower-level feature. The interaction process for high-level feature maps is similar to that for low-level feature maps.
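The interaction for three low-level feature maps can be sketched as below. This is an illustrative NumPy version under several assumptions not fixed by the text: nearest-neighbor upsampling, zero padding of 1 for the stride-2 convolutions, random untrained weights, a (C, H, W) layout, and small demo sizes. All function and key names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) array."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv3x3_stride2(x, weight):
    """3x3 convolution, stride 2, zero padding 1. weight: (C_out, C_in, 3, 3)."""
    _, h, w = x.shape
    out_h, out_w = h // 2, w // 2
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], out_h, out_w))
    for di in range(3):
        for dj in range(3):
            # Input samples seen by kernel tap (di, dj) at every output position.
            patch = padded[:, di:di + 2 * out_h:2, dj:dj + 2 * out_w:2]
            out += np.einsum("oc,chw->ohw", weight[:, :, di, dj], patch)
    return out

def interact(upper, middle, lower, w):
    """One interaction round: every level receives the other two, resized to match."""
    new_upper = upper + upsample_nearest(middle, 2) + upsample_nearest(lower, 4)
    new_middle = (middle + conv3x3_stride2(upper, w["up_to_mid"])
                  + upsample_nearest(lower, 2))
    twice_down = conv3x3_stride2(conv3x3_stride2(upper, w["up_to_low_a"]),
                                 w["up_to_low_b"])
    new_lower = lower + conv3x3_stride2(middle, w["mid_to_low"]) + twice_down
    return new_upper, new_middle, new_lower

rng = np.random.default_rng(0)
C = 4  # demo channel count, equal on all levels after the initial convolution
f1 = rng.normal(size=(C, 16, 16))  # upper level (highest resolution)
f2 = rng.normal(size=(C, 8, 8))    # middle level
f3 = rng.normal(size=(C, 4, 4))    # lower level
keys = ("up_to_mid", "mid_to_low", "up_to_low_a", "up_to_low_b")
for _ in range(2):  # two interaction rounds
    weights = {k: rng.normal(scale=0.1, size=(C, C, 3, 3)) for k in keys}
    f1, f2, f3 = interact(f1, f2, f3, weights)
print(f1.shape, f2.shape, f3.shape)
```

Each level keeps its own resolution across rounds, which is what allows the interaction to be repeated.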

[0067] For the low-level feature maps f1, f2, and f3 generated after two interactions, the resolution of f1 and f3 is adjusted to the same size as f2. Then, the three are concatenated by channel, and a 1*1 convolution is used to reduce the number of channels in the concatenated feature map, resulting in an enhanced low-level feature map $F_L$ with a size of 160*160*128. For the generated high-level feature maps f4 and f5, a similar operation is used to obtain an enhanced high-level feature map $F_H$ with a size of 40*40*128.
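The alignment, concatenation, and 1*1 reduction for the low level might look like the following sketch. The text does not say how f1 is shrunk to the size of f2, so 2x2 average pooling is assumed here, along with nearest-neighbor upsampling for f3, random weights, and small demo sizes; the helper names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) array."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def downsample_avg(x, factor):
    """Average pooling with a factor x factor window (assumed downsampling method)."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def conv1x1(x, weight):
    """1x1 convolution: a per-pixel linear map over channels. weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def merge_level(f1, f2, f3, weight):
    """Bring f1 and f3 to the resolution of f2, concatenate by channel, reduce channels."""
    aligned = [downsample_avg(f1, 2), f2, upsample_nearest(f3, 2)]
    return conv1x1(np.concatenate(aligned, axis=0), weight)

rng = np.random.default_rng(0)
C = 4
f1 = rng.normal(size=(C, 16, 16))
f2 = rng.normal(size=(C, 8, 8))
f3 = rng.normal(size=(C, 4, 4))
reduce_w = rng.normal(scale=0.1, size=(C, 3 * C))  # 3C channels in, C channels out
f_low = merge_level(f1, f2, f3, reduce_w)
print(f_low.shape)
```

In the embodiment the corresponding output would be 160*160*128 rather than the small demo shape printed here.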

[0068] Step 103: Obtain the attention weight for each feature channel based on the deepest feature map in at least five feature maps.

[0069] In one embodiment, obtaining the attention weight for each feature channel based on the deepest feature map among at least five feature maps includes:

[0070] Generate a global information feature map of at least one scale from the deepest feature map of at least five feature maps using spatial pyramid pooling or fast spatial pyramid pooling.

[0071] By concatenating global information feature maps of at least one scale along channels, a fused feature map is obtained.

[0072] The attention weights for each feature channel are obtained from the fused feature map.

[0073] In this embodiment, since not all channels in the enhanced feature map are highly correlated with the salient target, a channel attention module is used to process the global information of the image to further determine the importance of each channel in the enhanced feature map to the salient target. The channel attention module uses a spatial pyramid attention segmentation structure to establish an efficient channel attention mechanism. Its working principle is to further obtain multi-scale global information features from the deepest feature map output by the backbone network through spatial pyramid pooling (SPP). Then, based on the channel attention module, channel attention is extracted from feature maps at different scales, resulting in channel attention vectors at each scale. Finally, a normalized exponential function (Softmax) is used to recalibrate all channel attention vectors, generating attention weights after multi-scale channel interaction. These weights are then applied to the enhanced feature map to reduce the negative impact of feature channels with low correlation to the salient target on subsequent salient map prediction. Alternatively, fast spatial pyramid pooling can be used to further improve the efficiency of the channel attention module in generating channel attention weights.

[0074] In actual use, as shown in Figure 5, for the input highest-level feature map F5, multi-scale global information feature maps X1 to X4 are first generated through max pooling layers with kernel sizes of 3*3, 5*5, and 7*7 in spatial pyramid pooling, where each feature map has 64 channels. Then, these feature maps are concatenated by channel, and a 1*1 convolution is used to reduce the number of channels, resulting in the multi-scale fused feature map $F_X$. Then, based on the feature map $F_X$, the corresponding attention weight for each channel is generated.
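A rough NumPy sketch of this pooling-and-fusion step is given below. The text names three kernel sizes but four maps X1 to X4; the sketch assumes, as in common spatial pyramid pooling blocks, that the fourth map is the unpooled input, which the patent does not state. Stride 1 with "same" padding, the random weights, the demo sizes, and all function names are likewise assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool_same(x, k):
    """k x k max pooling, stride 1, padded so the output keeps the (C, H, W) shape."""
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))  # (C, H, W, k, k)
    return windows.max(axis=(-2, -1))

def spp_fuse(f5, reduce_weight, kernels=(3, 5, 7)):
    """Concatenate the input with its pooled versions, then reduce channels with a 1x1 conv."""
    # Assumption: X1 is the unpooled input; X2..X4 are the pooled maps.
    branches = [f5] + [max_pool_same(f5, k) for k in kernels]
    stacked = np.concatenate(branches, axis=0)
    return np.einsum("oc,chw->ohw", reduce_weight, stacked)

rng = np.random.default_rng(0)
C = 8                                   # demo channel count (64 in the embodiment)
f5 = rng.normal(size=(C, 10, 10))
reduce_w = rng.normal(scale=0.1, size=(C, 4 * C))
f_x = spp_fuse(f5, reduce_w)
print(f_x.shape)
```

Because larger kernels see a wider neighborhood, each pooled branch summarizes the deepest feature map at a different spatial scale before the channels are fused.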

[0075] In one embodiment, obtaining the attention weight for each feature channel based on the fused feature map includes:

[0076] Calculate the feature mean of each channel based on the fused feature map;

[0077] Calculate the feature mean vector of each channel based on the feature mean of each channel;

[0078] The attention weight for each feature channel is calculated based on the feature mean vector of the channel.

[0079] Specifically, the feature mean of each channel of $F_X$ is first calculated using global average pooling, and the feature mean vector of the n channels of the feature map is then formed. After global average pooling, two fully connected layers perform two successive mappings on this feature mean vector. The normalized exponential function Softmax is then used to recalibrate the mapped channel attention information into weights.

[0080] In one embodiment, the feature mean of each channel is calculated based on the fused feature map according to a first formula, which includes:

[0081] $$sq_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i,j)$$

[0082] where $sq_c$ represents the feature mean of the c-th channel in the fused feature map, c = 1, 2, ..., n, and n represents the number of channels in the feature map; H represents the height of the fused feature map; W represents the width of the fused feature map; $u_c$ represents the feature of the c-th channel in the fused feature map; and $u_c(i,j)$ represents the feature value of $u_c$ at coordinate (i,j);

[0083] The feature mean vector of each channel is calculated based on its feature mean, according to the second formula, which includes:

[0084] $$V_{sq} = [sq_1, sq_2, \ldots, sq_n]$$

[0085] where $V_{sq}$ represents the feature mean vector of the channels; $sq_1$ represents the feature mean of the first channel in the fused feature map; $sq_2$ represents the feature mean of the second channel in the fused feature map; and $sq_n$ represents the feature mean of the n-th channel in the fused feature map.

[0086] In one embodiment, the attention weight for each feature channel is calculated based on the feature mean vector of the channel, specifically including:

[0087] $$V_{aw} = \mathrm{Softmax}(W_2\,\delta(W_1 V_{sq}))$$

[0088] where

[0089] $$\delta(x) = \max(x, 0)$$

[0090] $V_{aw}$ represents the attention weights of the feature channels after recalibration and normalization to the [0, 1] interval; $W_1$ and $W_2$ are the parameter matrices of the two fully connected layers; and δ represents the ReLU activation function.

[0091] In practical applications, the sigmoid function can be used to replace the softmax function to normalize the channel attention vector.
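The three formulas and the sigmoid alternative can be put together in a short NumPy sketch. The hidden width of the first fully connected layer, the absence of bias terms, the random weights, and the function names are assumptions for illustration; the patent does not specify them.

```python
import numpy as np

def channel_means(fused):
    """First and second formulas: per-channel mean of a (n, H, W) map, as a length-n vector."""
    return fused.mean(axis=(1, 2))

def softmax(z):
    """Normalized exponential, shifted by the maximum for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def channel_attention(fused, w1, w2, use_sigmoid=False):
    """Third formula: FC -> ReLU -> FC -> Softmax (or sigmoid as the stated alternative)."""
    v_sq = channel_means(fused)
    hidden = np.maximum(w1 @ v_sq, 0.0)   # delta(x) = max(x, 0)
    z = w2 @ hidden
    if use_sigmoid:
        return 1.0 / (1.0 + np.exp(-z))
    return softmax(z)

rng = np.random.default_rng(0)
n, hidden_width = 8, 4                    # hidden_width is a hypothetical choice
fused = rng.normal(size=(n, 6, 6))
w1 = rng.normal(size=(hidden_width, n))   # first fully connected layer
w2 = rng.normal(size=(n, hidden_width))   # second fully connected layer
v_aw = channel_attention(fused, w1, w2)
v_aw_sigmoid = channel_attention(fused, w1, w2, use_sigmoid=True)
print(v_aw.sum())
```

The two normalizations behave differently: Softmax weights compete with each other and sum to 1, whereas sigmoid weights are computed independently per channel, so several channels can receive weights near 1 at the same time.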

[0092] Step 104: Weight the at least two feature level maps according to the attention weight of each feature channel to obtain at least two weighted feature level maps.

[0093] In one embodiment, obtaining at least two weighted feature hierarchy maps by weighting at least two feature hierarchy maps according to the attention weight of each feature channel includes:

[0094] Multiply at least two feature hierarchy maps by the attention weights of each feature channel to obtain at least two weighted feature hierarchy maps.

[0095] Specifically, $F_L$ and $F_H$ are each multiplied by $V_{aw}$ to obtain the weighted low-level feature map $F_{RL}$ and the weighted high-level feature map $F_{RH}$. The weighted feature level maps retain the feature channels that carry salient object information.
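The channel-wise multiplication can be illustrated with NumPy broadcasting. The sketch assumes a (C, H, W) layout and that the number of attention weights equals the channel count of both feature level maps; the function name and toy values are hypothetical.

```python
import numpy as np

def apply_channel_weights(feature_map, weights):
    """Scale every channel of a (C, H, W) map by its attention weight."""
    if feature_map.shape[0] != weights.shape[0]:
        raise ValueError("one attention weight is required per channel")
    # Reshape the weights to (C, 1, 1) so they broadcast over height and width.
    return feature_map * weights[:, None, None]

f_l = np.ones((3, 4, 4))          # stand-in for the enhanced low-level map
f_h = np.ones((3, 2, 2))          # stand-in for the enhanced high-level map
v_aw = np.array([0.7, 0.3, 0.0])  # toy attention weights
f_rl = apply_channel_weights(f_l, v_aw)
f_rh = apply_channel_weights(f_h, v_aw)
print(f_rl[:, 0, 0], f_rh[:, 0, 0])
```

The same weight vector is applied to both maps regardless of their spatial size, and a channel whose weight is close to zero contributes almost nothing to the later fusion.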

[0096] Step 105: Input at least two weighted feature level maps into the feature fusion module for splicing and fusion to generate a predicted saliency map.

[0097] In one embodiment, inputting at least two weighted feature hierarchy maps into a feature fusion module to generate a predicted saliency map includes:

[0098] After upsampling the higher-level maps in at least two weighted feature hierarchy maps, a predicted saliency map is generated by splicing and fusing the at least two weighted feature hierarchy maps.

[0099] In this embodiment, as shown in Figure 6, the weighted feature maps are input into the feature fusion module to generate the final predicted saliency map. The feature fusion module upsamples the input high-level feature map to the same size as the low-level feature map, then concatenates the two feature maps by channel and reduces their dimensionality to generate a low-level feature map that incorporates global information. This process is then repeated to obtain the final predicted saliency map. Specifically, $F_{RH}$ is upsampled by a factor of 4 and concatenated with $F_{RL}$ by channel, and the result is reduced in dimensionality to generate $F'_{RL}$. Then, $F'_{RL}$ and $F_{RH}$ are upsampled by factors of 4 and 16 respectively, concatenated by channel, and reduced in dimensionality to generate a predicted saliency map with a final size of 640*640*1.
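The two fusion stages can be sketched in NumPy as below. The upsampling method and the form of the dimensionality reduction are not specified in this paragraph, so nearest-neighbor upsampling and 1x1 convolutions are assumed, with random weights and small demo sizes (16 and 4 in place of 160 and 40). All names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) array."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, weight):
    """1x1 convolution used here as the dimensionality reduction. weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def fuse(f_rl, f_rh, w_mid, w_out):
    """Two-stage fusion of the weighted low-level and high-level maps."""
    # Stage 1: bring the high-level map to the low-level resolution and merge.
    stage1 = np.concatenate([f_rl, upsample_nearest(f_rh, 4)], axis=0)
    f_rl_prime = conv1x1(stage1, w_mid)
    # Stage 2: bring both maps to the input resolution and merge into one channel.
    stage2 = np.concatenate([upsample_nearest(f_rl_prime, 4),
                             upsample_nearest(f_rh, 16)], axis=0)
    return conv1x1(stage2, w_out)

rng = np.random.default_rng(0)
C = 8
f_rl = rng.normal(size=(C, 16, 16))  # 160 x 160 in the embodiment
f_rh = rng.normal(size=(C, 4, 4))    # 40 x 40 in the embodiment
w_mid = rng.normal(scale=0.1, size=(C, 2 * C))
w_out = rng.normal(scale=0.1, size=(1, 2 * C))
saliency = fuse(f_rl, f_rh, w_mid, w_out)
print(saliency.shape)
```

With the embodiment's sizes the same factors give 160 x 4 = 640 and 40 x 16 = 640, which matches the stated 640*640*1 output.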

[0100] This invention enhances and fuses multi-level feature maps output by a backbone network, utilizing features that simultaneously contain high-level semantic information and low-level detail information to detect salient targets. A channel attention module is embedded on top of this, processing the highest-level feature map output by the backbone network to mine the correlation between each feature channel in the fused feature map and the salient target, and assigning a weight to each feature channel based on the degree of correlation. Feature channels with higher weights are then used in subsequent detection tasks, while feature channels with lower weights are filtered out, allowing the network to achieve better detection performance for salient targets.

[0101] In summary, the salient object detection method based on a channel attention module provided by this invention first extracts at least five layers of feature maps from the image to be detected using a backbone network, and then divides these at least five layers of feature maps into at least two levels according to their hierarchical order. Next, the at least five layers of feature maps are fed, according to their levels, into at least two corresponding feature enhancement modules for feature enhancement, resulting in at least two feature level maps. Then, the attention weight of each feature channel is obtained based on the deepest feature map among the at least five layers. The at least two feature level maps are weighted according to the attention weights of each feature channel to obtain at least two weighted feature level maps. Finally, the at least two weighted feature level maps are input into a feature fusion module for concatenation and fusion to generate a predicted saliency map. This invention employs a multi-level feature map fusion method, which retains local spatial information to a large extent while acquiring semantic information. The feature enhancement module performs multiple interactions on feature maps of different levels, making the fused features semantically richer and spatially more accurate. In addition, the channel attention module obtains the sensitivity of each feature channel in the feature map to salient objects and assigns each channel an attention weight according to that sensitivity, which effectively filters out redundant channels with low sensitivity to salient objects and further improves the performance of the entire detection network in predicting saliency maps. Finally, by raising the resolution of the high-level weighted feature map before fusion, the feature fusion module better represents the edge information of salient objects in the saliency map.

[0102] Based on the salient object detection method based on a channel attention module described in the embodiment corresponding to Figure 1 above, another embodiment of the present invention provides a salient object detection device based on a channel attention module. This device includes a processor and a memory, wherein the memory stores at least one computer instruction, which is loaded and executed by the processor to implement the salient object detection method based on a channel attention module described in the embodiment corresponding to Figure 1 above.

[0103] Based on the salient target detection method based on a channel attention module described in the embodiment corresponding to Figure 1 above, an embodiment of the present invention also provides a computer-readable storage medium. For example, the non-transitory computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device. The storage medium stores at least one computer instruction for executing the salient target detection method based on a channel attention module described in the embodiment corresponding to Figure 1, which will not be repeated here.

[0104] Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of the invention are indicated by the following claims.

[0105] It should be understood that the present invention is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims

1. A salient target detection method based on a channel attention module, characterized in that the method includes: extracting at least five feature maps from an image to be detected using a backbone network, and dividing the at least five feature maps into at least two levels according to their hierarchical order; passing the at least five feature maps, according to their levels, into at least two corresponding feature enhancement modules for feature enhancement to obtain at least two feature level maps; obtaining an attention weight for each feature channel based on the deepest feature map among the at least five feature maps; weighting the at least two feature level maps according to the attention weight of each feature channel to obtain at least two weighted feature level maps; and inputting the at least two weighted feature level maps into a feature fusion module for splicing and fusion to generate a predicted saliency map; wherein the step of obtaining the attention weight for each feature channel based on the deepest feature map among the at least five feature maps includes: generating global information feature maps of at least one scale from the deepest feature map among the at least five feature maps through spatial pyramid pooling or fast spatial pyramid pooling; concatenating the global information feature maps of the at least one scale by channel to obtain a fused feature map; and obtaining the attention weight of each feature channel based on the fused feature map; and wherein the step of obtaining the attention weight for each feature channel based on the fused feature map includes: calculating the feature mean of each channel based on the fused feature map; calculating the feature mean vector of the channels based on the feature mean of each channel; and calculating the attention weight of each feature channel based on the feature mean vector of the channels.
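Illustrative sketch (not part of the claims): the pooling and channel-wise concatenation step recited above could look roughly like the following plain-Python example. It assumes a YOLO-style spatial pyramid pooling, that is, stride-1 max pooling that preserves the spatial size, plus an identity branch; the claim does not fix the pooling type or the kernel sizes, so `max_pool_same`, `spp_concat`, and the kernel sizes (3, 5) are hypothetical choices made only for illustration.

```python
def max_pool_same(channel, k):
    """Stride-1 max pooling over a k x k window, clipped at the borders,
    so the output keeps the same H x W as the input channel."""
    h, w = len(channel), len(channel[0])
    r = k // 2
    return [
        [
            max(
                channel[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            )
            for j in range(w)
        ]
        for i in range(h)
    ]


def spp_concat(feature_map, kernel_sizes=(3, 5)):
    """Concatenate the input channels and their pooled versions by channel.

    feature_map is a list of channels, each an H x W nested list.
    The identity branch and the kernel sizes are assumptions.
    """
    fused = list(feature_map)  # identity branch (assumed)
    for k in kernel_sizes:
        fused += [max_pool_same(ch, k) for ch in feature_map]
    return fused


# Toy "deepest feature map": one channel of size 3 x 3.
deepest = [[[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]]
fused = spp_concat(deepest)  # 3 channels: identity, k=3, k=5
```

Because every branch keeps the same height and width, the branches can be stacked along the channel axis without any resizing, which is what makes the later per-channel averaging straightforward.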

2. The salient target detection method based on a channel attention module according to claim 1, characterized in that, before extracting the at least five feature maps from the image to be detected using the backbone network, the method further includes: preprocessing the image to be detected, specifically including adjusting the resolution and pixel values of the image to be detected.
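Illustrative sketch (not part of the claims): one possible form of the preprocessing above, assuming nearest-neighbor resizing for the resolution adjustment and division by 255 for the pixel-value adjustment. The claim names neither technique, so both choices and the helper names `resize_nearest`, `scale_pixels`, and `preprocess` are assumptions.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a single-channel image (H x W nested list) by nearest neighbor."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def scale_pixels(image, max_value=255.0):
    """Map pixel values from [0, max_value] to [0, 1] (assumed normalization)."""
    return [[p / max_value for p in row] for row in image]


def preprocess(image, out_h, out_w):
    """Adjust resolution first, then pixel values."""
    return scale_pixels(resize_nearest(image, out_h, out_w))


# Toy 2 x 2 grayscale image resized to 4 x 4 and scaled to [0, 1].
small = [[0, 255],
         [255, 0]]
ready = preprocess(small, 4, 4)
```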

3. The salient target detection method based on a channel attention module according to claim 1, characterized in that the backbone network is one of the following: Darknet, VGG, ResNet, Inception, or DenseNet.

4. The salient target detection method based on a channel attention module according to claim 1, characterized in that the step of passing the at least five feature maps into at least two corresponding feature enhancement modules according to their levels to obtain at least two feature level maps includes: feeding the at least five feature maps into the at least two corresponding feature enhancement modules according to their levels to obtain at least five enhanced feature maps; and splicing the at least five enhanced feature maps together according to their respective levels to obtain the at least two feature level maps.
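Illustrative sketch (not part of the claims): the level-wise splicing above can be pictured as grouping the enhanced feature maps by level and stacking each group along the channel axis. The example assumes that the first three maps form the low level and the last two the high level, and that all maps within a level already share the same height and width; neither assumption comes from the claim, and `split_levels` and `concat_channels` are hypothetical helpers.

```python
def split_levels(feature_maps, num_low=3):
    """Divide an ordered list of feature maps into (low level, high level).

    The split position num_low is an assumption for illustration.
    """
    return feature_maps[:num_low], feature_maps[num_low:]


def concat_channels(feature_maps):
    """Splice feature maps (each a list of H x W channels) by channel."""
    h, w = len(feature_maps[0][0]), len(feature_maps[0][0][0])
    spliced = []
    for fmap in feature_maps:
        for ch in fmap:
            if len(ch) != h or len(ch[0]) != w:
                raise ValueError("maps in one level must share H and W")
            spliced.append(ch)
    return spliced


# Five toy enhanced maps, each with one 1 x 2 channel.
enhanced = [[[[k, k]]] for k in range(5)]
low, high = split_levels(enhanced)
low_level_map = concat_channels(low)    # 3 channels
high_level_map = concat_channels(high)  # 2 channels
```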

5. The salient target detection method based on a channel attention module according to claim 1, characterized in that the feature mean of each channel is calculated based on the fused feature map according to a first formula:

$$sq_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i,j)$$

where $sq_c$ represents the feature mean of the $c$-th channel of the fused feature map; $c$ is 0, 1, 2, ..., $n$, and $n$ indicates the number of channels in the feature map; $H$ indicates the height of the fused feature map; $W$ indicates the width of the fused feature map; $u_c$ represents the feature of the $c$-th channel of the fused feature map; and $u_c(i,j)$ represents the feature value of $u_c$ at coordinates $(i,j)$; and the feature mean vector of the channels is calculated based on the feature mean of each channel according to a second formula:

$$V_{sq} = [sq_1, sq_2, \ldots, sq_n]$$

where $V_{sq}$ represents the feature mean vector of the channels; $sq_1$ represents the feature mean of the first channel of the fused feature map; $sq_2$ represents the feature mean of the second channel of the fused feature map; and $sq_n$ represents the feature mean of the $n$-th channel of the fused feature map.
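Illustrative sketch (not part of the claims): the per-channel averaging and the collection of the means into one vector, as described above, amount to a global average over the H x W positions of each channel. The function name `channel_means` and the toy values are hypothetical.

```python
def channel_means(fused):
    """Return the feature mean vector: one spatial mean per channel.

    fused is a list of channels, each an H x W nested list.
    """
    v_sq = []
    for u_c in fused:
        h, w = len(u_c), len(u_c[0])
        total = sum(sum(row) for row in u_c)  # sum of u_c(i, j) over all i, j
        v_sq.append(total / (h * w))
    return v_sq


# Two channels of size 2 x 2.
fused = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 8]],
]
v_sq = channel_means(fused)  # [2.5, 2.0]
```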

6. The salient target detection method based on a channel attention module according to claim 1, characterized in that the attention weight of each feature channel is calculated based on the feature mean vector of the channels according to:

$$v_{aw} = \sigma\left(W_2\,\delta\left(W_1 V_{sq}\right)\right)$$

where $v_{aw}$ denotes the attention weights of the feature channels after recalibration and normalization to the [0, 1] interval; $W_1$ and $W_2$ are the parameter matrices of two fully connected layers; $\delta$ represents the activation function ReLU; and $\sigma$ represents the normalization function that maps the result to the [0, 1] interval.
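Illustrative sketch (not part of the claims): a minimal version of the weight calculation above, with two fully connected layers and a ReLU between them. It assumes that the normalization to the [0, 1] interval is a sigmoid and that the layers have no bias terms; the claim states only that the weights are normalized to [0, 1]. The matrices `w1` and `w2` below are hypothetical values, not learned parameters.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(v):
    # Assumed normalization to the [0, 1] interval.
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def attention_weights(v_sq, w1, w2):
    """First FC layer, ReLU, second FC layer, then normalization."""
    hidden = relu(matvec(w1, v_sq))
    return sigmoid(matvec(w2, hidden))


v_sq = [2.5, 2.0]            # feature mean vector with two channels
w1 = [[1.0, -1.0]]           # hypothetical 1 x 2 matrix (first FC layer)
w2 = [[1.0], [-1.0]]         # hypothetical 2 x 1 matrix (second FC layer)
v_aw = attention_weights(v_sq, w1, w2)
```

With these toy matrices the first channel receives a weight above 0.5 and the second a weight below 0.5, which illustrates how channels can be emphasized or suppressed.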

7. The salient target detection method based on a channel attention module according to claim 1, characterized in that the step of weighting the at least two feature level maps according to the attention weight of each feature channel to obtain at least two weighted feature level maps includes: multiplying the at least two feature level maps by the attention weight of each feature channel to obtain the at least two weighted feature level maps.
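Illustrative sketch (not part of the claims): the multiplication above can be read as scaling every value of a channel by that channel's attention weight. The example assumes the feature level map has exactly as many channels as there are weights; how the channel counts are matched in practice is not stated here, and `apply_channel_weights` is a hypothetical helper.

```python
def apply_channel_weights(level_map, weights):
    """Scale each channel of a feature level map by its attention weight.

    level_map is a list of channels (H x W nested lists);
    weights holds one scalar per channel.
    """
    if len(level_map) != len(weights):
        raise ValueError("one weight per channel is required")
    return [
        [[w * value for value in row] for row in channel]
        for channel, w in zip(level_map, weights)
    ]


level_map = [[[1, 2]], [[3, 4]]]   # two channels of size 1 x 2
weighted = apply_channel_weights(level_map, [0.5, 0.0])
# A weight of 0.0 suppresses the second channel entirely.
```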

8. The salient target detection method based on a channel attention module according to claim 1, characterized in that the step of inputting the at least two weighted feature level maps into the feature fusion module to generate a predicted saliency map includes: upsampling the higher-level map among the at least two weighted feature level maps, and then splicing and fusing the at least two weighted feature level maps to generate the predicted saliency map.
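Illustrative sketch (not part of the claims): the upsample-then-splice step above, assuming nearest-neighbor upsampling by an integer factor so that the higher-level map reaches the resolution of the lower-level map. The interpolation method, the factor, and the helper names `upsample_nearest` and `splice_levels` are assumptions; the fusion layers that would follow the splicing are omitted because they are not detailed here.

```python
def upsample_nearest(fmap, factor):
    """Repeat every value factor times along both height and width."""
    return [
        [
            [value for value in row for _ in range(factor)]
            for row in channel
            for _ in range(factor)
        ]
        for channel in fmap
    ]


def splice_levels(low_map, high_map, factor):
    """Upsample the higher-level map, then concatenate both maps by channel."""
    up = upsample_nearest(high_map, factor)
    low_size = (len(low_map[0]), len(low_map[0][0]))
    up_size = (len(up[0]), len(up[0][0]))
    if low_size != up_size:
        raise ValueError("spatial sizes must match after upsampling")
    return low_map + up


low_map = [[[1, 2], [3, 4]]]   # one channel, 2 x 2
high_map = [[[9]]]             # one channel, 1 x 1
spliced = splice_levels(low_map, high_map, 2)
```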