A feature extraction method and apparatus

By employing multiple convolutional layers and dilated convolutional layers in the deep learning object detection algorithm to generate feature maps of different scales, and utilizing an attention mechanism to adjust the focus, the problem of low detection accuracy for small objects in multi-scale object detection is solved, achieving accurate detection of both large and small objects.

CN115294361B · Active · Publication Date: 2026-06-16 · CHINA TELECOM CLOUD TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA TELECOM CLOUD TECH CO LTD
Filing Date: 2022-07-15
Publication Date: 2026-06-16

Application Information


IPC: G06V10/52; G06V10/82; G06N3/045; G06N3/0464; G06N3/0499; G06N3/08

CPC: G06V10/52; G06V10/82; G06N3/08

AI Tagging

Application Domain

Character and pattern recognition; Neural learning methods


Abstract

Embodiments of the present application relate to a feature extraction method and device. The method comprises: performing convolution processing on a target image using a plurality of convolution layers to obtain a first feature map; inputting the first feature map into a plurality of first dilated convolution layers with different dilation coefficients respectively to perform convolution processing, to obtain N second feature maps representing different scales; determining N scale ratio coefficients corresponding to the N second feature maps respectively based on an attention mechanism; and performing feature extraction based on the N second feature maps and the scale ratio coefficients corresponding to the N second feature maps respectively, to obtain a final feature map of the target image. The semantic information of the target image is deeply extracted, and the spatial information of targets of different scales is attended to during feature extraction, so that the detection accuracy of large targets is maintained, the detection accuracy of small targets is improved, and the detection requirements of multi-scale targets are met.


Description

Technical Field

[0001] The present invention relates to the field of computer vision technology, and in particular to a method, apparatus, computing device and computer-readable storage medium for feature extraction.

Background Technology

[0002] Feature extraction plays a crucial role in image processing, such as object detection and image classification. The following section uses object detection as an example to detail the application of feature extraction in object detection.

[0003] Object detection refers to the process by which computers automatically identify objects in an image and locate them within the image. With the continuous development of artificial intelligence, deep learning object detection algorithms are gradually replacing traditional object detection methods. Deep learning object detection algorithms train an object detection network using a large number of sample images. Then, when an image is input into the trained network, it outputs the object detection result for that image.

[0004] Multi-scale object detection has always been a challenge, as a single object detection network cannot simultaneously achieve high accuracy in detecting both large and small targets. For example, objects in an image may have different scales (objects farther from the camera appear at different scales than those closer); the same object may also appear at different scales in different images due to relative motion with the camera. Deep learning in object detection primarily extracts richer image features by repeatedly stacking convolutional layers. As the number of convolutional layers increases, the semantic information in the obtained image features becomes richer, leading to more accurate detection of large targets. However, spatial information is continuously weakened with increasing convolutional layers, resulting in increasingly inaccurate detection of small targets in the image.

[0005] Researchers have proposed a method for multi-scale object detection: utilizing features of different resolutions generated during convolution to detect objects at different scales. For example, high-level features with lower resolution but richer semantic information, obtained from the deeper convolutional layers, and low-level features with higher resolution, obtained from the shallower convolutional layers, are scaled to the same size and then superimposed. The superimposed features are then used in subsequent object detection steps. In this way, high-level features can be used to detect large-scale objects, and low-level features can be used to detect small-scale objects. However, because they pass through only a few convolutional layers, the high-resolution low-level features lack sufficient semantic information. Therefore, the detection accuracy for small objects based on the superimposed features obtained in the above method remains low.

Summary of the Invention

[0006] This invention provides a feature extraction method to improve the detection accuracy of small targets in images.

[0007] In a first aspect, embodiments of the present invention provide a feature extraction method, comprising:

[0008] The target image is processed by convolutional layers to obtain a first feature map; the input feature map and the output feature map of at least one of the multiple convolutional layers have the same size.

[0009] The first feature map is input into multiple first dilated convolutional layers with different dilation coefficients for convolution processing to obtain N second feature maps representing different scales;

[0010] Based on the attention mechanism, N scale ratio coefficients are determined for each of the N second feature maps; wherein, the scale ratio coefficients are used to characterize the degree of influence of the corresponding second feature map on feature extraction;

[0011] Feature extraction is performed based on the N second feature maps and the scale ratio coefficients corresponding to each of the N second feature maps to obtain the final feature map of the target image.

[0012] Multiple convolutional layers are used to process the target image. Since the input and output feature maps of at least one of the convolutional layers are of the same size, the resulting first feature map retains more spatial information. Continuing feature extraction based on this first feature map, which retains more spatial information, improves the detection accuracy of small targets. During further feature extraction, the first feature map is fed into first dilated convolutional layers with different dilation coefficients. These layers generate multiple second feature maps with different receptive fields, meaning they can represent features of targets at different scales. An attention mechanism is used to determine N scale ratio coefficients corresponding to each of the N second feature maps. These coefficients, obtained through the attention mechanism, reflect the degree to which the corresponding second feature map contributes to feature extraction. By extracting these scale ratio coefficients, the focus on different target images can be adaptively adjusted. Finally, feature extraction is performed based on the N second feature maps and their respective scale ratio coefficients to obtain the final feature map. The above method for obtaining the final feature map not only deeply extracts the semantic information of the target image, but also pays attention to the spatial information of targets at different scales during the feature extraction process. It satisfies the detection accuracy of large targets while improving the detection accuracy of small targets, thus taking into account the detection needs of multi-scale targets.

[0013] In some embodiments, the plurality of convolutional layers includes at least one second dilated convolutional layer; the input feature map and the output feature map of the second dilated convolutional layer are of the same size.

[0014] Using a second dilated convolutional layer increases the receptive field, further ensuring that the obtained first feature map retains more spatial information.

[0015] In some embodiments, determining the N scale ratio coefficients corresponding to the N second feature maps based on an attention mechanism includes:

[0016] For any second feature map, the feature vector corresponding to the second feature map is obtained by performing depthwise (DW, channel-by-channel) convolution and pointwise (PW, point-by-point) convolution on the second feature map.

[0017] By using a fully connected layer to extract spatial information weights from the N feature vectors corresponding to the N second feature maps, N scale ratio coefficients corresponding to the N second feature maps are obtained.

[0018] DW convolution can integrate and extract information within each channel, while PW convolution can extract and integrate information between channels. By using fully connected layers to extract spatial information weights from the N feature vectors corresponding to the N second feature maps, N scale ratio coefficients are obtained. In this way, the extraction of scale information is completed with very few parameters.
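The parameter saving mentioned in [0018] can be illustrated with a small count. The following is only a sketch: the channel counts (256 in, 256 out) and the 3×3 kernel are hypothetical values chosen for illustration, biases are ignored, and the function names are not taken from the patent.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dw_pw_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 filters that mix channels
    return depthwise + pointwise


# Hypothetical sizes, for illustration only.
print(standard_conv_params(256, 256, 3))  # 589824
print(dw_pw_params(256, 256, 3))          # 67840
```

Under these assumed sizes, the DW plus PW pair needs roughly one ninth of the weights of a standard convolution with the same kernel size.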

[0019] In some embodiments, performing feature extraction based on the N second feature maps and the scale ratio coefficients corresponding to each of the N second feature maps to obtain the final feature map of the target image includes:

[0020] For any second feature map, the scale ratio coefficient corresponding to the second feature map is used to weight the second feature map to obtain the third feature map corresponding to the second feature map;

[0021] The N third feature maps corresponding to the N second feature maps are concatenated, and the concatenated feature maps are convolved to obtain the final feature map of the target image.

[0022] The second feature map is obtained based on the first feature map. Therefore, the second feature map represents the features of targets at different scales while retaining more spatial information. Processing the second feature map in the above manner not only extracts semantic information in depth but also focuses on the spatial information of targets at different scales during feature extraction, thus improving the detection accuracy of both large and small targets.
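The weighting and concatenation in [0020] and [0021] can be sketched in plain Python. This is a minimal illustration under several assumptions: feature maps are nested lists indexed as channel, row, column; the coefficients 0.25 and 0.75 are made-up values; and the final convolution is taken to be a 1×1 convolution, which the text does not specify. All function names are hypothetical.

```python
def weight_map(feature_map, coeff):
    """Multiply every value of a feature map (channels x rows x cols) by its coefficient."""
    return [[[coeff * v for v in row] for row in channel] for channel in feature_map]


def concat_channels(feature_maps):
    """Concatenate feature maps along the channel axis."""
    return [channel for fm in feature_maps for channel in fm]


def pointwise_conv(feature_map, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o."""
    rows, cols = len(feature_map[0]), len(feature_map[0][0])
    return [
        [[sum(w_o[c] * feature_map[c][r][q] for c in range(len(feature_map)))
          for q in range(cols)] for r in range(rows)]
        for w_o in weights
    ]


# Two single-channel 2x2 "second feature maps" with made-up values.
second_maps = [
    [[[4.0, 8.0], [12.0, 16.0]]],
    [[[4.0, 4.0], [4.0, 4.0]]],
]
coeffs = [0.25, 0.75]  # hypothetical scale ratio coefficients

third_maps = [weight_map(fm, a) for fm, a in zip(second_maps, coeffs)]
stacked = concat_channels(third_maps)               # 2 channels
final_map = pointwise_conv(stacked, [[1.0, 1.0]])   # assumed 1x1 convolution
print(final_map)  # [[[4.0, 5.0], [6.0, 7.0]]]
```

A larger coefficient lets the corresponding scale contribute more to the concatenated features before the final convolution mixes them.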

[0023] In some embodiments, the first feature map is input into multiple first dilated convolutional layers with different dilation coefficients for convolution processing to obtain N second feature maps, including:

[0024] The first feature map is input into N-1 first dilated convolutional layers with different dilation coefficients for convolution processing to obtain N-1 second feature maps;

[0025] A second feature map is obtained by performing average pooling, convolution, and interpolation on the first feature map.

[0026] While dilated convolution can increase the receptive field, the presence of holes may lead to the loss of crucial information. Therefore, this approach not only uses a first dilated convolutional layer to convolve the first feature map but also applies average pooling to it. This compresses the feature map, simplifying the network's computational complexity while preserving key information. Further convolution is then performed to extract more features, and interpolation ensures that the resulting second feature map has the same size as the one obtained through dilated convolution.
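The pooling branch in [0025] and [0026] can be sketched as follows. The sketch assumes that the average pooling is global (one value per channel) and that the convolution is 1×1; the text only says "average pooling, convolution, and interpolation", so both choices are assumptions. Function names and input values are hypothetical.

```python
def global_avg_pool(feature_map):
    """Average each channel (rows x cols) down to a single value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def pointwise(vector, weights):
    """1x1 convolution on a 1x1 map: weights[o][c] mixes channel c into output o."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def upsample_from_1x1(vector, height, width):
    """Interpolating a 1x1 map up to height x width replicates its single value."""
    return [[[v] * width for _ in range(height)] for v in vector]


# A made-up first feature map with 2 channels of size 2x2.
first_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [4.0, 4.0]],
]
pooled = global_avg_pool(first_map)                 # [2.5, 2.0]
mixed = pointwise(pooled, [[1.0, 1.0]])             # [4.5]
branch = upsample_from_1x1(mixed, 2, 2)
print(branch)  # [[[4.5, 4.5], [4.5, 4.5]]]
```

The interpolation step is what gives this branch the same spatial size as the maps produced by the dilated convolutional layers, so that all N second feature maps can later be combined.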

[0027] In some embodiments, multiple convolutional layers are used to convolve the target image to obtain a first feature map, including:

[0028] The target image is input into the first layer for feature extraction to obtain the fourth feature map; the first layer includes at least one non-dilated convolutional layer;

[0029] The fourth feature map is processed by using the second and third layers to obtain the fifth feature map; the second layer includes at least one first residual structure; the third layer includes at least one first residual structure; the first residual structure includes at least one non-dilated convolutional layer.

[0030] The fifth feature map is subjected to feature extraction using the fourth and fifth layers to obtain the first feature map; the fourth layer includes at least one second residual structure; the fifth layer includes at least one second residual structure; the second residual structure includes at least one second dilated convolutional layer.

[0031] If the first layer used dilated convolution, the extracted semantic information might not be rich enough, which would affect the extraction of subsequent semantic information. Therefore, non-dilated convolution is used in the first layer so that deeper semantic information can be extracted. The second and third layers use a first residual structure for feature extraction. This first residual structure includes at least one non-dilated convolutional layer, which can extract deeper semantic information while reducing the loss of spatial information. The fourth and fifth layers use a second residual structure for feature extraction. Since the second residual structure includes at least one second dilated convolutional layer, it increases the receptive field while continuing to extract semantic information, further ensuring that the obtained first feature map retains more spatial information.

[0032] In some embodiments, the stride of the second dilated convolutional layer is 1.

[0033] The larger the stride, the easier it is to lose information during convolution. Therefore, in order to preserve as much spatial information as possible, the stride is set to 1.

[0034] In a second aspect, embodiments of the present invention also provide a feature extraction apparatus, comprising:

[0035] a processing unit, configured to:

[0036] The target image is processed by convolutional layers to obtain a first feature map; the input feature map and the output feature map of at least one of the multiple convolutional layers have the same size.

[0037] The first feature map is input into multiple first dilated convolutional layers with different dilation coefficients for convolution processing to obtain N second feature maps representing different scales;

[0038] Based on the attention mechanism, N scale ratio coefficients are determined for each of the N second feature maps; wherein, the scale ratio coefficients are used to characterize the degree of influence of the corresponding second feature map on feature extraction;

[0039] Feature extraction is performed based on the N second feature maps and the scale ratio coefficients corresponding to each of the N second feature maps to obtain the final feature map of the target image.

[0040] Optionally, the plurality of convolutional layers includes at least one second dilated convolutional layer; the input feature map and the output feature map of the second dilated convolutional layer have the same size.

[0041] Optionally, the processing unit is specifically used for:

[0042] For any second feature map, the feature vector corresponding to the second feature map is obtained by performing depthwise (DW, channel-by-channel) convolution and pointwise (PW, point-by-point) convolution on the second feature map.

[0043] By using a fully connected layer to extract spatial information weights from the N feature vectors corresponding to the N second feature maps, N scale ratio coefficients corresponding to the N second feature maps are obtained.

[0044] Optionally, the processing unit is specifically used for:

[0045] For any second feature map, the scale ratio coefficient corresponding to the second feature map is used to weight the second feature map to obtain the third feature map corresponding to the second feature map;

[0046] The N third feature maps corresponding to the N second feature maps are concatenated, and the concatenated feature maps are convolved to obtain the final feature map of the target image.

[0047] Optionally, the processing unit is specifically used for:

[0048] The first feature map is input into N-1 first dilated convolutional layers with different dilation coefficients for convolution processing to obtain N-1 second feature maps;

[0049] A second feature map is obtained by performing average pooling, convolution, and interpolation on the first feature map.

[0050] Optionally, the processing unit is specifically used for:

[0051] The target image is input into the first layer for feature extraction to obtain the fourth feature map; the first layer includes at least one non-dilated convolutional layer;

[0052] The fourth feature map is processed by using the second and third layers to obtain the fifth feature map; the second layer includes at least one first residual structure; the third layer includes at least one first residual structure; the first residual structure includes at least one non-dilated convolutional layer.

[0053] The fifth feature map is subjected to feature extraction using the fourth and fifth layers to obtain the first feature map; the fourth layer includes at least one second residual structure; the fifth layer includes at least one second residual structure; the second residual structure includes at least one second dilated convolutional layer.

[0054] Optionally, the stride of the second dilated convolutional layer is 1.

[0055] In a third aspect, embodiments of the present invention also provide a computing device, comprising:

[0056] a memory, configured to store a computer program;

[0057] a processor, configured to invoke the computer program stored in the memory and, according to the obtained program, execute any of the feature extraction methods described above.

[0058] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing a computer-executable program for causing a computer to perform any of the feature extraction methods described above.

Brief Description of the Drawings

[0059] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0060] Figure 1 is a schematic diagram of a feature extraction method provided in an embodiment of the present invention;

[0061] Figure 2 is a schematic diagram of an existing ResNet50 structure provided in an embodiment of the present invention;

[0062] Figure 3a is a schematic diagram of a conventional residual structure bottleneck2 provided in an embodiment of the present invention;

[0063] Figure 3b is a schematic diagram of a conventional residual structure bottleneck1 provided in an embodiment of the present invention;

[0064] Figure 4 is a schematic diagram of a conventional convolutional layer and a dilated convolutional layer provided in an embodiment of the present invention;

[0065] Figure 5 is a schematic diagram illustrating how N second feature maps are obtained, as provided in an embodiment of the present invention;

[0066] Figure 6a is a schematic diagram of an improved residual structure bottleneck2 provided in an embodiment of the present invention;

[0067] Figure 6b is a schematic diagram of an improved residual structure bottleneck1 provided in an embodiment of the present invention;

[0068] Figure 7 is a schematic diagram illustrating how a first feature map is obtained, as provided in an embodiment of the present invention;

[0069] Figure 8 is a schematic diagram illustrating how N second feature maps are obtained, as provided in an embodiment of the present invention;

[0070] Figure 9a is a schematic diagram of a DW convolution provided in an embodiment of the present invention;

[0071] Figure 9b is a schematic diagram of a PW convolution provided in an embodiment of the present invention;

[0072] Figure 10 is a schematic diagram illustrating how five second feature maps are processed to obtain five scale ratio coefficients, as provided in an embodiment of the present invention;

[0073] Figure 11 is a schematic diagram illustrating how a third feature map is obtained, as provided in an embodiment of the present invention;

[0074] Figure 12 is a schematic diagram illustrating how a final feature map is obtained, as provided in an embodiment of the present invention;

[0075] Figure 13 is a schematic diagram illustrating target detection using the final feature map, as provided in an embodiment of the present invention;

[0076] Figure 14 is a schematic diagram of a feature extraction device provided in an embodiment of the present invention;

[0077] Figure 15 is a schematic diagram of the structure of a computer device provided in an embodiment of the present invention.

Detailed Implementation

[0078] To make the objectives, implementation methods and advantages of this application clearer, the exemplary implementation methods of this application will be clearly and completely described below with reference to the accompanying drawings of the exemplary embodiments of this application. Obviously, the described exemplary embodiments are only some embodiments of this application, and not all embodiments.

[0079] Based on the exemplary embodiments described in this application, all other embodiments obtained by those skilled in the art without inventive effort are within the scope of protection of the appended claims. Furthermore, although the disclosures in this application are presented by way of one or more exemplary examples, it should be understood that each aspect of these disclosures can also constitute a complete implementation on its own.

[0080] It should be noted that the brief descriptions of terms in this application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of this application. Unless otherwise stated, these terms should be understood in their ordinary and common meaning.

[0081] The terms "first," "second," "third," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar or related objects or entities and do not necessarily imply a specific order or sequence, unless otherwise indicated. It should be understood that such terms can be used interchangeably where appropriate, for example, to implement the application in a sequence other than those given in the embodiments illustrated or described herein.

[0082] Furthermore, the terms “comprising” and “having”, and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a product or device that includes a series of components is not necessarily limited to the components that are explicitly listed, but may include other components that are not explicitly listed or that are inherent to such product or device.

[0083] With the continuous development of artificial intelligence, deep learning object detection algorithms are gradually replacing traditional object detection methods. Deep learning object detection algorithms train an object detection network using a large number of sample images. Then, an image is input into the trained network, and the network outputs the result of object detection for that image.

[0084] Deep learning object detection algorithms are mainly divided into two categories: The first category is candidate region-based object detection algorithms. These methods consist of two stages: first, generating bounding boxes of varying sizes on the original image, and then performing classification prediction and bounding box regression prediction on the features corresponding to each box. Examples include Region-based Convolutional Neural Networks (RCNN), Fast Region-based Convolutional Neural Networks (Fast-RCNN), and Faster Region-based Convolutional Neural Networks (Faster-RCNN). The second category is regression-based object detection algorithms. These algorithms eliminate the step of generating bounding boxes on the original image, directly using a single network for object classification prediction and bounding box regression. Representative methods include Single Shot MultiBox Detector (SSD) and YOLO. By reducing the region generation network, they significantly save time and achieve near real-time speed.

[0085] Multi-scale object detection has always been a challenge, as a single object detection network cannot simultaneously achieve high accuracy in detecting both large and small targets. For example, objects in an image may have different scales (objects farther from the camera appear at different scales than those closer); the same object may also appear at different scales in different images due to relative motion with the camera. Deep learning in object detection primarily extracts richer image features by repeatedly stacking convolutional layers. As the number of convolutional layers increases, the semantic information in the obtained image features becomes richer, leading to more accurate detection of large targets. However, spatial information is continuously weakened with increasing convolutional layers, resulting in increasingly inaccurate detection of small targets in the image.

[0086] Researchers have proposed a method for multi-scale object detection: utilizing features of different resolutions generated during convolution to detect objects at different scales. For example, high-level features with lower resolution but richer semantic information, obtained from the deeper convolutional layers, and low-level features with higher resolution, obtained from the shallower convolutional layers, are scaled to the same size and then superimposed. The superimposed features are then used in subsequent object detection steps. In this way, high-level features can be used to detect large-scale objects, and low-level features can be used to detect small-scale objects. However, because they pass through only a few convolutional layers, the high-resolution low-level features lack sufficient semantic information. Therefore, the detection accuracy for small objects based on the superimposed features obtained in the above method remains low.

[0087] To address the aforementioned problems, embodiments of the present invention provide a feature extraction method that can balance the detection accuracy of targets at different scales. As shown in Figure 1, the method includes:

[0088] Step 101: Perform convolution processing on the target image using multiple convolutional layers to obtain a first feature map; the input feature map and output feature map of at least one of the multiple convolutional layers have the same size.

[0089] Step 102: Input the first feature map into multiple first dilated convolutional layers with different dilation coefficients for convolution processing to obtain N second feature maps representing different scales.

[0090] Step 103: Determine N scale ratio coefficients corresponding to the N second feature maps based on the attention mechanism; wherein the scale ratio coefficients are used to characterize the degree of influence of the corresponding second feature map on feature extraction.

[0091] Step 104: Based on the N second feature maps and the scale ratio coefficients corresponding to the N second feature maps, feature extraction is performed to obtain the final feature map of the target image.
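Step 102 can be illustrated in one dimension. The sketch below applies the same three-tap kernel with two different dilation coefficients, using stride 1 and zero padding so that the output keeps the input length. The kernel values, the dilation coefficients (1 and 2) and the function names are hypothetical, and a real implementation would operate on two-dimensional multi-channel feature maps.

```python
def dilated_conv1d(x, kernel, dilation):
    """1-D dilated convolution with stride 1 and zero padding that preserves length.

    Assumes an odd kernel length so that the padding is symmetric.
    """
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ]


def receptive_field(k, dilation):
    """Span of input positions covered by one output value."""
    return dilation * (k - 1) + 1


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [1.0, 1.0, 1.0]

print(dilated_conv1d(signal, kernel, 1))  # [3.0, 6.0, 9.0, 12.0, 15.0, 18.0, 21.0, 15.0]
print(dilated_conv1d(signal, kernel, 2))  # [4.0, 6.0, 9.0, 12.0, 15.0, 18.0, 12.0, 14.0]
print(receptive_field(3, 1), receptive_field(3, 2))  # 3 5
```

Both outputs have the same length as the input, but each value in the second one draws on a wider span of the input, which is why branches with different dilation coefficients can represent targets at different scales.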

[0092] The present invention improves the existing feature extraction network (hereinafter referred to as the first feature extraction network) to obtain a new feature extraction network (hereinafter referred to as the second feature extraction network), which is used to perform the feature extraction method provided in the present invention.

[0093] In step 101, the target image is input into the second feature extraction network provided in this embodiment of the invention. The first few layers of the second feature extraction network are convolutional layers. These convolutional layers are used to perform convolution processing on the target image in order to extract more information from the target image.

[0094] The first feature extraction network provided in this embodiment of the invention can be any neural network used for feature extraction, such as Visual Geometry Group Network (VGG) or Residual Neural Network (ResNet). Among them, ResNet can be various residual neural networks such as ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152. The above are just examples, and this embodiment of the invention does not limit the scope of the invention.

[0095] This invention improves any of the first feature extraction networks mentioned above to obtain a second feature extraction network. Therefore, the basic details of the multiple convolutional layers mentioned in step 101, such as the number of convolutional layers, the size of the convolutional kernel in each layer, the stride, etc., can all be set with reference to the first few convolutional layers in any of the first feature extraction networks mentioned above. The difference is that this invention imposes the following restriction on the multiple convolutional layers: the input feature map and the output feature map of at least one of the multiple convolutional layers must have the same size.

[0096] The following section uses Faster-RCNN as an example to detail the scheme of step 101. In Faster-RCNN, the feature extraction network used is ResNet50, that is, the first feature extraction network is ResNet50.

[0097] Figure 2 shows a schematic diagram of the existing ResNet50 architecture. The first layer of this first feature extraction network includes a conventional convolutional layer with a 7×7 kernel, a stride (s) of 2, and padding (p) of 3. After processing by the first layer, a feature map is obtained whose size is halved and whose dimension (number of channels) is increased to 64. A max-pooling layer is then applied, producing a feature map whose size is halved again while its dimension remains unchanged. The second layer of this first feature extraction network includes three residual structures bottleneck2. The detailed structure of the residual structure bottleneck2 is shown in Figure 3a. As shown in Figure 3a, the residual structure bottleneck2 includes three conventional convolutional layers; the figure gives the kernel size, stride, and number of kernels (c) for each of these three layers. Because the middle convolutional layer uses a stride of 1 and has padding, the input and output feature maps of the residual structure bottleneck2 are guaranteed to have the same size. In the first feature extraction network of Figure 2, the second layer therefore yields a feature map of the same size whose dimension is increased fourfold. The third layer of this first feature extraction network includes one residual structure bottleneck1 and three residual structures bottleneck2. The detailed structure of the residual structure bottleneck1 is shown in Figure 3b. As shown in Figure 3b, the main branch of the residual structure bottleneck1 includes three conventional convolutional layers, and the side branch includes one conventional convolutional layer; the figure gives the kernel size, stride, and number of kernels (c) of these four conventional convolutional layers. Because the middle convolutional layer uses a stride of 2 and has padding, the output feature map of the residual structure bottleneck1 is half the size of the input feature map. In the first feature extraction network of Figure 2, the third layer therefore yields a feature map whose size is halved and whose dimension is doubled. The fourth layer of this first feature extraction network includes one residual structure bottleneck1 and five residual structures bottleneck2; it yields a feature map whose size is halved and whose dimension is doubled. The fifth layer of the first feature extraction network includes one residual structure bottleneck1 and two residual structures bottleneck2; it likewise yields a feature map whose size is halved and whose dimension is doubled. In the deep learning object detection algorithm Faster R-CNN, object detection is performed based on this final feature map.
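The size and dimension changes described above can be traced with a short sketch. The 480×480 input size and the helper name `trace_resnet50` are assumptions chosen for illustration; the per-stage halving and the channel counts follow the description of Figure 2.

```python
# Each entry: (stage, factor by which the spatial size is divided, output channels),
# following the description of the unmodified ResNet50 in Figure 2.
STAGES = [
    ("layer 1: 7x7 conv, s=2, p=3", 2, 64),
    ("max pooling", 2, 64),
    ("layer 2: 3 x bottleneck2", 1, 256),
    ("layer 3: bottleneck1 + 3 x bottleneck2", 2, 512),
    ("layer 4: bottleneck1 + 5 x bottleneck2", 2, 1024),
    ("layer 5: bottleneck1 + 2 x bottleneck2", 2, 2048),
]


def trace_resnet50(input_size):
    """Return (stage, spatial size, channels) after every stage."""
    size = input_size
    shapes = []
    for stage, divisor, channels in STAGES:
        size //= divisor
        shapes.append((stage, size, channels))
    return shapes


if __name__ == "__main__":
    # 480 is a hypothetical input size chosen for illustration.
    for stage, size, channels in trace_resnet50(480):
        print(f"{stage:42s} -> {size}x{size}x{channels}")
```

Under these assumptions the final feature map is 32 times smaller than the input along each side, which is the loss of spatial resolution that the embodiments aim to reduce.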

[0098] In summary: the residual structure bottleneck1 is responsible for halving the feature map size and increasing the feature layer dimension, i.e., increasing the network dimension; bottleneck2 is responsible for extracting deeper semantic information without changing the feature map size, i.e., increasing the network depth.

[0099] In the embodiments of this invention, a conventional convolutional layer refers to a non-dilated convolutional layer, as opposed to a dilated convolutional layer.

[0100] Equation 1 gives the relationship between the sizes of the input and output feature maps when a conventional convolutional layer is used for convolution processing, where out is the size of the output feature map; in is the size of the input feature map; k is the size of the convolution kernel; padding is the amount of padding; and stride is the stride. It can be seen that by adjusting the relationship between k, padding, and stride, the input and output feature maps can be made to have the same size.

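[0101] $$\mathrm{out} = \left\lfloor \frac{\mathrm{in} + 2 \cdot \mathrm{padding} - k}{\mathrm{stride}} \right\rfloor + 1 \qquad \text{(Equation 1)}$$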

[0102] Based on Equation 1, various improvements that can be made to the ResNet50 first feature extraction network in the embodiments of the present invention can be determined. For example: (1) The residual structures in any of the third, fourth, and fifth layers are all set to the residual structure bottleneck2, because bottleneck2 keeps the size of its input and output feature maps unchanged. In this way, the input feature map and the output feature map of at least one of the above convolutional layers have the same size. (2) The stride of the middle convolutional layer of the residual structure bottleneck1 in any of the third, fourth, and fifth layers is set to 1; with the help of the padding p, the input and output feature maps then also keep the same size. In this way, the input feature map and the output feature map of at least one of the above convolutional layers have the same size.
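A minimal sketch of the relationship that Equation 1 describes, assuming the standard floor-based form of the convolution output size. The function name `conv_out_size`, the 3×3 kernel with padding 1, and the input size of 120 are illustrative assumptions. The sketch shows that a stride of 2 halves the feature map, whereas a stride of 1 with suitable padding keeps the size unchanged, which is the basis of improvements (1) and (2).

```python
def conv_out_size(in_size, k, padding, stride):
    """Output size of a conventional convolution (standard formula, assumed here)."""
    return (in_size + 2 * padding - k) // stride + 1


if __name__ == "__main__":
    size = 120  # hypothetical input feature map size
    # Middle layer of bottleneck1 with its original stride of 2
    print(conv_out_size(size, k=3, padding=1, stride=2))
    # Same layer with the stride changed to 1, as in improvement (2)
    print(conv_out_size(size, k=3, padding=1, stride=1))
```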

[0103] The improved second feature extraction network still has five layers. After feature extraction by the second feature extraction network, the first feature map is obtained.

[0104] The aforementioned improvements to the feature extraction network's multiple convolutional layers ensure that the input and output feature maps of at least one convolutional layer have the same size, thus allowing the resulting first feature map to retain more spatial information. In contrast, the unmodified first feature extraction network performs multiple convolutions on the target image, continuously reducing its size and resolution, compressing spatial information, and potentially causing small targets to disappear. This undoubtedly reduces the accuracy of subsequent small target detection.

[0105] In step 102, the first feature map is input into multiple first dilated convolutional layers with different dilation coefficients for convolution processing to obtain N second feature maps representing different scales.

[0106] The first feature map retains a significant amount of spatial information, which may include both large-scale targets (hereinafter referred to as large targets) and small-scale targets (hereinafter referred to as small targets). To further extract more useful information, feature extraction will continue on the first feature map. However, if traditional convolutional layers are used for convolution, a considerable amount of spatial information will still be lost.

[0107] Therefore, this embodiment of the invention uses multiple first dilated convolutional layers with different dilation coefficients to perform convolution processing on the first feature map. This embodiment does not limit the number of first dilated convolutional layers (for example, 3, 4, or 5), the size or number of their convolution kernels, or their stride. Nor does it limit the dilation coefficient of each first dilated convolutional layer, which can be, for example, 2, 4, 8, 10…

[0108] Equation 2 illustrates the relationship between the sizes of the input and output feature maps when using dilated convolutional layers. Here, r is the dilation coefficient. It can be seen that by adjusting the relationship between r, k, padding, and stride, the same size can be achieved for both the input and output feature maps.

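[0109] $$\mathrm{out} = \left\lfloor \frac{\mathrm{in} + 2 \cdot \mathrm{padding} - r\,(k - 1) - 1}{\mathrm{stride}} \right\rfloor + 1 \qquad \text{(Equation 2)}$$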

[0110] However, dilated convolutional layers have a larger receptive field than conventional convolutional layers, which further helps preserve spatial information. Figure 4 shows a schematic diagram of a conventional convolutional layer and a dilated convolutional layer. In both cases the convolution kernel is 3×3, but the dilated convolutional layer has a larger receptive field because of the holes. The number of holes is r − 1; the dilated convolutional layer shown in Figure 4 therefore has one hole and a dilation coefficient of 2.
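The effect of the holes on the receptive field can be illustrated with a small sketch. The helper names are hypothetical; the sketch assumes that r − 1 holes are inserted between adjacent kernel elements, as stated above, so that a k×k kernel spans k + (k − 1)(r − 1) positions along each side.

```python
def tap_offsets(k, r):
    """Offsets (relative to the kernel centre) sampled by a dilated kernel of odd size k."""
    half = k // 2
    return [(i - half) * r for i in range(k)]


def effective_kernel_size(k, r):
    """Side length covered by a k x k kernel with r - 1 holes between adjacent elements."""
    return k + (k - 1) * (r - 1)


if __name__ == "__main__":
    for r in (1, 2, 4):
        print(r, tap_offsets(3, r), effective_kernel_size(3, r))
```

With r = 1 the kernel reduces to a conventional 3×3 kernel; with r = 2 the same nine weights cover a 5×5 area.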

[0111] The embodiments of this invention do not limit the stride. For example, it can be 1, 2, etc. As the stride increases, it is easier to lose information during convolution. Therefore, in order to ensure that as much spatial information as possible is preserved, the stride is generally set to 1.

[0112] If the stride is 1 and the kernel size is 3×3, i.e., k=3, then when r=padding, the size of the input feature map and the size of the output feature map are the same.
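The condition r = padding can be checked numerically. The sketch below assumes the standard floor-based form of the dilated convolution output size for Equation 2; the function name and the input size of 60 are illustrative assumptions.

```python
def dilated_conv_out_size(in_size, k, padding, stride, r):
    """Output size of a dilated convolution (standard formula, assumed here)."""
    effective_k = r * (k - 1) + 1  # extent of the kernel including holes
    return (in_size + 2 * padding - effective_k) // stride + 1


if __name__ == "__main__":
    for r in (2, 4, 8, 16, 24):
        # stride 1, k = 3, padding = r: the input size is preserved
        print(r, dilated_conv_out_size(60, k=3, padding=r, stride=1, r=r))
```

Setting r = 1 in this function gives the conventional convolution case, so the two size relationships stay consistent with each other.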

[0113] Figure 5 shows a schematic diagram of one possible way of obtaining the N second feature maps. The first feature map is input into four first dilated convolutional layers with dilation coefficients of 4, 8, 16, and 24 for convolution processing to obtain four second feature maps.

[0114] The size of the second feature map remains the same as that of the first feature map, thus preserving more spatial information.
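A minimal single-channel sketch of this multi-branch arrangement in plain Python, assuming a 3×3 kernel, a stride of 1, and zero padding equal to the dilation coefficient. The function names, the shared all-ones kernel, and the 32×32 test image are illustrative assumptions rather than part of the embodiment.

```python
def dilated_conv2d(image, kernel, r):
    """Single-channel dilated convolution, stride 1, zero padding of r * (k // 2)."""
    h, w, k = len(image), len(image[0]), len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + (i - half) * r
                    xx = x + (j - half) * r
                    # Positions outside the image behave as zero padding
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def second_feature_maps(image, kernel, rates=(4, 8, 16, 24)):
    """One output map per dilation coefficient, all the same size as the input."""
    return [dilated_conv2d(image, kernel, r) for r in rates]


if __name__ == "__main__":
    image = [[0.0] * 32 for _ in range(32)]
    image[16][16] = 1.0  # a single bright pixel
    ones = [[1.0] * 3 for _ in range(3)]
    for r, fmap in zip((4, 8, 16, 24), second_feature_maps(image, ones)):
        print(r, len(fmap), len(fmap[0]))
```

In a real network each branch would have its own learned multi-channel kernels; a shared kernel is used here only to keep the sketch short.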

[0115] For example, in Figure 5, first dilated convolutional layers with different dilation coefficients have different receptive fields. The first dilated convolutional layer with a dilation coefficient of 4 is better at extracting features of small targets; the first dilated convolutional layer with a dilation coefficient of 24 is better at extracting features of large targets. The more dilation coefficients are set and the more widely they are spread, the more target scales can be covered during feature extraction.

[0116] Furthermore, since the first feature map retains more spatial information, continuing feature extraction on it helps improve the detection accuracy of small targets. When feature extraction continues, the first feature map is input into first dilated convolutional layers with different dilation coefficients for convolution processing. These layers generate multiple second feature maps with different receptive fields; in other words, the multiple second feature maps can represent the features of targets at different scales.

[0117] In step 103, N scale ratio coefficients are determined for each of the N second feature maps based on an attention mechanism. These scale ratio coefficients reflect the degree to which the corresponding second feature map contributes to feature extraction.

[0118] Each second feature map is convolved, and spatial information weights are then extracted from the feature vectors obtained after each convolution, thus obtaining the scale ratio coefficient corresponding to each second feature map. By extracting the scale ratio coefficients, the focus can be adaptively adjusted for different target images. For example, if a target image contains only small targets, the scale ratio coefficient extraction determines that small targets should be the focus when extracting features from that target image, and the second feature maps suited to small targets receive larger scale ratio coefficients. Conversely, if a target image contains many large targets, the scale ratio coefficient extraction determines that large targets should be the focus, and the second feature maps suited to large targets receive larger scale ratio coefficients. In this way, the second feature extraction network provided in this embodiment of the invention can adaptively adjust the focus for different target images.

[0119] In step 104, feature extraction is performed based on the N second feature maps and the scale ratio coefficients corresponding to each of the N second feature maps to obtain the final feature map of the target image.

[0120] The scale ratio coefficient corresponding to each second feature map reflects the degree to which that second feature map contributes to feature extraction. For example, if the scale ratio coefficient of a certain second feature map is small, it means that the target at the scale corresponding to that second feature map occupies a small proportion in the target image and should not occupy a large proportion in the final feature map.

[0121] Each second feature map is fused with its respective scale coefficient using a certain operation, and the fused feature map is directly used as the final feature map. Alternatively, the fused feature map can be convolved again to extract more information and obtain the final feature map. This embodiment of the invention does not limit this approach. The "certain operation" here can be multiplication or other operations.

[0122] Multiple convolutional layers are used to process the target image. Since the input and output feature maps of at least one of the convolutional layers are of the same size, the resulting first feature map retains more spatial information. Continuing feature extraction based on this first feature map, which retains more spatial information, improves the detection accuracy of small targets. During further feature extraction, the first feature map is fed into first dilated convolutional layers with different dilation coefficients. These layers generate multiple second feature maps with different receptive fields, meaning they can represent features of targets at different scales. An attention mechanism is used to determine N scale ratio coefficients corresponding to each of the N second feature maps. These coefficients, obtained through the attention mechanism, reflect the degree to which the corresponding second feature map contributes to feature extraction. By extracting these scale ratio coefficients, the focus on different target images can be adaptively adjusted. Finally, feature extraction is performed based on the N second feature maps and their respective scale ratio coefficients to obtain the final feature map. The above method for obtaining the final feature map not only deeply extracts the semantic information of the target image, but also pays attention to the spatial information of targets at different scales during the feature extraction process. It satisfies the detection accuracy of large targets while improving the detection accuracy of small targets, thus taking into account the detection needs of multi-scale targets.

[0123] In some embodiments, the plurality of convolutional layers involved in step 101 include at least one second dilated convolutional layer; the input feature map and the output feature map of the second dilated convolutional layer have the same size.

[0124] That is, instead of using a traditional convolutional layer, a dilated convolutional layer with a larger receptive field is used when obtaining the first feature map. As shown in Equation 2, by designing the relationship between k, r, padding, and stride, the input and output feature maps of this dilated convolutional layer can be made to have the same size.

[0125] The embodiments of this invention do not limit the stride. For example, it can be 1, 2, etc. As the stride increases, it is easier to lose information during convolution. Therefore, in order to ensure that as much spatial information as possible is preserved, the stride is generally set to 1.

[0126] If the stride is 1 and the kernel size is 3×3, i.e., k=3, then when r=padding, the size of the input feature map and the size of the output feature map are the same.

[0127] The following is a detailed description of the method for determining the first feature map in step 101 through a specific embodiment.

[0128] Figure 6a shows the improved residual structure bottleneck2, and Figure 6b shows the improved residual structure bottleneck1. As shown in Figure 6a, the middle convolutional layer is a second dilated convolutional layer with a dilation coefficient of r. Using a second dilated convolutional layer increases the receptive field, which helps extract semantic information while preserving more spatial information. As shown in Figure 6b, the middle convolutional layer is likewise a second dilated convolutional layer with a dilation coefficient of r, and its stride is set to 1. The dilation coefficients of the improved residual structure bottleneck1 and the improved residual structure bottleneck2 can be designed by those skilled in the art based on experience and requirements; they can be the same or different, and the embodiments of the present invention impose no restriction on this.

[0129] For ease of description, in this embodiment of the invention, the unmodified residual structures bottleneck1 and bottleneck2 are referred to as the first residual structure, and the modified residual structures bottleneck1 and bottleneck2 are referred to as the second residual structure. The first residual structure includes at least one non-dilated convolutional layer; the second residual structure includes at least one second dilated convolutional layer.

[0130] It is worth noting that the difference between the first and second dilated convolutional layers lies only in their application. The dilated convolutional layer used in step 101 is the second dilated convolutional layer, while the dilated convolutional layer used in step 102 is the first dilated convolutional layer. The dilation coefficient, kernel size, number of kernels, stride, and padding parameters can be the same or different for both; there are no restrictions on this. Furthermore, while each dilated convolutional layer used in step 101 is referred to as a second dilated convolutional layer, this does not mean that the dilation coefficient, kernel size, number of kernels, stride, and padding parameters are the same for each layer; these parameters can be freely designed. Similarly, each dilated convolutional layer used in step 102 is referred to as a first dilated convolutional layer, but this does not mean that the dilation coefficient, kernel size, number of kernels, stride, and padding parameters are the same for each layer; these parameters can be freely designed.

[0131] Figure 7 shows a network structure for obtaining the first feature map, based on an improvement of ResNet50. As shown in the figure, the first layer of this feature extraction network includes a conventional convolutional layer with a 7×7 kernel, a stride (s) of 2, and padding (p) of 3. After processing by the first layer, a fourth feature map is obtained, with its size halved and its dimension increased to 64. A max-pooling layer is then applied, producing a feature map whose size is halved again while its dimension remains unchanged. The second layer of this feature extraction network includes at least one first residual structure, namely the unimproved residual structure bottleneck2, and yields a feature map whose size is unchanged and whose dimension is increased fourfold. The third layer of this feature extraction network includes at least one first residual structure, namely the unimproved residual structures bottleneck1 and bottleneck2, and yields a fifth feature map whose size is halved and whose dimension is doubled. Up to this point, the structure in Figure 7 is unchanged from the original ResNet50; only the fourth and fifth layers are improved.

[0132] The fourth layer of this feature extraction network includes at least one second residual structure, namely, the improved residual structures bottleneck1 and bottleneck2. This yields a feature map with the same size but twice the original dimension. The fifth layer of this feature extraction network also includes at least one second residual structure, namely, the improved residual structures bottleneck1 and bottleneck2. This yields a first feature map with the same size but twice the original dimension.

[0133] As can be seen, the size of the first feature map does not change in the last two layers, so more spatial information can be preserved. Furthermore, the use of dilated convolutional layers increases the receptive field, which also helps to preserve more spatial information.

[0134] In the fourth layer, the dilation coefficient of the second residual structure bottleneck1 is set to 1, and the dilation coefficient of the second residual structure bottleneck2 is set to 2; in the fifth layer, the dilation coefficient of the second residual structure bottleneck1 is set to 2, and the dilation coefficient of the second residual structure bottleneck2 is set to 2. The dilation coefficients can be freely set by those skilled in the art, and there are no restrictions on them.
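The sizes produced by the network of Figure 7 can be traced with the following sketch. Several values are assumptions made for illustration: the 480×480 input, the 3×3 max pooling with a stride of 2 and padding of 1, the 3×3 middle convolution whose padding equals its dilation coefficient, and the block counts per layer (taken from the unmodified ResNet50 described earlier, since this embodiment only requires at least one residual structure per layer).

```python
def conv_out_size(in_size, k, stride, padding, r=1):
    """Output size of a (dilated) convolution; r = 1 is a conventional convolution."""
    return (in_size + 2 * padding - (r * (k - 1) + 1)) // stride + 1


# (layer, [(stride, dilation) of the 3x3 middle conv of each residual block], channels)
LAYERS = [
    ("layer 2", [(1, 1)] * 3, 256),              # first residual structures
    ("layer 3", [(2, 1)] + [(1, 1)] * 3, 512),   # first residual structures
    ("layer 4", [(1, 1)] + [(1, 2)] * 5, 1024),  # second residual structures
    ("layer 5", [(1, 2)] + [(1, 2)] * 2, 2048),  # second residual structures
]


def trace_improved_backbone(input_size):
    """Return (stage, spatial size, channels) after the stem and after each layer."""
    size = conv_out_size(input_size, k=7, stride=2, padding=3)  # layer 1
    size = conv_out_size(size, k=3, stride=2, padding=1)        # max pooling (assumed 3x3)
    shapes = [("layer 1 + max pooling", size, 64)]
    for name, blocks, channels in LAYERS:
        for stride, r in blocks:
            size = conv_out_size(size, k=3, stride=stride, padding=r, r=r)
        shapes.append((name, size, channels))
    return shapes


if __name__ == "__main__":
    for name, size, channels in trace_improved_backbone(480):
        print(f"{name:24s} -> {size}x{size}x{channels}")
```

With a 480×480 input, this trace ends at 60×60×2048, which is consistent with the first feature map size used in the example of Figure 8.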

[0135] Of course, the embodiments of the present invention are not limited to performing feature extraction in the manner shown in Figure 7. Alternatively, only the fifth layer can use the second residual structure, while the remaining layers remain unchanged. Or, the third, fourth, and fifth layers can all use the second residual structure, while the remaining layers remain unchanged. These are merely examples.

[0136] If the first layer used dilated convolution, the extracted semantic information might not be rich enough, which would affect the extraction of subsequent semantic information. Therefore, the first layer uses non-dilated convolution, which can extract deeper semantic information. The second and third layers use a first residual structure for feature extraction. The first residual structure includes at least one non-dilated convolutional layer, which can extract deeper semantic information while reducing the loss of spatial information. The fourth and fifth layers use a second residual structure for feature extraction. Since the second residual structure includes at least one second dilated convolutional layer, it increases the receptive field while continuing to extract semantic information, further ensuring that the obtained first feature map retains more spatial information.

[0137] The following is a detailed description of the method for obtaining N second feature maps in step 102 through a specific embodiment. Specifically, it includes: inputting the first feature map into N-1 first dilated convolutional layers with different dilation coefficients for convolution processing to obtain N-1 second feature maps; and obtaining 1 second feature map by performing average pooling, convolution processing, and interpolation processing on the first feature map.

[0138] While dilated convolution can increase the receptive field, the presence of holes may lead to the loss of crucial information. Therefore, this approach not only uses a first dilated convolutional layer to convolve the first feature map but also applies average pooling to it. This compresses the feature map, simplifying the network's computational complexity while preserving key information. Further convolution is then performed to extract more features, and interpolation ensures that the resulting second feature map has the same size as the one obtained through dilated convolution.

[0139] Figure 8 shows a flowchart of obtaining the N second feature maps. As shown, the first feature map of size 60×60×2048 obtained in step 101 is input into four dilated convolutional layers with dilation coefficients of 4, 8, 16, and 24 for convolution processing, yielding four second feature maps. For these layers, the convolution kernel size is 3, the padding p equals the dilation coefficient, and the stride is 1. In addition, adaptive average pooling with an output size of 1 compresses the first feature map, reducing the network's computational complexity while retaining key information. Convolution processing is then performed to extract further features, and bilinear interpolation ensures that the resulting second feature map has the same size as the second feature maps obtained through dilated convolution.
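The pooling branch can be sketched in plain Python as follows. Feature maps are represented as nested lists indexed [channel][row][column]; the function names and the weight layout are illustrative assumptions, and the convolution after pooling is modelled as a 1×1 convolution, which is also an assumption. Because adaptive average pooling with an output size of 1 leaves a single value per channel, bilinear interpolation back to the original size produces a constant plane per channel.

```python
def global_avg_pool(fmap):
    """Adaptive average pooling with output size 1: one mean value per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]


def pointwise(vec, weights):
    """1x1 convolution on a 1x1 map, i.e. a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def upsample_constant(vec, h, w):
    """Interpolating a 1x1 map up to h x w yields a constant plane per channel."""
    return [[[v] * w for _ in range(h)] for v in vec]


def pooling_branch(fmap, weights):
    """Average pooling, convolution, and interpolation back to the input size."""
    h, w = len(fmap[0]), len(fmap[0][0])
    pooled = global_avg_pool(fmap)
    return upsample_constant(pointwise(pooled, weights), h, w)


if __name__ == "__main__":
    fmap = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 input channels, 3 output channels
    print(pooling_branch(fmap, weights))
```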

[0140] The following is a detailed description of the method for obtaining the scale ratio coefficients of the N second feature maps in step 103 through a specific embodiment. Specifically, it includes: for any second feature map, performing channel-wise DW convolution and point-wise PW convolution on the second feature map to obtain the corresponding feature vector; and extracting spatial information weights from the N feature vectors corresponding to the N second feature maps through a fully connected layer to obtain the N scale ratio coefficients corresponding to each of the N second feature maps.

[0141] Figure 9a shows a schematic diagram of DW convolution, and Figure 9b shows a schematic diagram of PW convolution. DW convolution integrates and extracts information within each channel, while PW convolution extracts and integrates information across channels.
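A minimal plain-Python sketch of the two operations, assuming a stride of 1 and zero padding so that the spatial size is preserved. Feature maps are nested lists indexed [channel][row][column]; all function names are hypothetical.

```python
def conv2d_same(plane, kernel):
    """Single-channel convolution, stride 1, zero padding of k // 2."""
    h, w, k = len(plane), len(plane[0]), len(kernel)
    half = k // 2
    return [
        [
            sum(
                kernel[i][j] * plane[y + i - half][x + j - half]
                for i in range(k)
                for j in range(k)
                if 0 <= y + i - half < h and 0 <= x + j - half < w
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def depthwise_conv(fmap, kernels):
    """DW convolution: each channel is filtered by its own kernel, channels stay separate."""
    return [conv2d_same(ch, ker) for ch, ker in zip(fmap, kernels)]


def pointwise_conv(fmap, weights):
    """PW convolution: a 1x1 kernel mixes channels; weights[o][c] maps channel c to output o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(wc * fmap[c][y][x] for c, wc in enumerate(row)) for x in range(w)]
         for y in range(h)]
        for row in weights
    ]


if __name__ == "__main__":
    fmap = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
    identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    dw = depthwise_conv(fmap, [identity, identity])
    print(pointwise_conv(dw, [[1.0, 1.0]]))  # sums the two channels
```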

[0142] Figure 10 illustrates the processing of five second feature maps to obtain five scale coefficients. Each second feature map undergoes DW convolution, PW convolution, and global average pooling. The global average pooling uses five kernels, reducing the number of channels to five. This yields a 1×1×5 feature vector for that second feature map. The five feature vectors are concatenated into a 1×1×25 feature vector, which is then passed through two fully connected layers to generate the final scale coefficients. The first fully connected layer has 25 neurons, and the second has 5 neurons. The final 1×1×5 scale coefficients correspond one-to-one to the five second feature maps.
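The data flow of Figure 10 from the pooled feature vectors to the scale coefficients can be sketched as follows. The random weights stand in for trained parameters, and the ReLU after the first fully connected layer and the softmax after the second are assumptions: the description above specifies only the layer sizes (25 and 5 neurons), not the activation functions. All function names are hypothetical.

```python
import math
import random


def linear(vec, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_fc(n_in, n_out, rng):
    """Placeholder parameters standing in for trained weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def scale_coefficients(vectors, fc1, fc2):
    concat = [v for vec in vectors for v in vec]          # 5 x (1x1x5) -> 1x1x25
    hidden = [max(0.0, h) for h in linear(concat, *fc1)]  # 25 neurons, ReLU assumed
    return softmax(linear(hidden, *fc2))                  # 5 neurons, softmax assumed


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [[rng.random() for _ in range(5)] for _ in range(5)]
    print(scale_coefficients(vectors, random_fc(25, 25, rng), random_fc(25, 5, rng)))
```

The softmax is chosen here only so that the five coefficients are positive and sum to 1; other normalisations would fit the description equally well.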

[0143] DW convolution can integrate and extract information within each channel, while PW convolution can extract and integrate information between channels. By using fully connected layers to extract spatial information weights from the N feature vectors corresponding to the N second feature maps, N scale coefficients are obtained. In this way, the extraction of scale information is completed with the fewest parameters.
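The parameter saving from splitting a convolution into DW and PW steps can be illustrated with a simple count. The channel numbers (2048 input channels, 5 output channels) and the 3×3 kernel are assumptions chosen for illustration, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution from c_in to c_out channels."""
    return k * k * c_in * c_out


def dw_pw_params(c_in, c_out, k):
    """Weights of a DW convolution followed by a PW convolution."""
    depthwise = k * k * c_in  # one k x k kernel per input channel
    pointwise = c_in * c_out  # one 1x1 kernel (c_in weights) per output channel
    return depthwise + pointwise


if __name__ == "__main__":
    print(standard_conv_params(2048, 5, 3))  # 92160
    print(dw_pw_params(2048, 5, 3))          # 28672
```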

[0144] The following is a detailed description of the method for obtaining the final feature map in step 104 through a specific embodiment. Specifically, it includes: for any second feature map, weighting the second feature map using the scale coefficient corresponding to the second feature map to obtain a third feature map corresponding to the second feature map; concatenating the N third feature maps corresponding to the N second feature maps, and performing convolution processing on the concatenated feature map to obtain the final feature map of the target image.

[0145] Figure 11 shows a schematic diagram of obtaining the third feature maps. The third feature map corresponding to any second feature map is obtained by multiplying that second feature map by its scale coefficient.

[0146] Figure 12 illustrates the process of obtaining the final feature map. The five third feature maps are concatenated, and then a 1×1 convolution is applied to the concatenated feature map to extract and fuse the feature information again, resulting in the final feature map. This final feature map is used in subsequent object detection processes.
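The weighting, concatenation, and 1×1 convolution steps can be sketched in plain Python as follows. Multiplication is used as the weighting operation, as in Figure 11; the function names, the nested-list layout [channel][row][column], and the example weights are illustrative assumptions.

```python
def weight_map(fmap, coeff):
    """Third feature map: the second feature map multiplied by its scale coefficient."""
    return [[[coeff * v for v in row] for row in channel] for channel in fmap]


def concat_channels(fmaps):
    """Concatenate feature maps along the channel axis."""
    return [channel for fmap in fmaps for channel in fmap]


def conv1x1(fmap, weights):
    """1x1 convolution; weights[o][c] mixes input channel c into output channel o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(wc * fmap[c][y][x] for c, wc in enumerate(row)) for x in range(w)]
         for y in range(h)]
        for row in weights
    ]


def final_feature_map(second_maps, coeffs, weights):
    third_maps = [weight_map(m, c) for m, c in zip(second_maps, coeffs)]
    return conv1x1(concat_channels(third_maps), weights)


if __name__ == "__main__":
    # Two single-channel 1x2 second feature maps with hypothetical coefficients
    second_maps = [[[[1.0, 1.0]]], [[[2.0, 2.0]]]]
    print(final_feature_map(second_maps, [0.5, 0.25], [[1.0, 1.0]]))
```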

[0147] Figure 13 shows a flowchart of object detection using the final feature map. In Faster R-CNN, the previously generated feature map is processed using 3×3 convolutions to generate a one-dimensional vector for each kernel. Two fully connected layers are then used to perform bounding box regression and foreground / background classification on this one-dimensional vector. Non-maximum suppression is applied to the bounding boxes based on their classification scores, resulting in two thousand candidate boxes. The feature map is then cropped using these candidate boxes, and ROI Pooling is used to resize the cropped feature maps to a uniform 7×7 size. Finally, the same fully connected layers are used to classify the object type and predict bounding box regression parameters based on the information contained in the feature map.
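The non-maximum suppression step can be sketched as a greedy procedure in plain Python. Boxes are (x1, y1, x2, y2) tuples; the IoU threshold of 0.7 and the function names are assumptions for illustration, while the limit of two thousand retained boxes follows the description above.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7, max_keep=2000):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == max_keep:
                break
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores, iou_threshold=0.5))  # indices of the retained boxes
```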

[0148] Based on the same technical concept, Figure 14 exemplarily shows the structure of a feature extraction apparatus provided by an embodiment of the present invention; the apparatus can perform the feature extraction process.

[0149] As shown in Figure 14, the apparatus specifically includes:

[0150] Processing unit 1401 is used for:

[0151] The target image is processed by multiple convolutional layers to obtain a first feature map; the input feature map and the output feature map of at least one of the multiple convolutional layers have the same size;

[0152] The first feature map is input into multiple first dilated convolutional layers with different dilation coefficients for convolution processing to obtain N second feature maps representing different scales;

[0153] Based on the attention mechanism, N scale ratio coefficients are determined for each of the N second feature maps; wherein, the scale ratio coefficients are used to characterize the degree of influence of the corresponding second feature map on feature extraction;

[0154] Feature extraction is performed based on the N second feature maps and the scale ratio coefficients corresponding to each of the N second feature maps to obtain the final feature map of the target image.

[0155] Optionally, the plurality of convolutional layers includes at least one second dilated convolutional layer; the input feature map and the output feature map of the second dilated convolutional layer have the same size.

[0156] Optionally, the processing unit 1401 is specifically used for:

[0157] For any second feature map, the feature vector corresponding to the second feature map is obtained by performing channel-wise DW convolution and point-wise PW convolution on the second feature map.

[0158] By using a fully connected layer to extract spatial information weights from the N feature vectors corresponding to the N second feature maps, N scale coefficients corresponding to the N second feature maps are obtained.

[0159] Optionally, the processing unit 1401 is specifically used for:

[0160] For any second feature map, the scale ratio coefficient corresponding to the second feature map is used to weight the second feature map to obtain the third feature map corresponding to the second feature map;

[0161] The N third feature maps corresponding to the N second feature maps are concatenated, and the concatenated feature maps are convolved to obtain the final feature map of the target image.

[0162] Optionally, the processing unit 1401 is specifically used for:

[0163] The first feature map is input into N-1 first dilated convolutional layers with different dilation coefficients for convolution processing to obtain N-1 second feature maps;

[0164] A second feature map is obtained by performing average pooling, convolution, and interpolation on the first feature map.

[0165] Optionally, the processing unit 1401 is specifically used for:

[0166] The target image is input into the first layer for feature extraction to obtain the fourth feature map; the first layer includes at least one non-dilated convolutional layer;

[0167] The fourth feature map is processed by using the second and third layers to obtain the fifth feature map; the second layer includes at least one first residual structure; the third layer includes at least one first residual structure; the first residual structure includes at least one non-dilated convolutional layer.

[0168] The fifth feature map is subjected to feature extraction using the fourth and fifth layers to obtain the first feature map; the fourth layer includes at least one second residual structure; the fifth layer includes at least one second residual structure; the second residual structure includes at least one second dilated convolutional layer.

[0169] Optionally, the stride of the second dilated convolutional layer is 1.

[0170] Based on the same technical concept, embodiments of this application provide a computer device. As shown in Figure 15, it includes at least one processor 1501 and a memory 1502 connected to the at least one processor. This embodiment does not limit the specific connection medium between the processor 1501 and the memory 1502; Figure 15 takes a bus connection between the processor 1501 and the memory 1502 as an example. The bus can be divided into an address bus, a data bus, a control bus, and so on.

[0171] In this embodiment of the application, the memory 1502 stores instructions that can be executed by at least one processor 1501. By executing the instructions stored in the memory 1502, at least one processor 1501 can perform the steps of the above-described feature extraction method.

[0172] The processor 1501 is the control center of the computer device, capable of connecting to various parts of the computer device via various interfaces and lines. It performs feature extraction by running or executing instructions stored in the memory 1502 and accessing data stored in the memory 1502. In some embodiments, the processor 1501 may include one or more processing units. The processor 1501 may integrate an application processor and a modem processor, wherein the application processor primarily handles the operating system, user interface, and applications, while the modem processor primarily handles wireless communication. It is understood that the modem processor may not be integrated into the processor 1501. In some embodiments, the processor 1501 and the memory 1502 may be implemented on the same chip; in some embodiments, they may also be implemented on separate chips.

[0173] Processor 1501 can be a general-purpose processor, such as a central processing unit (CPU), digital signal processor, application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, capable of implementing or executing the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly manifested as being executed by a hardware processor, or executed by a combination of hardware and software modules within the processor.

[0174] Memory 1502, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. Memory 1502 may include at least one type of storage medium, such as flash memory, hard disk, multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic storage, magnetic disk, optical disk, etc. Memory 1502 can be any other medium capable of carrying or storing desired program code in the form of instructions or data structures that can be accessed by a computer, but is not limited thereto. In the embodiments of this application, memory 1502 can also be a circuit or any other device capable of implementing storage functions for storing program instructions and / or data.

[0175] Based on the same technical concept, embodiments of the present invention also provide a computer-readable storage medium storing a computer-executable program, which is used to cause a computer to perform the feature extraction method described in any of the above embodiments.

[0176] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0177] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to this application. It should be understood that each process and/or box of the flowchart illustrations and/or block diagrams, and combinations of processes and/or boxes in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.

[0178] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.

[0179] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.

[0180] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.

Claims

1. A feature extraction method, characterized in that it comprises:
inputting a target image into a first layer of a plurality of convolutional layers for feature extraction to obtain a fourth feature map, wherein the first layer includes at least one non-dilated convolutional layer;
using a second layer and a third layer of the plurality of convolutional layers to extract features from the fourth feature map to obtain a fifth feature map, wherein the second layer includes at least one first residual structure, the third layer includes at least one first residual structure, and the first residual structure includes at least one non-dilated convolutional layer;
using a fourth layer and a fifth layer of the plurality of convolutional layers to extract features from the fifth feature map to obtain a first feature map, wherein the fourth layer includes at least one second residual structure, the fifth layer includes at least one second residual structure, the second residual structure includes at least one second dilated convolutional layer, and the input feature map and the output feature map of the at least one second dilated convolutional layer have the same size;
inputting the first feature map into a plurality of first dilated convolutional layers with different dilation coefficients, respectively, for convolution processing to obtain N second feature maps representing different scales;
determining, based on an attention mechanism, N scale ratio coefficients corresponding respectively to the N second feature maps, wherein each scale ratio coefficient characterizes the degree of influence of the corresponding second feature map on feature extraction; and
performing feature extraction based on the N second feature maps and the scale ratio coefficients corresponding respectively to the N second feature maps to obtain a final feature map of the target image.

2. The method as described in claim 1, characterized in that determining, based on the attention mechanism, the N scale ratio coefficients corresponding respectively to the N second feature maps comprises: for any second feature map, performing depthwise (DW) convolution and pointwise (PW) convolution on the second feature map to obtain a feature vector corresponding to the second feature map; and using a fully connected layer to extract spatial information weights from the N feature vectors corresponding to the N second feature maps, to obtain the N scale ratio coefficients corresponding respectively to the N second feature maps.
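As a non-limiting illustration, the coefficient computation of claim 2 can be sketched in plain Python. The claim does not state how a feature map is reduced to a vector or how the fully connected output is normalized, so the sketch assumes a depthwise kernel spanning the whole spatial extent and a softmax over the fully connected outputs. The function names, weights, and one-dimensional feature maps are hypothetical and chosen only for brevity.

```python
import math

def depthwise_pointwise_vector(fmap, dw_kernels, pw_weights):
    """fmap: C channels, each a list of L values.
    dw_kernels: one length-L kernel per channel (assumed full-extent depthwise kernel).
    pw_weights: C_out rows of C weights (pointwise 1x1 convolution)."""
    # Depthwise step: each channel is filtered by its own kernel only.
    dw = [sum(k * v for k, v in zip(kernel, channel))
          for kernel, channel in zip(dw_kernels, fmap)]
    # Pointwise step: a 1x1 convolution mixes information across channels.
    return [sum(w * d for w, d in zip(row, dw)) for row in pw_weights]

def scale_coefficients(vectors, fc_weights, fc_bias):
    """Fully connected layer over the N concatenated vectors.
    The softmax normalization is an assumption, not stated in the claim."""
    flat = [x for vec in vectors for x in vec]
    logits = [sum(w * x for w, x in zip(row, flat)) + b
              for row, b in zip(fc_weights, fc_bias)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two single-channel second feature maps (N = 2) with hypothetical values.
vec_a = depthwise_pointwise_vector([[1.0, 1.0]], [[0.5, 0.5]], [[1.0]])
vec_b = depthwise_pointwise_vector([[2.0, 2.0]], [[0.5, 0.5]], [[1.0]])
coeffs = scale_coefficients([vec_a, vec_b],
                            fc_weights=[[1.0, 0.0], [0.0, 1.0]],
                            fc_bias=[0.0, 0.0])
```

Under these assumptions the coefficients are positive and sum to 1, so each one can be read as the relative influence of its second feature map.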

3. The method as described in claim 1, characterized in that performing feature extraction based on the N second feature maps and the scale ratio coefficients corresponding respectively to the N second feature maps to obtain the final feature map of the target image comprises: for any second feature map, weighting the second feature map by the scale ratio coefficient corresponding to the second feature map to obtain a third feature map corresponding to the second feature map; and concatenating the N third feature maps corresponding to the N second feature maps, and performing convolution on the concatenated feature map to obtain the final feature map of the target image.
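As a non-limiting illustration, the weighting, concatenation, and convolution of claim 3 can be sketched in plain Python. The sketch assumes single-channel, one-dimensional feature maps and a 1x1 convolution after concatenation; the claim does not fix the kernel size, and the name `fuse` and all numeric values are hypothetical.

```python
def fuse(second_maps, coefficients, fusion_weights):
    """Weight each second feature map, concatenate, then apply a 1x1 convolution."""
    # Third feature maps: each second map scaled by its scale ratio coefficient.
    third_maps = [[c * v for v in fmap]
                  for fmap, c in zip(second_maps, coefficients)]
    # Concatenation along the channel axis: the list of N maps acts as N channels.
    # A 1x1 convolution is then a per-position weighted sum across those channels.
    length = len(third_maps[0])
    return [sum(w * channel[i] for w, channel in zip(fusion_weights, third_maps))
            for i in range(length)]

final_map = fuse(second_maps=[[1.0, 2.0], [3.0, 4.0]],
                 coefficients=[0.25, 0.75],
                 fusion_weights=[1.0, 1.0])
```

With these example values, `final_map` is `[2.5, 3.5]`: the second map contributes three times as strongly as the first because its coefficient is three times larger.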

4. The method as described in claim 1, characterized in that inputting the first feature map into the plurality of first dilated convolutional layers with different dilation coefficients for convolution processing to obtain the N second feature maps comprises: inputting the first feature map into N-1 first dilated convolutional layers with different dilation coefficients, respectively, for convolution processing to obtain N-1 second feature maps; and performing average pooling, convolution, and interpolation on the first feature map to obtain one further second feature map.
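As a non-limiting illustration, the branch structure of claim 4 can be sketched in plain Python on a one-dimensional signal. The sketch assumes stride 1, zero padding that preserves the length, global average pooling, a scalar weight standing in for the convolution of the pooled branch, and nearest-neighbor interpolation; none of these choices is stated in the claim, and the dilation coefficients 1, 2, and 3 are hypothetical.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Stride-1 dilated convolution, zero-padded so the output keeps the input length."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2  # length-preserving padding for odd kernel sizes
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j in range(k):
            # Kernel taps are spaced `dilation` samples apart.
            acc += kernel[j] * padded[i + j * dilation]
        out.append(acc)
    return out

def pooled_branch(signal, weight=1.0):
    """Global average pooling, a scalar 'convolution', then nearest-neighbor upsampling."""
    mean = sum(signal) / len(signal)
    return [weight * mean] * len(signal)

def second_feature_maps(signal, kernel, dilations):
    """N-1 dilated branches plus one pooled branch give N second feature maps."""
    maps = [dilated_conv1d(signal, kernel, d) for d in dilations]
    maps.append(pooled_branch(signal))
    return maps

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
maps = second_feature_maps(signal, kernel=[1.0, 1.0, 1.0], dilations=[1, 2, 3])
```

Here N is 4, and every branch returns a map of the same length as the input, so the maps can later be weighted and concatenated position by position.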

5. The method as described in claim 2, characterized in that the stride of the second dilated convolutional layer is 1.
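As a non-limiting illustration, the relationship between stride, dilation, and feature map size can be checked with the standard output-size formula for a convolution. The function name and the example sizes are hypothetical; the padding value is an assumption chosen so that the size is preserved.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Standard output length of a convolution along one spatial axis."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# Stride 1 with padding = dilation * (kernel - 1) / 2 keeps the size unchanged.
same_size = conv_output_size(64, kernel=3, stride=1, padding=2, dilation=2)

# The same layer with stride 2 would halve the feature map instead.
halved = conv_output_size(64, kernel=3, stride=2, padding=2, dilation=2)
```

With a stride of 1 and matching padding, a 64-wide input stays 64 wide, which is consistent with the requirement in claim 1 that the input and output feature maps of the second dilated convolutional layer have the same size.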

6. A feature extraction apparatus, characterized in that it comprises a processing unit configured to:
input a target image into a first layer of a plurality of convolutional layers for feature extraction to obtain a fourth feature map, wherein the first layer includes at least one non-dilated convolutional layer;
use a second layer and a third layer of the plurality of convolutional layers to extract features from the fourth feature map to obtain a fifth feature map, wherein the second layer includes at least one first residual structure, the third layer includes at least one first residual structure, and the first residual structure includes at least one non-dilated convolutional layer;
use a fourth layer and a fifth layer of the plurality of convolutional layers to extract features from the fifth feature map to obtain a first feature map, wherein the fourth layer includes at least one second residual structure, the fifth layer includes at least one second residual structure, the second residual structure includes at least one second dilated convolutional layer, and the input feature map and the output feature map of the at least one second dilated convolutional layer have the same size;
input the first feature map into a plurality of first dilated convolutional layers with different dilation coefficients, respectively, for convolution processing to obtain N second feature maps representing different scales;
determine, based on an attention mechanism, N scale ratio coefficients corresponding respectively to the N second feature maps, wherein each scale ratio coefficient characterizes the degree of influence of the corresponding second feature map on feature extraction; and
perform feature extraction based on the N second feature maps and the scale ratio coefficients corresponding respectively to the N second feature maps to obtain a final feature map of the target image.

7. A computing device, characterized in that it comprises: a memory configured to store a computer program; and a processor configured to invoke the computer program stored in the memory and to execute, in accordance with the obtained program, the method according to any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer-executable program for causing a computer to perform the method according to any one of claims 1 to 5.